as a sidenote, I am not able to pull the newest release, looks like it's not pushed?
"Error response from daemon: manifest for allegroai/trains:0.14.2 not found"
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
I assume, temporary fix is to switch to trains-server?
thanks! this bug and the cloning problem seem to be fixed
overwriting this value is not ideal though, because for :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
okay, so if there's no workaround atm, should I create a GitHub issue?
I use Docker for training, which means that log_dir contents are removed for the continued experiment btw
maybe I should use explicit reporting instead of Tensorboard
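for reference, this is roughly what I mean by explicit reporting (just a sketch with the trains SDK; the project/metric names and the dummy loss value are placeholders):
```
# sketch of explicit reporting with the trains SDK
# (project/metric names and the dummy loss value are placeholders)
from trains import Task

task = Task.init(project_name="my_project", task_name="explicit_reporting_test")
logger = task.get_logger()

for epoch in range(10):
    val_loss = 1.0 / (epoch + 1)  # placeholder standing in for the real metric
    # report against the epoch my training loop is actually on
    logger.report_scalar(title="val", series="loss", value=val_loss, iteration=epoch)
```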
still no luck, I tried everything =( any updates?
don't know if it's relevant, but I also added a new user to apiserver.conf today
I've already pulled the new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response, guys!
I don't think so, because the max value of each metric is calculated independently of the other metrics
so the max values that I get can be reached at different epochs
just DMed you a screenshot where you can see a part of the token
in order to use private repositories for our experiments, I add the agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existent repo, ClearML will print
fatal: repository ' https://username:token@github.com/our_organization/non_existing_repo.git/ ' not found
exposing the real token
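for context, this is roughly what that part of the agent's clearml.conf looks like (placeholder values, obviously):
```
# relevant part of clearml.conf on the agent machine (placeholder values)
agent {
    git_user: "username"
    git_pass: "personal-access-token"
}
```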
does this mean that setting initial iteration to 0 should help?
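this is what I'd try (just a sketch; continue_last_task is my guess at how the run is being resumed, and I'm assuming set_initial_iteration(0) is the knob that drops the stored offset):
```
# sketch: continue a run but keep reporting epochs as-is
# (continue_last_task is an assumption about how the run is resumed;
#  set_initial_iteration(0) is assumed to drop the stored offset)
from clearml import Task

task = Task.init(
    project_name="my_project",
    task_name="continued_training",
    continue_last_task=True,
)
task.set_initial_iteration(0)  # report iterations without the previous offset
```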
yes, this is the use case, I think we can use smth like Redis for this communication
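something like this is what I had in mind (rough sketch; the local Redis instance, queue name and payload shape are all placeholders):
```
# rough sketch of the Redis hand-off
# (local Redis, queue name and payload shape are placeholders)
import json

import redis

r = redis.Redis(host="localhost", port=6379)

# producer side: push a message describing the finished task
r.lpush("experiment_events", json.dumps({"task_id": "abc123", "status": "done"}))

# consumer side: block until a message arrives (timeout in seconds)
item = r.brpop("experiment_events", timeout=30)
if item is not None:
    _queue_name, payload = item
    print(json.loads(payload))
```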
yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
we have a bare-metal server with ClearML agents, and sometimes there are hanging containers or containers that consume too much RAM. unless I explicitly add a container name in the container arguments, it gets a random name, which is not very convenient. it would be great if we could set a default container name for each experiment (e.g., the experiment id)
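for reference, here's roughly how the explicit naming can be set from code (a sketch only; the single-string set_base_docker form and the image tag are assumptions on my part, and this has to run before the task is enqueued):
```
# sketch: pin the container name per task via the docker arguments
# (single-string set_base_docker form and the image tag are assumptions)
from clearml import Task

task = Task.init(project_name="my_project", task_name="named_container_run")
# reuse the task id as the container name so it's easy to find on the server
task.set_base_docker(
    "nvidia/cuda:10.1-runtime-ubuntu18.04 --name clearml_{}".format(task.id)
)
```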
yeah, that sounds right! thanks, will try
for me, increasing shm-size usually helps. what does this RC fix?
I updated the version in the Installed packages section before starting the experiment
LOL
wow
I was trying to find how to create a queue using the CLI
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites the folder when cloning the repo. is there any workaround?
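one workaround I'm considering (just my own idea, not something confirmed in this thread) is to register file.pkl as an artifact once and fetch a local copy at runtime, instead of reading it from the cloned repo folder:
```
# workaround idea (not confirmed in this thread): register file.pkl as an
# artifact once, then fetch a local copy at runtime instead of reading it
# from the cloned repo folder

# --- step 1: run once from a machine that has file.pkl locally ---
from trains import Task

setup_task = Task.init(project_name="my_project", task_name="register_extra_data")
setup_task.upload_artifact(name="extra_data", artifact_object="file.pkl")
setup_task.close()

# --- step 2: inside the training code ---
data_task = Task.get_task(project_name="my_project", task_name="register_extra_data")
local_path = data_task.artifacts["extra_data"].get_local_copy()
print("file.pkl available at", local_path)
```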
it also happens sometimes during the run, when Tensorboard is trying to write smth to disk and there are multiple experiments running. so it must be smth similar to the scenario you're describing, but I have no idea how it can happen, since I'm running four separate workers