DilapidatedParrot58

this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html

the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...

4 years ago

0 Yo Guys, I'm Getting

nope

4 years ago

0 Is There Any Way To Post Slack Alerts For The Frozen Experiments? (Eg, After Server Restart They Sometimes Get Stuck In Running Mode, Or

yeah, that sounds right! thanks, will try

3 years ago

0 Is There Any Way To Post Slack Alerts For The Frozen Experiments? (Eg, After Server Restart They Sometimes Get Stuck In Running Mode, Or

for me, increasing shm-size usually helps. what does this RC fix?

3 years ago

I updated the version in the Installed packages section before starting the experiment

4 years ago

0 Downloading Output Artifacts From S3 By Clicking On The Download Button Next To Model Url Was Great, But Since We Moved From AWS To Yandex.Cloud, This Feature Doesn't Work. Any Chance You Could Support Other Cloud Providers?

https://cloud.yandex.com/en-ru/docs/storage/s3/

2 years ago

0 Here I Am Again... Can't Find How To Create A Custom Queue

LOL
wow 😃
I was trying to find how to create a queue using CLI 😃

4 years ago

great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working

I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/

but trains probably rewrites the folder when cloning the repo. is there any workaround?

4 years ago

it also happens sometimes during the run when tensorboard is trying to write smth to the disk and there are multiple experiments running. so it must be smth similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers

4 years ago

that was tough, but I finally managed to get it working! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)

the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
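
one way to rule out a directory race between workers, as a rough sketch only: give every run its own log directory and create it up front before passing it to SummaryWriter. `make_run_log_dir` and the `runs` base folder are hypothetical names, and a shared log directory is only an assumption about the cause, not a confirmed diagnosis.

```python
import os
import uuid


def make_run_log_dir(base_dir="runs"):
    """Create and return a log directory that is unique to this process.

    Hypothetical helper: the name combines the PID with a random suffix, so
    workers that start at the same moment never share a directory.
    """
    run_name = f"run-{os.getpid()}-{uuid.uuid4().hex[:8]}"
    log_dir = os.path.join(base_dir, run_name)
    # exist_ok=True keeps the call safe if another worker creates base_dir first
    os.makedirs(log_dir, exist_ok=True)
    return log_dir


# Usage (needs torch with tensorboard installed):
# from torch.utils.tensorboard import SummaryWriter
# writer = SummaryWriter(log_dir=make_run_log_dir())
```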

4 years ago

0 Any Chance StorageManager Could Re-Download Files Only If Their Size Is Different From File In Cache (As An Option)?

yes

3 years ago

0 Hi

AnxiousSeal95 yeah, got it! thanks!

3 years ago

thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager

4 years ago

0 Hi

wow, thanks, just updated our server!
can't seem to find these metrics snapshot plots =) how do I plot one?

3 years ago

0 Hey Guys, Do You Have Any Plans To Add Functionality To Export Training Config With All Hyperparameters To The Different Formats, Such As Training Command Line Command, Yaml, Etc.?

not necessarily, the command usually stays the same irrespective of the machine

4 years ago

0 Hey Guys, I Keep Getting "Failed Parsing Task Parameter" Warning For The Arguments Such As This One:

done!

3 years ago

0 Hey Guys, Thanks For Creating Slack Workspace, That's Really Cool. Question - Are We Missing Smth Or Is It Currently Not Possible To Pass S3 Credentials Via Env Variables? We Forked Trains And Added A Simple Fix (

awesome!

4 years ago

0 When We Train The Models, We Often Choose Checkpoint Based On The Validation Accuracy, But Test Set Accuracy (Or Specific Class Validation Accuracy) Is Not Necessarily The Best For This Checkpoint. Right Now There Are Options To Add Columns With Max And L

exactly

3 years ago

so the max values that I get can be reached at different epochs

3 years ago

0 Is It Possible To Pass Custom

right now we can pass github secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos

we need a way to pass secrets to access our database with annotations

2 years ago

0 Yo Guys, I'm Getting

works like a charm! you guys are the best, as always =)

4 years ago

I guess this could overcomplicate the UI, I don't see a good solution yet.

as a quick hack, we can just use a separate name (e.g. "best_val_roc_auc") for all metric values for the current best checkpoint. then we can just add columns with the last value of this metric
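
a minimal sketch of that hack, assuming the checkpoint is selected by a metric called `val_roc_auc`: whenever the selection metric improves, all metrics of that epoch are copied under a `best_` prefix. `BestCheckpointMetrics` and the metric names and values are hypothetical, and the actual reporting call is left to whatever logger is in use.

```python
class BestCheckpointMetrics:
    """Remember every metric value from the epoch with the best selection metric."""

    def __init__(self, select_by="val_roc_auc"):
        self.select_by = select_by
        self.best = {}

    def update(self, metrics):
        """Return the 'best_*' snapshot, refreshed if this epoch is the new best."""
        current = metrics[self.select_by]
        previous = self.best.get("best_" + self.select_by)
        if previous is None or current > previous:
            self.best = {"best_" + name: value for name, value in metrics.items()}
        return dict(self.best)


tracker = BestCheckpointMetrics()
epochs = [
    {"val_roc_auc": 0.80, "test_roc_auc": 0.78},
    {"val_roc_auc": 0.85, "test_roc_auc": 0.76},
    {"val_roc_auc": 0.83, "test_roc_auc": 0.81},
]
for metrics in epochs:
    snapshot = tracker.update(metrics)
    # report every value in `snapshot` as its own scalar after each epoch
```

with these made-up numbers, the last reported `best_test_roc_auc` is 0.76 (the test score at the best validation epoch), while the independent max of `test_roc_auc` would be 0.81.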

3 years ago

I don't think so, because the max value of each metric is calculated independently of other metrics

3 years ago

0 Two Annoying Visual Bugs In ClearML Server UI After Latest Update:

nice, thanks for the info

2 years ago

0 Hey Guys, I Keep Getting

well, the server wouldn't work without them?

3 years ago
