oh wow, I didn't see the delete_artifacts_and_models option
I guess we'll have to manually find old artifacts related to already-deleted tasks (something like the sketch below)
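this rough sketch is roughly what I have in mind, assuming our artifacts live on S3, boto3 is available, and the 32-char task ID appears somewhere in the object key (the bucket name is made up):

import re
import boto3
from clearml import Task

s3 = boto3.client("s3")
id_pattern = re.compile(r"([0-9a-f]{32})")  # ClearML task IDs are 32 hex chars

for page in s3.get_paginator("list_objects_v2").paginate(Bucket="my-clearml-bucket"):
    for obj in page.get("Contents", []):
        match = id_pattern.search(obj["Key"])
        if not match:
            continue
        try:
            Task.get_task(task_id=match.group(1))
        except Exception:
            # the task no longer exists, so the artifact is a deletion candidate
            print("orphaned:", obj["Key"])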
we already have the cleanup service set up and running, so we should be good from now on
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them up? And what is the recommended way of providing S3 credentials to the cleanup task?
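for reference, I'm assuming the credentials go into the clearml.conf of the machine running the cleanup task, roughly like this (key, secret, and region are placeholders), but please correct me if that's not what the task picks up:

sdk {
    aws {
        s3 {
            key: "MY_ACCESS_KEY"
            secret: "MY_SECRET_KEY"
            region: "us-east-1"
        }
    }
}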
we're using the latest versions of clearml, clearml-agent, and clearml-server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess 😃
I guess this could overcomplicate the UI; I don't see a good solution yet.
as a quick hack, we can just use a separate name (e.g. "best_val_roc_auc") for all metric values of the current best checkpoint. then we can just add columns with the last value of this metric
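a minimal sketch of the hack (the project/task names and values are just examples):

from clearml import Task

task = Task.init(project_name="examples", task_name="best-checkpoint-hack")
logger = task.get_logger()

best_roc_auc = 0.0
for epoch, val_roc_auc in enumerate([0.71, 0.78, 0.75, 0.83]):  # dummy values
    if val_roc_auc > best_roc_auc:
        best_roc_auc = val_roc_auc
        # re-report the best-so-far value under a dedicated name, so the
        # "last value" column in the UI always reflects the current best checkpoint
        logger.report_scalar(title="best_val_roc_auc", series="val",
                             value=best_roc_auc, iteration=epoch)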
not necessarily, there are rare cases when the container keeps running after the experiment is stopped or aborted
will do!
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
thanks for the link and the advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
I don't think so, because the max value of each metric is calculated independently of the other metrics
so the max values that I get can be reached at different epochs
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
I guess I could manually explore the different containers and their contents 😃 as far as I remember, I had to update the Elastic records when we moved to the new cloud provider in order to update the model URLs
sorry, my bad, after some fiddling I made it work. I have to manually change HTTP to HTTPS in the config file for the Web and Files (not API) servers after initialization, but besides that it works
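roughly what the edited section of clearml.conf looks like, with made-up hostnames:

api {
    # only web_server and files_server needed the http -> https change
    api_server: http://api.clearml.example.com:8008
    web_server: https://app.clearml.example.com
    files_server: https://files.clearml.example.com
}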
I change the arguments in the Web UI, but it looks like they are not parsed by trains
what if the cleanup service is launched using the ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
right now we can pass GitHub secrets to the clearml-agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations as well
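one idea I'm considering, assuming extra_docker_arguments in the agent section of clearml.conf behaves the way I think it does (the variable name and value are made up):

agent {
    # forward the secret into every training container the agent spins up
    extra_docker_arguments: ["-e", "ANNOTATIONS_DB_PASSWORD=changeme"]
}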
that was tough, but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
that's right
for example, there are tasks A, B, C
we run multiple experiments for A, finetune some of them in separate tasks, then choose one or more best checkpoints, run some experiments for task B, choose the best experiment, and finally run task C
so we get a chain of tasks: A → A-ft → B → C
a ClearML pipeline doesn't quite work here because we would like to analyze the results of each step before starting the next task
but it would be great to see the predecessors of each experiment in the chain (one way I imagine recording that is sketched below)
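a minimal sketch of what I mean, assuming we pass the predecessor's task ID around ourselves (the names and ID here are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="task-B-experiment")
# ID of the chosen best A-ft run, supplied by us (placeholder value)
predecessor_id = "0123456789abcdef0123456789abcdef"
task.set_parent(predecessor_id)  # the predecessor then shows up as the task's parent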
Error 12 : Validation error (value '['13b46b9325954517ab99381d5f45237d', 'bc76c3a7f0f6431b8e064212e9bdd2c0', '5d2a57cd39b94250b8c8f52303ccef92', 'e4731ee5b33e41d992d6d3fdb2913045', '698d9231155e41fbb61f8f3faa605727', '2171b190507f40d1be35e222045c58ea', '55c81a5db0ad40bebf72fdcc1b3be2a4', '94fbdbe26ef242d793e18d955cb3de58', '7d8a6c8f2ae246478b39ae5e87def2ad', '141594c146fe495886d477d9a27c465f', '640f87b02dc94a4098a0aba4d855b8f5']' length is bigger than allowed maximum '10'.)
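a sketch of a possible workaround, assuming the cap applies to the length of the ID list sent per call:

from clearml import Task

def get_tasks_chunked(task_ids, chunk_size=10):
    # query the server in batches small enough for the server-side validator
    tasks = []
    for i in range(0, len(task_ids), chunk_size):
        tasks.extend(Task.get_tasks(task_ids=task_ids[i:i + chunk_size]))
    return tasks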
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task-resuming script (roughly the sketch below), but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
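for context, the resuming script does roughly this, slightly simplified (continue_last_task is my reading of how the resume mechanism works; the names are placeholders):

from clearml import Task

# reuse the previous task instead of creating a new one
task = Task.init(project_name="examples", task_name="resumed-training",
                 continue_last_task=True)
# offset the iteration counter so new reports continue where the task left off
task.set_initial_iteration(task.get_last_iteration())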
well okay, it's probably not that weird considering that the worker just runs the container
still no luck, I tried everything =( any updates?