the original task name contains a double space -> the saved checkpoint path also contains a double space -> the MODEL URL field in the model description of this checkpoint in ClearML converts the double space into a single space, so when you copy & paste it somewhere, it'll be incorrect
sounds like overkill for this problem, but I don’t see any other pretty solution 😃
Requirement already satisfied (use --upgrade to upgrade): celsusutils==0.0.1
yeah, server (1.0.0) and client (1.0.1)
we often do ablation studies with more than 50 experiments, and it was very convenient to compare their dynamics at different epochs
we already have a cleanup service set up and running, so we should be good from now on
what if the cleanup service is launched using the ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them up? And what is the recommended way of providing S3 credentials to the cleanup task?
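for context, here is a minimal sketch of the S3 section I'd expect to go into clearml.conf (the sdk.aws.s3 block; the bucket name, region, and credential values are placeholders, and I'm assuming the cleanup task picks up the standard config file):
sdk {
    aws {
        s3 {
            # default credentials, used when no bucket-specific entry matches
            key: "AWS_ACCESS_KEY_ID"
            secret: "AWS_SECRET_ACCESS_KEY"
            region: "us-east-1"
            credentials: [
                {
                    # per-bucket override (bucket name is a placeholder)
                    bucket: "my-checkpoint-bucket"
                    key: "AWS_ACCESS_KEY_ID"
                    secret: "AWS_SECRET_ACCESS_KEY"
                }
            ]
        }
    }
}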
oh wow, I didn't see the delete_artifacts_and_models option
I guess we'll have to manually find old artifacts that are related to already deleted tasks
yeah, I was thinking mainly about AWS. we use force to make sure we are using the correct latest checkpoint, but this increases costs when we are running a lot of experiments
[2020-06-09 16:03:19,851] [8] [ERROR] [trains.mongo.initialize] Failed creating fixed user John Doe: 'key'
{
    username: "username"
    password: "password"
    name: "John Doe"
},
example of the failed experiment
not sure what you mean. I used to do task.set_initial_iteration(task.get_last_iteration()) in the task resuming script, but in the training code I explicitly pass global_step=epoch to the TensorBoard writer
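roughly what I mean, as a simplified sketch (imports shown for the clearml package; project/task names, the epoch range, and the loss value are placeholders, and I'm assuming Task.init's continue_last_task flag for reattaching to the task):
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# reattach to the previous task and keep reporting into it
# (project/task names are placeholders)
task = Task.init(
    project_name="ablations",
    task_name="resnet50 baseline",
    continue_last_task=True,
)

# shift the iteration counter so auto-reported series (:monitor:gpu etc.)
# continue from where the previous run stopped
task.set_initial_iteration(task.get_last_iteration())

writer = SummaryWriter(log_dir="runs/resume")
start_epoch, num_epochs = task.get_last_iteration() + 1, 100  # illustrative values

for epoch in range(start_epoch, num_epochs):
    loss = 1.0 / (epoch + 1)  # stand-in for the real training loss
    # the step is passed explicitly here, which is where the unexpected offset shows up
    writer.add_scalar("train/loss", loss, global_step=epoch)

writer.close()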
there is no method for setting the last iteration, which is used for reporting when continuing the same task. maybe I could somehow change this value for the task?
I've done it many times, using different devices. sometimes it works, sometimes it doesn't
LOL
wow 😃
I was trying to figure out how to create a queue using the CLI 😃
perhaps it’s happening because it’s an old project that was moved to the new root project?
overwriting this value is not ideal though, because for the :monitor:gpu and :monitor:machine values I would like to continue from the latest iteration
but for the metrics, I explicitly pass the epoch number that my training is currently on. it's kind of weird that it adds an offset to values that are explicitly reported, no?
okay, so if there’s no workaround atm, should I create a GitHub issue?
I've already pulled the new images from trains-server, let's see if the initial issue occurs again. thanks for the fast response guys!
not necessarily, there are rare cases when the container keeps running after the experiment is stopped or aborted
will do!
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
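for the record, this is roughly how the change looks (a sketch of the relevant part of trains.conf; I'm assuming agent.package_manager.extra_index_url is the right place for the URL):
agent {
    package_manager {
        type: pip
        # extra lookup location for the nightly CUDA 10.1 PyTorch wheels
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
    }
}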
I now have another problem: my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably rewrites the folder when cloning the repo. is there any workaround?
I decided to restart the containers one more time; this is what I got.
I had to restart Docker service to remove the containers