yeah, we've used pipelines in other scenarios. might be a good fit here. thanks!
we have a bare-metal server with ClearML agents, and sometimes there are hanging containers or containers that consume too much RAM. unless I explicitly add a container name in the container arguments, the container gets a random name, which is not very convenient. it would be great if we could set a default container name for each experiment (e.g., the experiment ID)
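in the meantime the only workaround I can think of is pushing an explicit --name through the container arguments, roughly like this (just a sketch, assuming a clearml SDK where set_base_docker accepts docker_image/docker_arguments; the image name is only an example):
```python
from clearml import Task

# project/task names are placeholders
task = Task.init(project_name="my_project", task_name="my_experiment")

# give the container an identifiable name derived from the task id,
# passed through the container arguments (sketch only – assumes the agent
# doesn't set --name on its own)
task.set_base_docker(
    docker_image="nvidia/cuda:11.1.1-runtime-ubuntu20.04",
    docker_arguments="--name clearml_{}".format(task.id),
)
```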
standalone-mode gives me "Could not freeze installed packages"
nice! exactly what I need, thank you!
it prints an empty dict
I’m calling Task.init() in the script; maybe it somehow resets the connected parameters… but it used to work before, weird
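here’s roughly what the script does, stripped down to the relevant part (names and values are placeholders):
```python
from clearml import Task

# simplified version of the script (project/task names are placeholders)
task = Task.init(project_name="my_project", task_name="debug_connect")

params = {"lr": 0.001, "batch_size": 32}
params = task.connect(params)  # connect the dict so it shows up under hyperparameters

print(params)  # this is the print that comes back as an empty dict
```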
ValueError: Task has no hyperparams section defined
nope, same problem even after creating a new experiment from scratch
weird
this is what I got in installed packages without adding the direct link:
torch==1.6.0.dev20200430+cu101
torchvision==0.7.0.dev20200430+cu101
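as a workaround I guess the requirement can be pinned explicitly before Task.init(), something like this (a rough sketch with Task.add_requirements; the version specs are just examples, not the actual nightly builds):
```python
from clearml import Task

# force explicit requirements; must be called before Task.init()
# (version specs are examples only, not the real +cu101 nightly pins)
Task.add_requirements("torch", ">=1.6")
Task.add_requirements("torchvision", ">=0.7")

task = Task.init(project_name="my_project", task_name="nightly_torch_run")
```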
I decided to restart the containers one more time; this is what I got.
I had to restart Docker service to remove the containers
not necessarily, there are rare cases when a container keeps running after the experiment is stopped or aborted
will do!
yeah, backups take much longer, and we had to increase our EC2 instance volume size twice because of these indices
got it, thanks, will try to delete older ones
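the cleanup I have in mind is roughly this (a sketch with the elasticsearch python client; the host and index name are placeholders, and I’ll double-check which indices are actually safe to drop):
```python
from elasticsearch import Elasticsearch

# placeholder host – the ES instance bundled with the trains/clearml server
es = Elasticsearch("http://localhost:9200")

# list indices with their sizes to spot the big/old ones
for idx in es.cat.indices(format="json"):
    print(idx["index"], idx["store.size"])

# delete a specific old index (the name here is a placeholder!)
es.indices.delete(index="events-old-index-name")
```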
hmmm allegroai/trains:latest whatever it is
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
yeah, I was thinking mainly about AWS. we use force to make sure we are using the correct latest checkpoint, but this increases costs when we are running a lot of experiments
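if we’re talking about the same thing, i.e. a force_download-style flag, the pattern is roughly this (a sketch with StorageManager.get_local_copy; the bucket/path is a placeholder):
```python
from clearml import StorageManager

# placeholder S3 path – force_download skips the local cache and re-downloads,
# which guarantees the latest checkpoint but adds S3 transfer cost per run
checkpoint_path = StorageManager.get_local_copy(
    remote_url="s3://my-bucket/checkpoints/model_latest.pt",
    force_download=True,
)
```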
we’re using latest ClearML server and client version (1.2.0)
it will probably screw up my resource monitoring plots, but well, who cares 😃
perhaps it’s happening because it’s an old project that was moved to the new root project?
I'll get back to you with the logs when the problem occurs again
I change the arguments in the Web UI, but it looks like they are not picked up by trains
same here, changing arguments in the Args section of Hyperparameters doesn’t work; the training script starts with the default values.
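roughly the pattern the scripts follow (simplified, the actual arguments differ):
```python
import argparse
from clearml import Task

# simplified – the real scripts have many more arguments
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=10)

# Task.init() is called before parse_args(), so trains should hook argparse automatically
task = Task.init(project_name="my_project", task_name="training")

args = parser.parse_args()
print(args.lr, args.epochs)  # when run by the agent these still come out as the defaults
```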
trains 0.16.0
trains-agent 0.16.0
trains-server 0.16.0
this would definitely be a nice addition. the number of hyperparameters in our models often goes up to 100
copy-pasting the entire training command into the command line 😃
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news 👍
we're using the latest version of clearml, clearml agent and clearml server, but we've been using trains/clearml for 2.5 years, so there are some old tasks left, I guess 😃
the weird part is that the old job continues running when I recreate the worker and enqueue the new job