I don't think so, because the max value of each metric is calculated independently of the other metrics
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
this would be great. I could then just pass it as a hyperparameter
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
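for anyone reading this later, the entry I mean is roughly this (just a sketch of my trains.conf, assuming the agent.package_manager.extra_index_url list is what picks it up):

    agent {
      package_manager {
        # extra package index the agent passes to pip when building the task venv
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
      }
    }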
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites that folder when cloning the repo. is there any workaround?
task = Task.get_task(task_id=args.task_id)
task.mark_started()
# point the cloned task at the new checkpoint via its General hyperparameter section
task.set_parameters_as_dict({"General": {"checkpoint_file": model.url, "restart_optimizer": False}})
# reset the reported iteration counter before re-enqueueing
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
maybe I should use explicit reporting instead of TensorBoard
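something like this is what I have in mind (just a sketch, assuming Task.init() was already called; the metric name and values are made up):

    from clearml import Logger  # was "from trains import Logger" before the rename

    num_epochs = 10                                        # placeholder values for illustration
    val_accuracies = [0.5 + 0.03 * e for e in range(num_epochs)]

    logger = Logger.current_logger()
    for epoch in range(num_epochs):
        # explicit scalar report instead of relying on the TensorBoard auto-logging
        logger.report_scalar(title="val_accuracy", series="top1",
                             value=val_accuracies[epoch], iteration=epoch)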
new icons are slick, it would be even better if you could upload custom icons for the different projects
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out-of-memory error, but still
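to be concrete, what I'd try (my own assumption of how the sharing would be set up, not something I've verified) is simply starting two agent daemons pinned to the same GPU and the same queue:

    # two workers pulling from the same queue, both pinned to GPU 0 (queue name is made up)
    clearml-agent daemon --queue big_gpu_queue --gpus 0 &
    clearml-agent daemon --queue big_gpu_queue --gpus 0 &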
so the max values that I get can be reached at different epochs
perhaps I need to do task.set_initial_iteration(0)?
sounds like overkill for this problem, but I don't see any prettier solution 🙂
thanks! we copy S3 URLs quite often. I know that it's better to avoid double spaces in task names, but shit happens 🙂
we do log a lot of different metrics, maybe this is part of the problem
okay, what do I do if it IS installed?
does this mean that setting initial iteration to 0 should help?
just DMed you a screenshot where you can see a part of the token
ValueError: Task has no hyperparams section defined
problem solved. I had to replace /opt/trains/data/fileserver with /opt/clearml/data/fileserver in the Agent configuration, and replace trains with clearml in the Requirements
for me, increasing shm-size usually helps. what does this RC fix?
yes, this is the use case. I think we could use something like Redis for this communication
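roughly what I'm imagining (a minimal sketch with redis-py; the host, key, and checkpoint URL are made up, and both tasks would need network access to the same Redis instance):

    import redis

    r = redis.Redis(host="redis.internal", port=6379)  # hypothetical shared instance

    # producer task: push the newest checkpoint URL
    r.rpush("checkpoint_urls", "s3://bucket/path/model_42.pt")

    # consumer task: block until a new checkpoint shows up
    _key, checkpoint_url = r.blpop("checkpoint_urls")
    print(checkpoint_url.decode())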
thanks, this one worked after we changed the package version
on a side note, is there any way to automatically give more meaningful names to the running Docker containers?
not necessarily, there are rare cases when a container keeps running after the experiment is stopped or aborted
will do!
we have a bare-metal server with ClearML agents, and sometimes there are hanging containers or containers that consume too much RAM. unless I explicitly add a container name in the container arguments, the container gets a random name, which is not very convenient. it would be great if we could set a default container name for each experiment (e.g., the experiment ID)
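to make the ask concrete, today the workaround is to type something like this into each experiment's container arguments by hand (the name here is made up):

    --name clearml_resnet50_baseline

and what I'm asking for is that this defaults to e.g. the experiment ID automatically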