great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
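for anyone searching later, the bit I added to trains.conf looks roughly like this (going from memory, so double-check the exact key name in your config):
agent {
  package_manager {
    extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
  }
}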
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably rewrites the folder when cloning the repo. is there any workaround?
# fetch the existing task, update its parameters, and re-enqueue it
task = Task.get_task(task_id=args.task_id)
task.mark_started()
task.set_parameters_as_dict({"General": {"checkpoint_file": model.url, "restart_optimizer": False}})
task.set_initial_iteration(0)
task.mark_stopped()
Task.enqueue(task=task, queue_name=task.data.execution.queue)
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
maybe I should use explicit reporting instead of TensorBoard
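a rough sketch of what I mean, assuming the clearml/trains Logger API (the title, series, and variables here are just placeholders):
from clearml import Task
logger = Task.current_task().get_logger()
# report one scalar point per metric per epoch instead of going through the TensorBoard bridge
logger.report_scalar(title="val", series="accuracy", value=val_acc, iteration=epoch)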
the new icons are slick, it would be even better if you could upload custom icons for different projects
our GPUs have 48 GB each, so it's quite wasteful to run only one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out-of-memory error, but still
so the max values that I get can be reached at different epochs
perhaps I need to do task.set_initial_iteration(0)?
thanks! we copy S3 URLs quite often. I know that it's better to avoid double spaces in task names, but shit happens =)
we do log a lot of different metrics, maybe this is part of the problem
okay, what do I do if it IS installed?
does this mean that setting initial iteration to 0 should help?
just DMed you a screenshot where you can see a part of the token
ValueError: Task has no hyperparams section defined
the problem is solved. I had to replace /opt/trains/data/fileserver with /opt/clearml/data/fileserver in the Agent configuration, and replace trains with clearml in the Requirements
for me, increasing shm-size usually helps. what does this RC fix?
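(we usually bump it via the docker arguments; I think the agent config key is extra_docker_arguments, but double-check the name in clearml.conf. something like:
agent {
  extra_docker_arguments: ["--shm-size=8g"]
}
where 8g is just an example value that works for our workloads)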
yes, this is the use case, I think we can use something like Redis for this communication
thanks, this one worked after we changed the package version
on a side note, is there any way to automatically give more meaningful names to the running docker containers?
not necessarily, there are rare cases when a container keeps running after the experiment is stopped or aborted
will do!
we have a bare-metal server with ClearML agents, and sometimes there are hanging containers or containers that consume too much RAM. unless I explicitly add a container name in the container arguments, it gets a random name, which is not very convenient. it would be great if we could set a default container name for each experiment (e.g., the experiment id)
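(when I do set it manually, it's just the plain docker flag in the container arguments, e.g. --name exp_1234, with exp_1234 being whatever label I pick by hand)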
standalone-mode gives me "Could not freeze installed packages"
that was tough, but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
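as a possible workaround on my side, I'm considering creating the log dir explicitly before constructing the writer, roughly like this (the path is just an example):
import os
from torch.utils.tensorboard import SummaryWriter
log_dir = "runs/my_experiment"  # example path
os.makedirs(log_dir, exist_ok=True)  # make sure the directory exists before SummaryWriter touches it
writer = SummaryWriter(log_dir=log_dir)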
well okay, it's probably not that weird considering that the worker just runs the container
after the very first click, there is a popup requesting credentials. nothing happens after that
we often do ablation studies with more than 50 experiments, and it was very convenient to compare their dynamics at different epochs