Reputation
Badges 1
186 × Eureka!I added the link just in case anyway ๐
also, is there any way to install a repo that we clone as a package. we often use absolute imports and do "pip install -e ." to utilize it
sorry there are so many questions, we just really want to migrate to trains-agent)
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
original task name contains double space -> saved checkpoint also contains double space -> MODEL URL field in model description of this checkpoint in ClearML converts double space into single space. so when you copy & paste it somewhere, it'll be incorrect
what if cleanup service is launched using ClearML-Agent Services container (part of the ClearML server)? adding clearml.conf to the home directory doesn't help
sounds like an overkill for this problem, but I donโt see any other pretty solution ๐
oh wow, I didn't see delete_artifacts_and_models option
I guess we'll have to manually find old artifacts that are related to already deleted tasks
thanks! we copy S3 URLs quite often. I know that itโs better to avoid double spaces in task names, but shit happens ๐
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them? What is the recommended way of providing S3 credentials to cleanup task?
we already have cleanup service set up and running, so we should be good from now on
itโs a pretty standard pytorch train/eval loop, using pytorch dataloader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
the code that is used for training the model is also inside the image
I donโt connect anything explicitly, Iโm using argparse, it used to work before the update
yeah, it works for the new projects and for the old projects that have already had a description
this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
yeah, that sounds right! thanks, will try
for me, increasing shm-size usually helps. what does this RC fix?
I updated the version in the Installed packages section before starting the experiment
LOL
wow ๐
I was trying to find how to create a queue using CLI ๐
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably rewrites the folder when cloning the repo. is there any workaround?
it also happens sometimes during the run when tensorboard is trying to write smth to the disk and there are multiple experiments running. so it must be smth similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
that was tough but I finally manage to make it working! thanks a lot for your help, I definitely wouldn't be able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors in the beginning of the runs, especiialy when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWrite
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...