we've already restarted everything, so I don't have any logs on hand right now. I'll let you know if we face any problems. the slack bot works though!
two more questions about cleanup if you don't mind:
what if for some old tasks I get WARNING:root:Could not delete Task ID=a0908784a2a942c3812f947ec1f32c9f, 'Task' object has no attribute 'delete'? What's the best way of cleaning them up? What is the recommended way of providing S3 credentials to the cleanup task?
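to be concrete about the credentials question, I mean something like the usual S3 block in clearml.conf (a sketch with placeholder values; I'm not sure this is the right place for the cleanup task):
sdk {
    aws {
        s3 {
            key: "my-access-key"      # placeholder
            secret: "my-secret-key"   # placeholder
            region: "us-east-1"
        }
    }
}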
JIC - trains still works after that, it's just that the new user is not added and hence is not able to log in
right now we can pass GitHub secrets to the clearml agent training containers (CLEARML_AGENT_GIT_PASS) to install private repos
we need a way to pass secrets to access our database with annotations
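something like extra env vars on the agent's training containers would work for us. a sketch of what I mean in clearml.conf (the variable name is made up):
agent {
    # made-up variable, just to illustrate passing a DB secret into every training container
    extra_docker_arguments: ["-e", "ANNOTATIONS_DB_PASSWORD=supersecret"]
}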
I use the following Base Docker image setting: my-docker-hub/my-repo:latest -v /data/project/data:/data
the code that is used for training the model is also inside the image
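if it helps, I believe the same setting can also be set from code, roughly like this (project/task names are placeholders):
from clearml import Task

task = Task.init(project_name="my_project", task_name="train")
# the same string as in the UI "Base Docker image" field: image plus extra docker arguments
task.set_base_docker("my-docker-hub/my-repo:latest -v /data/project/data:/data")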
thanks for the link advice, will do
I'll let you know if I manage to achieve my goals with StorageManager
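roughly what I'm planning to try (the bucket and path are just placeholders):
from clearml import StorageManager

# download (and cache) the heavy file from shared storage instead of baking it into the image
local_path = StorageManager.get_local_copy(remote_url="s3://my-bucket/extra_data/file.pkl")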
that was tough but I finally managed to make it work! thanks a lot for your help, I definitely wouldn't have been able to do it without you =)
the only problem that I still encounter is that sometimes there are random errors at the beginning of the runs, especially when I enqueue multiple experiments at the same time (I have 4 workers for 4 GPUs).
for example, this
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter()
sometimes randomly leads to FileNotFoundError: [Errno...
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
awesome news!
no, I even added the argument to specify tensorboard log_dir to make sure this is not happening
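this is what I mean by specifying the log_dir, a sketch (the exact path is made up):
from torch.utils.tensorboard import SummaryWriter
from clearml import Task

task = Task.init(project_name="my_project", task_name="train")
# every run writes to its own directory, so two workers should never touch the same files
writer = SummaryWriter(log_dir=f"runs/{task.id}")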
parents and children. maybe tags, maybe a separate tab or section, idk. I wonder if anyone else is interested in this functionality; for us this is a very common case
Requirement already satisfied (use --upgrade to upgrade): celsusutils==0.0.1
it also happens sometimes during the run, when tensorboard is trying to write something to the disk and there are multiple experiments running. so it must be something similar to the scenario you're describing, but I have no idea how it can happen since I'm running four separate workers
thanks, this one worked after we changed the package version
in order to use private repositories for our experiments I add agent.git_user and agent.git_pass options to clearml.conf when launching agents
if someone accidentally tries to launch an experiment from a non-existent repo, ClearML will print
fatal: repository 'https://username:token@github.com/our_organization/non_existing_repo.git/' not found
exposing the real token
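for reference, the relevant part of our clearml.conf looks roughly like this (values are placeholders):
agent {
    git_user: "username"
    # personal access token; this is the value that ends up in the printed git error
    git_pass: "token"
}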
this is the artifactory, this is how I install these packages in the Docker image:
pip3 install --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
the files are used for training and evaluation (e.g., precomputed pycocotools meta-info). I could theoretically include them in the repo, but some of them might be quite heavy. what do you mean when you say that they get lost? I copy them from the host machine when I build the custom image, so they are i...
great, this helped, thanks! I simply added https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html to trains.conf, and it seems to be working
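from memory, the section I touched was something like this (not 100% sure about the exact key name):
agent {
    package_manager {
        # extra index / find-links URLs the agent uses when resolving packages
        extra_index_url: ["https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html"]
    }
}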
I now have another problem, my code is looking for some additional files in the root folder of the project. I tried adding a Docker layer:
ADD file.pkl /root/.trains/venvs-builds/3.6/task_repository/project.git/extra_data/
but trains probably overwrites the folder when cloning the repo. is there any workaround?
it's a pretty standard pytorch train/eval loop, using pytorch dataloader and https://docs.monai.io/en/stable/_modules/monai/data/dataset.html
well okay, it's probably not that weird considering that the worker just runs the container
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory, but still
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
thanks! I need to read all parts of the documentation really carefully =) for some reason, I couldn't find this section
is it in the documentation somewhere?