
So get_registered_artifacts() only works for dynamic artifacts, right? I am looking for a download_artifacts() that lets me retrieve the static artifacts of a Task
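Something like this is what I am after (a minimal sketch assuming the current clearml SDK; the task id and artifact name are placeholders):

```python
from clearml import Task

# fetch the task that holds the artifacts (placeholder id)
task = Task.get_task(task_id="<task_id>")

# task.artifacts lists all artifacts of the task, static ones included;
# get_local_copy() downloads the artifact and returns the local path
local_path = task.artifacts["my_artifact"].get_local_copy()
print(local_path)
```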
Yes, I am preparing them 🙂
Which commit corresponds to the RC version? So far we tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)
The rest of the configuration is set with env variables
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. It is still relevant, and I collected the following from an EC2 instance running a clearml-agent with a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, many processes forked from the main training process are still around. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
both are repos of Python modules (one for the experiment itself and one for a dependency of the experiment)
Thanks, the message is not logged in the GCloud instance logs when using startup scripts, which is why I did not see it 🙂
Yes, that was my assumption as well; there could be several causes to be honest, now that I see that matplotlib itself is also leaking 🙂
I also would like to avoid any copy of these artifacts on s3 (to avoid double costs, since some folders might be big)
I mean that I have a taskA (controller) that is in charge of creating a taskB with the same argv parameters (I just change the entry point of taskB)
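One way to do it, as a rough sketch (assuming a recent clearml SDK with Task.clone / set_script / enqueue; the entry point and queue name below are placeholders):

```python
from clearml import Task

# taskA is the controller; cloning keeps the same parameters (argv)
controller = Task.current_task()
task_b = Task.clone(source_task=controller, name="taskB")

# only the entry point changes for taskB (placeholder script name)
task_b.set_script(entry_point="train_b.py")

# hand taskB over to an agent queue for execution
Task.enqueue(task_b, queue_name="default")
```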
Hi AgitatedDove14, coming back after a few experiments this morning:
Indeed torch 1.3.1 does not support this CUDA version. I tried with 1.7.0 and it worked, BUT trains was not able to pick the right wheel when I updated the torch requirement from 1.3.1 to 1.7.0: it downloaded the wheel for CUDA 10.1, even though in the experiment log the agent correctly reported the CUDA version (11.1). I then replaced torch==1.7.0 with the direct https link to the torch wheel for CUDA 11.0, and that worked (I also tried specifyin...
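In the experiment's installed packages the change was essentially this (the exact wheel URL depends on the Python/platform tags; the cp38 link below is just an assumed example):

```
# before
torch==1.7.0

# after: direct wheel link for torch 1.7.0 built against CUDA 11.0
https://download.pytorch.org/whl/cu110/torch-1.7.0%2Bcu110-cp38-cp38-linux_x86_64.whl
```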
Still investigating, task.data.last_iteration is correct (equal to engine.state["iteration"]) when I resume the training
Here is the data disk (/opt/clearml) on the left, and the OS disk on the right
SuccessfulKoala55 I found the issue thanks to you: I changed the domain a bit but didn't update the apiserver.auth.cookies.domain setting - I did it, restarted, and now it works 🙂 Thanks!
and then call task.connect_configuration probably
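i.e. roughly like this (a small sketch with dummy values; connect_configuration can also take a path to a configuration file):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="config example")

# connect the configuration so it is stored with the task and editable in the UI
config = {"batch_size": 32, "lr": 0.001}
config = task.connect_configuration(config, name="my_config")
```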
RuntimeError: CUDA error: no kernel image is available for execution on the device
I've set dynamic: "strict" in the template of the logs index and I was able to keep the same mapping after doing the reindex
PS: in the new env, I've set num_replicas: 0, so I'm only talking about primary shards…
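For context, the template change boils down to something like this (an illustrative fragment only, not the real ClearML events template; the template name, index pattern and the timestamp field are placeholders):

```
PUT _template/events_log
{
  "index_patterns": ["events-log-*"],
  "settings": {
    "number_of_replicas": 0
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "timestamp": { "type": "date" }
    }
  }
}
```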
Thanks for the help SuccessfulKoala55, the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to do docker-compose down && docker-compose up -d afterwards, and not docker-compose restart
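For reference, the sequence I ran was roughly this (assuming the compose file lives at the default /opt/clearml/docker-compose.yml):

```
# pull the latest compose file from the clearml-server repo
sudo curl -o /opt/clearml/docker-compose.yml https://raw.githubusercontent.com/allegroai/clearml-server/master/docker/docker-compose.yml
docker-compose -f /opt/clearml/docker-compose.yml down
docker-compose -f /opt/clearml/docker-compose.yml up -d
```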
--- /data ----------
   48.4 GiB [##########] /elastic_7
    1.8 GiB [          ] /shared
  879.1 MiB [          ] /fileserver
. 163.5 MiB [          ] /clearml_cache
.  38.6 MiB [          ] /mongo
    8.0 KiB [          ] /redis