So only a short update for today: I did not yet start a run with conda 4.7.12.
But one question: conda cannot actually be at fault here, right? I can install pytorch just fine locally on the agent when I do not use clearml(-agent).
I just wanna add: I can run this task on the same workstation with the same conda installation just fine.
Or there should be an early error for trying to run conda-based tasks on pip agents.
The problem is that clearml installs cudatoolkit=11.0, but cudatoolkit=11.1 is needed. By setting agent.cuda_version=11.1 in clearml.conf it uses the correct version and installs fine. With version 11.0, conda will resolve the conflicts by installing the pytorch CPU version.
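For reference, a minimal clearml.conf sketch of that override; 11.1 is just the version this setup needs, so adjust it to whatever the driver actually supports:

agent {
    # force the CUDA version used for package resolution instead of the
    # auto-detected one (which came out as 11.0 here)
    cuda_version = "11.1"
}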
This is my environment, installed from the env file. Training works just fine here:
Or better, some cache option. Otherwise the cron job is what I will use 🙂 Thanks again
One question: Does clearml resolve the CUDA version from the driver or from conda?
Yes, that works fine. Just the http vs https was the problem. The UI will automatically change s3://<minio-address>:<port> to http://<minio-address>:<port> in http://myclearmlserver.org/settings/webapp-configuration . However, what I need is https://<minio-address>:<port>.
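On the SDK/agent side, I believe the matching piece is the secure flag in the sdk.aws.s3 credentials block of clearml.conf; host, key and secret below are placeholders for the MinIO endpoint:

sdk {
    aws {
        s3 {
            credentials: [
                {
                    # MinIO endpoint, placeholder host/port
                    host: "<minio-address>:<port>"
                    key: "<access-key>"
                    secret: "<secret-key>"
                    multipart: false
                    # talk to the endpoint over https instead of http
                    secure: true
                }
            ]
        }
    }
}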
name: core
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1
- _openmp_mutex=4.5
- blas=1.0
- bzip2=1.0.8
- ca-certificates=2020.12.5
- certifi=2020.12.5
- cudatoolkit=11.1.1
- ffmpeg=4.3
- freetype=2.10.4
- gmp=6.2.1
- gnutls=3.6.13
- jpeg=9b
- lame=3.100
- lcms2=2.11
- ld_impl_linux-64=2.33.1
- libedit=3.1.20191231
- libffi=3.3
- libgcc-ng=9.3.0
- libiconv=1.16
- libpng=1.6.37
- libstdcxx-ng=9.3.0
- libtiff...
Okay, no worries. I will check first. Thanks for helping!
One more thing: The cuda_version that clearml finds automatically is wrong.
I see, so it is actually not related to clearml 🎉
For example, I run a task remotely. Now I decide I want to rerun it, but with a slightly changed parameter. So I clone the task and edit the parameter in the WebUI, then submit the task to a queue. When the clearml-agent pulls the task and tries to install the requirements, it will fail, since the task requirements now contain packages that had been preinstalled in the environment (e.g. nvidia docker). These packages may not be available via pip, so the run will fail.
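In case it is clearer, here is the same clone-edit-enqueue flow scripted with the SDK instead of the WebUI; the task id, parameter name and queue name are made up for the example:

from clearml import Task

# grab the original, already executed task (placeholder id)
original = Task.get_task(task_id="<original-task-id>")

# clone it; the clone starts as a draft, so it can still be edited
cloned = Task.clone(source_task=original, name="rerun with changed parameter")

# change a single hyperparameter (section/name depend on how it was logged)
cloned.set_parameter("General/learning_rate", 0.001)

# send the edited clone to an agent queue
Task.enqueue(cloned, queue_name="default")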
I'll add creating an issue to my todo list.
AgitatedDove14 Thank you, that explains it.
When the task is aborted, the logs will show up, but the scalar logs never appear. The scalar logs only appear when the task finishes.
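A minimal sketch of what I mean; the explicit flush at the end is just me experimenting and an assumption on my part, not something I normally add:

from clearml import Task

task = Task.init(project_name="debug", task_name="scalar-flush-test")
logger = task.get_logger()

for i in range(1000):
    # these scalars are the ones that only show up once the task finishes
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)

# try to push any buffered reports before the task gets aborted
task.flush(wait_for_uploads=True)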
SweetBadger76 I am using the Cleanup Service
Ah, sorry, I should have been more specific. I mean on the ClearML server.
Ok. I just wanted to make sure I have configured my agent properly. Just to confirm: I have to set it on all agents?
Thank you very much. I also saw a solution based on systemd and several others, so I am wondering what the best way is, or whether it even matters.
mytask.get_logger().current_logger().set_default_upload_destination("s3://ip:9000/clearml")
this is what I do. Do you do the same?
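For comparison, a variant I have seen is passing the destination straight to Task.init via output_uri (same placeholder endpoint and bucket); I am not certain it covers exactly the same uploads as set_default_upload_destination, so take it as a sketch:

from clearml import Task

# output_uri points the task's uploads at the MinIO bucket;
# endpoint and bucket are the placeholders from above
task = Task.init(
    project_name="debug",
    task_name="upload-destination-test",
    output_uri="s3://ip:9000/clearml",
)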
Thank you for the quick reply. Does anyone know whether there is an option to let docker delete images after the container exits?
Is sdk.development.default_output_uri used with s3://ip:9000/clearml or with ip:9000/clearml ?
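For what it is worth, this is how I ended up writing it, with the full URI including the scheme (endpoint and bucket are the placeholders from above; treat this as my working guess rather than an official answer):

sdk {
    development {
        # full URI, scheme included, pointing at the MinIO bucket
        default_output_uri: "s3://ip:9000/clearml"
    }
}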