But here is the funny thing:
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
Installs GPU
Is there a way to see the contents of /tmp/conda_envaz1ne897.yml? It seems to be deleted after the task is finished.
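One possible workaround, as a minimal sketch: poll the temp directory while the task is being set up and copy any matching file before it is removed. This assumes the file is written to /tmp with the conda_env prefix and .yml suffix, as in the path above. The names snapshot_conda_env_files, watch, and the destination directory are hypothetical and not part of clearml-agent.

```python
import glob
import os
import shutil
import time


def snapshot_conda_env_files(src_dir, dst_dir, pattern="conda_env*.yml"):
    """Copy files matching `pattern` from src_dir into dst_dir; return the new copies."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for path in glob.glob(os.path.join(src_dir, pattern)):
        target = os.path.join(dst_dir, os.path.basename(path))
        if os.path.exists(target):
            # Already saved on an earlier poll.
            continue
        try:
            shutil.copy2(path, target)
            copied.append(target)
        except FileNotFoundError:
            # The file was deleted between listing and copying; skip it.
            pass
    return sorted(copied)


def watch(src_dir="/tmp", dst_dir="./conda_env_snapshots", interval=0.2, duration=60.0):
    """Poll src_dir for `duration` seconds and keep a copy of every matching file."""
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        for target in snapshot_conda_env_files(src_dir, dst_dir):
            print("saved", target)
        time.sleep(interval)
```

Running watch() in a second terminal while the agent executes the task should leave a copy of the generated environment file in ./conda_env_snapshots, provided the file exists for longer than the polling interval.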
Perfect, will try it. fyi: The conda_channels that I used are from clearml-agent init
Is this working in the latest version? clearml-agent falls back to /usr/bin/python3.8 no matter how I configure clearml.conf.
Just want to make sure, so I can investigate what's wrong with my machine if it is working for you.
Thank you very much. I tested it on a different machine now and it works as intended. So there must be something misconfigured on this one machine.
CostlyOstrich36 Actually no container exits, so I guess if it's because of OOM like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container hangs? Is this possible?
SuccessfulKoala55 I did not observe elastic using much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage? ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
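For reference, -Xms and -Xmx are standard JVM flags that set the initial and maximum heap size, so that line caps the Java heap at 2 GB. The JVM can still use memory outside the heap, so the process may take more than that in total. A small sketch for reading those values out of such a string (parse_jvm_heap is a hypothetical helper, not part of ClearML or Elasticsearch):

```python
import re

# Multipliers for the size suffixes accepted by -Xms / -Xmx.
_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_jvm_heap(java_opts):
    """Return a dict mapping 'Xms' / 'Xmx' to a size in bytes for the flags found."""
    sizes = {}
    for flag, number, unit in re.findall(r"-(Xms|Xmx)(\d+)([kKmMgG]?)\b", java_opts):
        sizes[flag] = int(number) * _UNITS[unit.lower()]
    return sizes


if __name__ == "__main__":
    opts = "-Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true"
    for flag, size in parse_jvm_heap(opts).items():
        print(flag, size / 1024 ** 3, "GiB")
```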
For now I can tell you that with conda_freeze: true it fails, but with conda_freeze: false it works!
Do you know how I can make sure I do not have CUDA, or a broken installation of it, installed?
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
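As far as I can tell, that string is the driver's CUDA version with the dot removed, so '110' would mean 11.0. A minimal sketch for decoding it, assuming the last digit is always the minor version (split_cuda_version is a hypothetical helper, not a clearml-agent function):

```python
def split_cuda_version(compact):
    """Split a compact CUDA version string such as '110' into (major, minor).

    Assumption: the final digit is the minor version and the rest is the major.
    """
    if not compact.isdigit() or len(compact) < 2:
        raise ValueError(f"unexpected CUDA version string: {compact!r}")
    return int(compact[:-1]), int(compact[-1])


if __name__ == "__main__":
    major, minor = split_cuda_version("110")
    print(f"driver supports CUDA {major}.{minor}")
```

The resulting pair can then be compared against the cudatoolkit version that conda resolves.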
I do not have a global cuda install on this machine. Everything except for the driver is installed via conda.
I tried to run the task with detect_with_conda_freeze: false instead of true and got
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: Unsati...
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses ...
However, I cloned the experiment again via the web UI. Then I enqueued it.
Hey, thank you for answering.
I know this issue and I have it sometimes, but my current issue is a direct result of me trying to make SSL work. So I am not asking for help in solving my problem, only for help with how to debug it: finding out which step leads to the artifact not being deleted (e.g. the fileserver cannot be reached from wherever the delete request is sent).
At least when you use docker containers the agent will reuse the existing python environment.
I just checked and my user is part of the docker group.
Sounds like a good hack, but not like a good solution 😄 But thank you anyways! 🙂
No reason in particular. How many people work at http://allegro.ai ?
Thanks a lot. But even for a user, I cannot set a default for all projects, right?