Yep, I will add this as an issue. Btw: Should I rather post the kind of questions I am asking as an issue, or do they fit better here?
Or does MinIO delay deletion somehow? Deleting a task via the web interface also does not delete the debug samples on MinIO.
Thanks, I will look into it. For me the weird thing is that saving works and only deletion fails somehow.
That seems to be the case. After parsing the args I run task = Task.init(...) and then task.execute_remotely(queue_name=args.enqueue, clone=False, exit_process=True).
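For context, a minimal sketch of that flow (the argparse setup and the project/task names are placeholders, not my actual script):
import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument("--enqueue", default="default")   # queue name here is a placeholder
args = parser.parse_args()

# register the run, then stop local execution and enqueue it for an agent
task = Task.init(project_name="examples", task_name="remote-run")   # names are placeholders
task.execute_remotely(queue_name=args.enqueue, clone=False, exit_process=True)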
But here is the funny thing:
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
This installs the GPU version.
Is there a way to see the contents of /tmp/conda_envaz1ne897.yml? It seems to be deleted after the task is finished.
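A workaround sketch one could try while the agent is still resolving the environment: keep copying the temporary file away before it is cleaned up (the path pattern is assumed from the file name above):
mkdir -p /tmp/conda_env_backup
# keep a copy of any conda_env*.yml the agent writes to /tmp before it gets removed
while true; do
  cp /tmp/conda_env*.yml /tmp/conda_env_backup/ 2>/dev/null
  sleep 1
done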
Perfect, will try it. fyi: The conda_channels that I used are from clearml-agent init
Is this working in the latest version? clearml-agent falls back to /usr/bin/python3.8 no matter how I configure clearml.conf.
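Roughly what I mean by configuring it (a sketch; I am assuming agent.python_binary is the relevant key, and the miniconda path is just a placeholder):
agent {
  # interpreter the agent should use when building venvs, instead of /usr/bin/python3.8
  python_binary: "/home/tim/miniconda3/bin/python3.8"
}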
Just want to make sure, so I can investigate what's wrong with my machine if it is working for you.
Thank you very much. I tested it on a different machine now and it works as intended. So there must be something misconfigured with this one machine.
CostlyOstrich36 Actually, no container exits, so I guess if it is because of OOM, like SuccessfulKoala55 implies, then maybe a process inside the container gets killed and the container hangs? Is this possible?
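To check the OOM theory one could look at Docker's own container state and the kernel log (the container name is an assumption):
# true here means Docker saw the container get OOM-killed
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' clearml-elastic
# OOM kills of processes inside a container usually show up here as well
dmesg -T | grep -i 'out of memory'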
SuccessfulKoala55 I did not observe Elastic using much RAM (at least right after starting). Doesn't this line in the docker-compose control the RAM usage? ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true
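For context, that line sits in the elasticsearch service of the docker-compose roughly like this (a sketch; the image tag and surrounding keys are assumptions):
elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2   # tag is an assumption
  environment:
    ES_JAVA_OPTS: -Xms2g -Xmx2g -Dlog4j2.formatMsgNoLookups=true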
For now I can tell you that with conda_freeze: true it fails, but with conda_freeze: false it works!
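For reference, this is the kind of place the flag sits in clearml.conf (a sketch; I am assuming agent.package_manager is the right scope):
agent {
  package_manager {
    type: conda
    # true is the setting that fails for me, false works
    conda_freeze: false
  }
}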
Do you know how I can make sure I do not have CUDA installed, or a broken CUDA installation?
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
I do not have a global cuda install on this machine. Everything except for the driver is installed via conda.
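Checks one could run to rule out a stray system-wide CUDA toolkit (plain CLI calls, nothing clearml-specific):
which nvcc || echo "no nvcc on PATH"   # the toolkit compiler should be absent
ls -d /usr/local/cuda* 2>/dev/null     # no system-wide toolkit directories either
nvidia-smi                             # the driver itself, and the CUDA version it reports
conda list cudatoolkit                 # the only cudatoolkit should live inside conda envs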
I tried to run the task with detect_with_conda_freeze: false instead of true and got:
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: Unsati...
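For reference, the flag mentioned above sits in clearml.conf roughly like this (a sketch; I am assuming it belongs under sdk.development):
sdk {
  development {
    # when true, clearml logs the full conda environment as the task requirements
    detect_with_conda_freeze: false   # switched from true for this test
  }
}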
ca-certificates 2021.1.19 h06a4308_1
certifi 2020.12.5 py38h06a4308_0
cudatoolkit 11.0.221 h6bb024c_0
ld_impl_linux-64 2.33.1 h53a641e_7
libedit 3.1.20191231 h14c3975_1
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
ncurses ...
clearml will register preinstalled conda packages as requirements.
So it seems to be definitely a problem with docker and not with clearml. However, I do not get why it works for you but on none of my machines (all Ubuntu 20.04 with docker 20.10).
In the beginning my config file was not empty 😕