channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit~=11.1.1
- pytorch~=1.8.0
Works fine
Wtf? Can you try with = (notice: single, not double)?
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
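(Side note on conda version specs, as I understand them -- a single = is conda's fuzzy match, == is an exact match, and ~= is the pip-style compatible-release operator:)
pytorch=1.8.0     # fuzzy match: any 1.8.0.* build
pytorch==1.8.0    # exact match: 1.8.0 only
pytorch~=1.8.0    # compatible release: >=1.8.0, ==1.8.*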
The problem is that clearml installs cudatoolkit=11.0, but cudatoolkit=11.1 is needed. By setting agent.cuda_version=11.1 in clearml.conf it uses the correct version and installs fine. With version 11.0, conda resolves the conflict by installing the CPU version of pytorch.
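(For reference, a minimal sketch of that override in clearml.conf -- the value here matches this setup, adjust as needed:)
agent {
    # force the CUDA version the agent resolves packages against,
    # instead of the auto-detected one
    cuda_version: 11.1
}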
Do you know how I can make sure I do not have a global CUDA install or a broken installation on this machine?
And how is the
Summary - installed python packages:
conda:
....
section of the log generated?
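(A quick sanity check for the first question -- this just inspects the active environment, nothing ClearML-specific:)
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
conda list | grep -E "pytorch|cudatoolkit"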
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing CUDA on this machine, but idk), and there is no pytorch build for 11.2, so maybe it falls back to CPU?
For some reason it detects CUDA 11.1 (I assume this is what you have installed; the driver's CUDA version is the highest it supports, not necessarily what is installed).
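(One way to see the distinction -- nvidia-smi reports the driver and its maximum supported CUDA, while the toolkit in the env comes from conda; assuming the relevant conda env is active:)
nvidia-smi --query-gpu=driver_version --format=csv,noheader   # driver version only
conda list cudatoolkit                                        # toolkit actually in the env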
@<1523701868901961728:profile|ReassuredTiger98> what do you have in the clearml.conf under "conda_channels" ?
Is this it ?
None
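(The relevant block in clearml.conf looks roughly like this -- the channel list below is an assumption, yours may differ:)
agent {
    package_manager {
        type: conda
        conda_channels: ["pytorch", "conda-forge", "defaults"]
    }
}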
Or there should be an early error when trying to run conda-based tasks on pip-based agents.
drwxr-xr-x 10 root root 4096 Jul 31 2020 .
drwxr-xr-x 14 root root 4096 Jul 31 2020 ..
drwxr-xr-x 2 root root 4096 Feb 4 13:52 bin
drwxr-xr-x 2 root root 4096 Jul 31 2020 etc
drwxr-xr-x 2 root root 4096 Jul 31 2020 games
drwxr-xr-x 2 root root 4096 Jul 31 2020 include
drwxr-xr-x 4 root root 4096 Feb 3 13:40 lib
lrwxrwxrwx 1 root root 9 Dec 10 14:29 man -> share/man
drwxr-xr-x 2 root root 4096 Jul 31 2020 sbin
drwxr-xr-x 7 root root 4096 Jul 31 2020 share
drwxr-xr-x 2 root root 4096 Jul 31 2020 src
Still shows the CPU version when I run conda list.
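(The build string from conda list is the quickest way to tell which variant got installed -- example strings from the 1.8.0 packages, exact names vary:)
conda list pytorch
# pytorch   1.8.0   py3.8_cpu_0                    <- CPU-only build
# pytorch   1.8.0   py3.8_cuda11.1_cudnn8.0.5_0    <- CUDA 11.1 build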
Thu Mar 11 17:52:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    Off  | 00000000:01:00.0 Off |                  N/A |
| 61%   63C    P2   296W / 350W |   8318MiB / 24268MiB |     74%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 3090    Off  | 00000000:21:00.0 Off |                  N/A |
| 30%   29C    P8    20W / 350W |      1MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    133165    C+G   ...s-builds.1/3.7/bin/python     8314MiB |
+-----------------------------------------------------------------------------+
btw: I also tested the clearml-agent on a different machine with Python 3.8, and I get the same problems.
How does clearml-agent create the conda environment?
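(Roughly, the manual equivalent in conda mode -- a sketch under assumptions, not the agent's actual code: it creates a fresh env, installs what it can from conda, and falls back to pip for the rest; the requirements file name here is hypothetical:)
conda create --prefix ./task_env python=3.8 --yes
conda install --prefix ./task_env cudatoolkit=11.1 pytorch=1.8.0 -c pytorch -c conda-forge --yes
./task_env/bin/pip install -r remaining_requirements.txt   # pip-only packages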
@<1523701868901961728:profile|ReassuredTiger98> thank you so much for testing it!
Hi @<1523701868901961728:profile|ReassuredTiger98> when you get to it...
please download the wheel, then install it with
pip3 install -U clearml_agent-0.17.3rc0-py3-none-any.whl
Then run the daemon with the additional --debug argument, basically:
clearml-agent --debug daemon --foreground ...
Once the agent is running, please send the Task's log from your console 🙂
I do not have a global cuda install on this machine. Everything except for the driver is installed via conda.
You suggested this fix earlier, but I am not sure why it didn't work then.
Can you ping me when it is updated in None so I can update my installation?