conda_channels: ["defaults", "conda-forge", "pytorch", ]
btw: why is agent.package_manager
and agent attribute. Imo it does not make sense because conda can install pip packages, but pip cannot install conda packages which can lead to install failures, right?
Damn, okay I'll make sure we fix the order.
Could you verify the ~= works as intended (if the order id correct)
Just tested again. The ordering definitly matters.
Well, in that case, just change the order it should solve it (I'll make sure we have that as the default:
conda_channels: ["pytorch", "conda-forge", "defaults", ]
It should solve the issue 🙂
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
(This is why we recommend using pip, because it is stable and clearml-agent takes care of pytorch/cuda verions)
From the logs when ran with --foreground I
I do not see any conda create
command.
I just tried to envrionment setup steps that clearml-agent is doing locally, but with my environment.yml instead of the one that clearml generates.
Hmm maybe this is the issue, :
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
This makes no sense, conda is saying pytorch=1.8 needs cudatoolkit <10.2/10.3 but actually it needs cudatoolkit 11.1
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit==11.1.1
- pytorch==1.8.0
Gives CPU version
No problem! I profit so much from clearml 🙂
okay, I'll make sure we order it correctly
Yes that is exactly what I will make sure we change :)
I try it one more time just to make sure
Quick question: Where again does clearml place the venv? I wanna take a look into it after the task has failed
@<1523701868901961728:profile|ReassuredTiger98> what do you have in the clearml.conf under "conda_channels" ?
Is this it ?
None
Or there should be an early error for trying to run conda based tasks on pip agents
drwxr-xr-x 10 root root 4096 Jul 31 2020 .
drwxr-xr-x 14 root root 4096 Jul 31 2020 ..
drwxr-xr-x 2 root root 4096 Feb 4 13:52 bin
drwxr-xr-x 2 root root 4096 Jul 31 2020 etc
drwxr-xr-x 2 root root 4096 Jul 31 2020 games
drwxr-xr-x 2 root root 4096 Jul 31 2020 include
drwxr-xr-x 4 root root 4096 Feb 3 13:40 lib
lrwxrwxrwx 1 root root 9 Dez 10 14:29 man -> share/man
drwxr-xr-x 2 root root 4096 Jul 31 2020 sbin
drwxr-xr-x 7 root root 4096 Jul 31 2020 share
drwxr-xr-x 2 root root 4096 Jul 31 2020 src
Still shows CPU version when I run conda list
Thu Mar 11 17:52:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 61% 63C P2 296W / 350W | 8318MiB / 24268MiB | 74% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:21:00.0 Off | N/A |
| 30% 29C P8 20W / 350W | 1MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 133165 C+G ...s-builds.1/3.7/bin/python 8314MiB |
+-----------------------------------------------------------------------------+
btw: I also tested the clearml-agent running on a different machine and with python 3.8 and I get the same problems.
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit~=11.1.1
- pytorch~=1.8.0
Works fine