It's always preferred to use conda_freeze: false
That said, if you do use conda_freeze: true
it should also freeze the cudatoolkit, so it should have worked.
BTW when you say it worked, is it 0.17.2 version or the hacked RC I sent ?
Is ther a way to see the contents of /tmp/conda_envaz1ne897.yml
? Seems to be deleted after the task is finihsed
Type "help", "copyright", "credits" or "license" for more information.
>>> from clearml_agent.helper.gpu.gpustat import get_driver_cuda_version
>>> get_driver_cuda_version()
'110'
My driver says "CUDA Version: 11.2" (I am not even sure this is correct, since I do not remember installing code in this machine, but idk) and there is no pytorch for 11.2, so maybe it fallbacks to cpu?
Okay this is very close to what the agent is building:
Could you start a new conda env,
then install cudatoolkit=11.1
then run:
conda env update -p <conda_env_path_here> --file the_env_yaml.yml
I tried to run the task with detect_with_conda_freeze: false
instead of true
and got
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: UnsatisfiableError: The following specifications were found to be incompatible with a past
explicit spec that is not an explicit spec in this operation (cudatoolkit):
- pytorch~=1.8.0 -> cudatoolkit[version='>=10.1,<10.2|>=10.2,<10.3']
The following specifications were found to be incompatible with each other:
Package cudatoolkit conflicts for:
cudatoolkit=11.0
Conda: Installing requirements: step 2 - using pip:
['clearml==0.17.4', 'tensorboard==2.4.1', 'pytorch~=1.8.0']
Collecting tensorboard==2.4.1
Using cached tensorboard-2.4.1-py3-none-any.whl (10.6 MB)
ERROR: Could not find a version that satisfies the requirement pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3)) (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch~=1.8.0 (from -r /tmp/cached-reqsubuv0zrf.txt (line 3))
Command 'source /home/tim/miniconda3/etc/profile.d/conda.sh && conda activate /home/tim/.clearml/venvs-builds/3.8 && pip install -r /tmp/cached-reqsubuv0zrf.txt' returned non-zero exit status 1.
send me the conda freeze:
# Name Version Build Channel
...
fyi: NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2
Thu Mar 11 17:52:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 3090 Off | 00000000:01:00.0 Off | N/A |
| 61% 63C P2 296W / 350W | 8318MiB / 24268MiB | 74% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 3090 Off | 00000000:21:00.0 Off | N/A |
| 30% 29C P8 20W / 350W | 1MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 133165 C+G ...s-builds.1/3.7/bin/python 8314MiB |
+-----------------------------------------------------------------------------+
I just tried to envrionment setup steps that clearml-agent is doing locally, but with my environment.yml instead of the one that clearml generates.
@<1523701868901961728:profile|ReassuredTiger98> if you use the latest RC! i sent and run with --debug
in the log you will see the full /tmp/conda_envaz1ne897.yml
content
Here it is copied from your log, do you want to see if this one works:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
- jpeg~=9b.0
- kiwisolver~=1.3.1
- lame~=3.100
- lcms2~=2.11
- ld_impl_linux-64~=2.33.1
- libedit~=3.1.20191231
- libffi~=3.3
- libgcc-ng~=9.3.0
- libgfortran-ng~=7.3.0
- libiconv~=1.16
- libpng~=1.6.37
- libstdcxx-ng~=9.3.0
- libtiff~=4.1.0
- libuv~=1.41.0
- llvm-openmp~=11.0.1
- lz4-c~=1.9.3
- matplotlib-base~=3.3.4
- mkl~=2020.4
- mkl-service~=2.3.0
- mkl_fft~=1.3.0
- mkl_random~=1.2.0
- ncurses~=6.2
- nettle~=3.6
- networkx~=2.5
- ninja~=1.10.2
- numpy~=1.19.2
- numpy-base~=1.19.2
- olefile~=0.46
- openh264~=2.1.1
- openssl~=1.1.1j
- pyparsing~=2.4.7
- python~=3.7.10
- python-dateutil~=2.8.1
- python_abi~=3.7
- pytorch~=1.8.0
- pywavelets~=1.1.1
- pyyaml~=5.3.1
- readline~=8.1
- scikit-image~=0.17.2
- scipy~=1.6.1
- setuptools~=52.0.0
- six~=1.15.0
- sqlite~=3.33.0
- tifffile~=2020.10.1
- tk~=8.6.10
- toolz~=0.11.1
- torchaudio~=0.8.0
- torchvision~=0.9.0
- tornado~=6.1
- typing_extensions~=3.7.4.3
- wheel~=0.36.2
- xz~=5.2.5
- yaml~=0.2.5
- zlib~=1.2.11
- zstd~=1.4.9
And the one with the CPU version? is it with "~=" or "="?
Okay this seems correct:
pytorch=1.8.0=py3.7_cuda11.1_cudnn8.0.5_0
I can't seem to find what's the diff between the two.
Give me a second let me check if I can reproduce it somehow.
@<1523701868901961728:profile|ReassuredTiger98> it works on my machine 😞
Would it help you diagnose this problem if I ran conda env create --file=environment.yml
and see whether it works?
Do you know how I can get this version?
For now I can tell you that with conda_freeze: true
it fails, but with conda_freeze: false
it works!