Reputation
Badges 1
606 × Eureka!I am going to try it again and send you the relevant part of the logs in a minute. Maybe I am interpreting something wrong.
I used the wrong docker container. The docker container I used had version 11.4. Interestingly, the override from clearml.conf and CUDA_VERSION Env variable did not work there.
With the correct docker container everything works fine. Shame on me.
By host you mean the machine on which the agent is running? How does clearml-agent find the cuda_version?
# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: linux-64
_libgcc_mutex=0.1=conda_forge
_openmp_mutex=4.5=1_llvm
absl-py=0.12.0=pypi_0
aiostream=0.4.2=pypi_0
attrs=20.3.0=pypi_0
blas=1.0=mkl
bzip2=1.0.8=h7b6447c_0
ca-certificates=2020.10.14=0
cached-property=1.5.2=pypi_0
cachetools=4.2.1=pypi_0
certifi=2020.6.20=py37_0
chardet=4.0.0=pypi_0
clearml=0.17.4=pypi_0
cloudpickle=1.6.0=py_0
cudatoolkit=11.1.1=h6406543_8
cycler...
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit~=11.1.1
- pytorch~=1.8.0
Works fine
Do you know how I can make sure I do not have CUDA or a broken installation installed?
[2021-05-07 10:52:00,282] [9] [WARNING] [elasticsearch] POST
` [status:N/A request:60.058s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", lin...
btw: I am pretty sure this used to work, but then stopped work some time ago.
The one I posted on top 22.03-py3
😄
Yea, something like this seems to be the best solution.
I just tried to envrionment setup steps that clearml-agent is doing locally, but with my environment.yml instead of the one that clearml generates.
So I just tried again, but with manual deleting via Web UI.
I tried to run the task with detect_with_conda_freeze: false
instead of true
and got
Executing Conda: /home/tim/miniconda3/condabin/conda install -p /home/tim/.clearml/venvs-builds/3.8 -c defaults -c conda-forge -c pytorch 'pip<20.2' --quiet --json
Pass
Conda: Trying to install requirements:
['pytorch~=1.8.0']
Executing Conda: /home/tim/miniconda3/condabin/conda env update -p /home/tim/.clearml/venvs-builds/3.8 --file /tmp/conda_envh7rq4qmc.yml --quiet --json
Conda error: Unsati...
conda env update -p .clearml/venvs-builds/3.8 ./environment.yml
with environment.yml
name: clearml
channels:
- pytorch
- anaconda
- conda-forge
- defaults
dependencies:
- pytorch==1.8.0
I will try again tomorrow. It s getting late! Thank you for helping so far!
Thu Mar 11 17:52:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56 Driver Version: 460.56 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | ...
Now for some reason everything is gone .. 😕
I guess it started with the usage of the cleanup_service.
Thanks a lot. I somehow missed this.
I have no idea myself, but what the serverfault thread says about man-in-the-middle makes sense. However this also prohibits an automatic solution except for a shared known_hosts file I guess.