It's too much of a hack :)
CostlyOstrich36, in installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as I have locally and on the server that runs clearml-agent.
Yeah, pytorch is a must. This script is just for testing, but after this I need to train on GPUs.
For a hacky way you can do `docker ps` and see the `docker run` command. I believe it contains the task id, so you can grep by task id:
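Something like this (`<task-id>` is a placeholder for the actual ClearML task id):
` # show full container commands and filter by task id
docker ps --no-trunc | grep <task-id> `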
The issue disappeared after I switched from docker mode to pip mode
Sure, will do tomorrow
Looking through history I found this link: None
Tl;dr: ClearML doesn't support `lightning`, but supports `pytorch_lightning`. Downgrading from the new interface to the old one fixed my issue.
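In other words, the difference is just which package the training script imports. A quick sanity check (a sketch; only the import names matter):
` # the old-style package, which ClearML's auto-logging hooks into:
python -c "import pytorch_lightning"
# the new unified package, which (at the time) it did not:
python -c "import lightning" `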
I don't have a short version.
I am using community clearml. How do I find out my version?
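A couple of standard ways to check it:
` python -c "import clearml; print(clearml.__version__)"
clearml-agent --version
pip show clearml `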
Here’s the error I get:
https://justpaste.it/7aom5
It’s trying to downgrade pytorch to 1.12.1 for some reason (why?), using a build for an outdated CUDA (I have 11.7, it tries to use pytorch for CUDA 11.6), and finally crashes.
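If the agent is mis-detecting CUDA, the version can be pinned explicitly. A sketch, assuming the agent honors the `CUDA_VERSION` override (worth double-checking against your agent version's docs):
` # force the agent's CUDA detection instead of letting it guess:
CUDA_VERSION=11.7 clearml-agent daemon --queue gpu --gpus 0 --foreground
# equivalently, set agent.cuda_version: "11.7" in clearml.conf `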
I guess I am out of ideas. The config is wrong somewhere. Maybe double check all the configs? It’s taking the value from somewhere!
` (agent) adamastor@adamastor:~/clearml_agent$ python -c "import torch; print(torch.__version__)"
1.12.1 `
The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
I mean if I enter my host machine ssh password it works. But we will disable password auth in the future, so it's not an option.
The agent is started with this command:
` clearml-agent --debug daemon --queue gpu --gpus 0 --foreground --docker <gitlab org registry>/project-precog/clearml_config `
The agent is started from a non-root user if that matters
The task log is here:
The log on my local machine is here:
I can telnet the port from my mac:
` (base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C `
I tried it.
This time the agent was run with the docker image python ( https://hub.docker.com/_/python )
Gets stuck on:
` Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent `
`ps aux` inside the container reads:
` (base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c...
I guess this pip package installation happens as part of the docker build
Also manually installing this torch version succeeds:
` (base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
Attempting uninstall: torch
...
Pytorch is configured on the machine that’s running the agent. It’s also in requirements
The agent is running in docker mode. The host OS is Ubuntu.
On the agent side it's trying to install different pytorch versions (even though the env already has it all configured), then fails with `torch_<something>.whl is not a valid wheel for this system`.
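One thing worth trying (a sketch, not verified on this setup): tell the agent to reuse the packages already baked into the docker image instead of re-resolving torch:
` # in the agent's clearml.conf:
#   agent.package_manager.system_site_packages: true
# then restart the daemon so it picks the setting up:
clearml-agent daemon --queue gpu --gpus 0 --foreground --docker <image> `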
Is there some minimal example of a docker env agent I can run, just to see that it works?
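Something minimal along these lines should be enough to verify docker mode end to end (queue name and image are arbitrary choices):
` clearml-agent daemon --queue default --docker python:3.10 --foreground
# then enqueue any toy task to the "default" queue and watch the console log `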
This issue was resolved by setting the correct clearml.conf
(replacing localhost with a public hostname for the server) 🙂
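Concretely, the check was along these lines (`adamastor.example.net` is a placeholder for the real hostname):
` # make sure nothing still points at localhost in the config:
grep -n "localhost" ~/clearml.conf
# the api section should use a hostname reachable from other machines, e.g.
#   api_server: http://adamastor.example.net:8008 `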
Sure, will send in a few min when it executes
Btw it seems the docker container runs with network=host
I don’t understand. The current cuda version is 11.7. Installed pytorch version is 1.12.1. Torch can access GPUs, all is fine.
Why does it try to install a different torch version?
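As far as I can tell, the agent re-resolves the recorded `torch==1.12.1` against the CUDA toolkit it detects, and 1.12.1 was never published with a cu117 build (CUDA 11.7 wheels start at torch 1.13), so it falls back to the closest cu116 wheel. Pinning that exact build should at least make the resolution deterministic; a sketch:
` # pin the build the agent is actually resolving to:
python -m pip install "torch==1.12.1+cu116" \
    --extra-index-url https://download.pytorch.org/whl/cu116 `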
` (base) boris@adamastor:~$ nvidia-smi
Fri Oct 7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name ...