I resolved the issues by making my own docker image and making all the envs the same (roughly as sketched below):
- the env that runs clearml-agent,
- the docker env that tasks run in,
- the env that requests task execution (my client).
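A minimal sketch of that setup (the image name, tag, and queue here are placeholders, not from this thread):

```bash
# Build one image that matches the client environment, then run the agent
# against that same image so tasks execute in it as well.
docker build -t my-registry/clearml-runtime:py310-cu117 .
clearml-agent daemon --queue default --docker my-registry/clearml-runtime:py310-cu117
```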
Also manually installing this torch version succeeds:
```
(base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
  Attempting uninstall: torch
...
```
I don’t understand. The current CUDA version is 11.7, the installed PyTorch version is 1.12.1, torch can access the GPUs, and all is fine.
Why does it try to install a different torch version?
```
(base) boris@adamastor:~$ nvidia-smi
Fri Oct  7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name ...
```
The failure is that it does not even run
Well I don’t want that! My local machine is a Mac with no GPU. But I want to execute my code on a server with GPUs. I don’t want my local environment, I want the one configured for the agent!
I understand the idea, and it makes sense. But it does not seem to work as intended. Why does it try to install a different PyTorch? And why does it fail when the same thing works if I do it manually? The env that’s executing the task has the same PyTorch.
Anyway, what should I do? So far my workers have not executed a single task; it always breaks with these env errors.
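One option that matches this goal, as a hedged sketch (project, task name, script, and queue are placeholders): enqueue the task without running it locally, so the environment is resolved on the agent’s machine rather than on the Mac.

```bash
# Create and enqueue a task directly from a script; nothing executes locally.
clearml-task --project my-project --name my-run --script train.py --queue default
```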
Despite my having manually installed this torch version, the agent still tries to install it somehow during task execution, and fails:

```
INFO:clearml_agent.commands.worker:Downloading "…" to pip cache
Collecting torch==1.12.1+cu116
  File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading "…" to pip cache
Collecting torchvision==0.13.1+cu116
  File was...
```
I don’t have a short version.
I am using community ClearML. How do I find out my version?
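For reference, both versions can be read from the shell; a quick sketch using standard commands for pip-installed packages:

```bash
clearml-agent --version                                   # agent version
python -c "import clearml; print(clearml.__version__)"    # SDK version
pip show clearml clearml-agent                            # both, with install paths
```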
I guess I am out of ideas. The config is wrong somewhere. Maybe double-check all the configs? It’s taking the value from somewhere!
Let me get the exact error for you
```
(agent) adamastor@adamastor:~/clearml_agent$ python -c "import torch; print(torch.__version__)"
1.12.1
```
I tried it.
This time the agent was run with the docker image python (https://hub.docker.com/_/python).
Gets stuck on:

```
Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
```
`ps aux` inside the container reads:

```
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c...
```

So the only process is something called `/usr/local/bin/python3.10 -u -m clearml_agent execute`.
So I guess pip install finished working, but the task is evidently not being executed.
So I guess the container can’t access the ClearML API because it’s pointed at localhost?
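A quick way to test that theory, as a sketch (the container ID is a placeholder; 8008 is the default ClearML API port, and this assumes curl exists in the image):

```bash
# Ping the API server from inside the task container. If clearml.conf points
# at localhost and the container is not on the host network, this won't connect.
docker exec -it <container_id> curl -s http://localhost:8008/debug.ping
```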
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3
For a hacky way, you can do docker ps and look at the docker run command. I believe it contains the task ID, so you can grep by task ID, as in the sketch below.
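Something like this (the task ID is a placeholder):

```bash
# --no-trunc keeps the full `docker run` command line visible for grepping
docker ps --no-trunc | grep <task_id>
```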
PyTorch is configured on the machine that’s running the agent. It’s also in the requirements.
I have no idea what it is doing
But what should I do? It does not work; it says "incorrect password", as you can see.
Here’s the error I get:
https://justpaste.it/7aom5
It’s trying to downgrade PyTorch to 1.12.1 for some reason (why?), using a build for an outdated CUDA (I have 11.7; it tries to use PyTorch for CUDA 11.6). Finally it crashes.
What I am seeing is that the agent always fails trying to install packages that I never asked for.
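If the root cause is the agent resolving wheels for the wrong CUDA release, one hedged workaround is pinning the CUDA version in the agent’s clearml.conf; the key appears, commented out, in the default agent config:

```bash
# Locate the setting on the agent machine, then set it inside the agent {} section,
# e.g.  cuda_version: "11.7"   (value here is an assumption matching nvidia-smi)
grep -n "cuda_version" ~/clearml.conf
```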
Btw, it seems the docker runs with network=host.
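One way to confirm the network mode (the container ID is a placeholder), plus the clearml.conf key that usually carries extra docker flags:

```bash
# Prints "host" when the container uses host networking
docker inspect <container_id> --format '{{.HostConfig.NetworkMode}}'
# In clearml.conf this is typically driven by something like:
#   agent.extra_docker_arguments: ["--network=host"]
```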
CostlyOstrich36 CLEARML-AGENT version 1.3.0
The issue disappeared after I switched from docker mode to pip mode
When trying it I realized that my local clearml.conf still had the old hostnames (adamastor.gaiavf.local). Now your script returns the proper value of http://adamastor-office.periploinnovation.com:8081. I will see if it works now!
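A sanity check for this class of problem, as a sketch (keys follow the standard clearml.conf layout):

```bash
# The SDK reads its endpoints from here; expect reachable, non-localhost hosts.
grep -A 4 'api {' ~/clearml.conf
# e.g.  files_server: http://adamastor-office.periploinnovation.com:8081
```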
Looking through history I found a relevant link.
TL;DR: ClearML doesn’t support lightning, but supports pytorch_lightning. Downgrading from the new interface to the old one fixed my issue.
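Concretely, the change is at the import level; a sketch, given that (per the link) ClearML’s auto-logging patches pytorch_lightning but not the new unified package:

```bash
# Verify the old-interface package is importable in the task env
python -c "import pytorch_lightning as pl; print(pl.__version__)"
# In the training code, use:    import pytorch_lightning as pl
# instead of the new unified:   import lightning
```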