Yes, the git user is correct. It does not display the password of course. I tested and the config is definitely coming from clearml.conf
Still, the error persists
I don’t understand. The current CUDA version is 11.7, and the installed pytorch version is 1.12.1. Torch can access the GPUs; all is fine.
Why does it try to install a different torch version?
` (base) boris@adamastor:~$ nvidia-smi
Fri Oct  7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name ... `
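For what it’s worth, this is roughly how I double-check it from Python on that machine (a trivial sketch, nothing ClearML-specific):
` import torch

# Quick sanity check on the agent machine: torch version, the CUDA it was
# built against, and whether the GPUs are actually visible.
print(torch.__version__)          # 1.12.1 here
print(torch.version.cuda)         # CUDA version the torch build targets
print(torch.cuda.is_available())  # True on this machine
print(torch.cuda.device_count())  # number of visible GPUs `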
Ok, it makes sense. But it’s running in docker mode and it is trying to ssh into the host machine and failing
I don’t have a short version.
I am using community clearml. How do I find out my version?
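For the client side at least, I assume the installed versions can be printed like this (a sketch using importlib.metadata; the server version itself I’d still have to look up in the web UI):
` import importlib.metadata as md

# Sketch: print the client-side package versions installed in this environment.
print(md.version("clearml"))        # SDK
print(md.version("clearml-agent"))  # agent (only if installed in this env) `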
Despite having manually installed this torch version, during task execution the agent still tries to install it somehow and fails:
` INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torchvision==0.13.1+cu116
File was... `
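One thing I might try on the training-script side (just a sketch based on my reading of the SDK, so treat the call as an assumption; I believe it has to run before Task.init) is freezing the exact local environment into the task, so the agent reproduces it instead of re-resolving torch on its own:
` from clearml import Task

# Sketch (assumption): record a full freeze of the current local environment as
# the task's "installed packages", so the agent installs exactly these versions.
Task.force_requirements_env_freeze(force=True)

task = Task.init(project_name="my_project", task_name="env_debug")  # placeholder names `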
Here’s the error I get:
https://justpaste.it/7aom5
It’s trying to downgrade pytorch to 1.12.1 for some reason (why?), using a build for an outdated CUDA version (I have 11.7; it tries to use pytorch for CUDA 11.6), and finally it crashes.
Let me get the exact error for you
When trying it I realized that my local clearml.conf still had the old hostname (adamastor.gaiavf.local). Now your script returns the proper value of http://adamastor-office.periploinnovation.com:8081. I will see if it works now!
I start clearml-session on my mac this way:
` clearml-session --queue gpu --docker registry.gitlab.com/periplo-innovation/project-precog/clearml_config `
The agent is started from a non-root user if that matters
I understand the idea, it makes sense. But it does not seem to work as intended. Why does it try to install a different pytorch? And why does it fail, when the same install works if I do it manually? The env that’s executing the task has the same pytorch.
Anyway, what should I do? So far my workers have not executed a single task; it always breaks with these env errors.
` (base) boris@adamastor:~/clearml_config$ clearml-agent --version
CLEARML-AGENT version 1.4.0 `
Yes, I am able to clone locally on the same server the agent is running on. However I do it using ssh auth
CostlyOstrich36 in installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
Which is the same as I have locally and on the server that runs clearml-agent.
Is there a way to check if the port is accessible from my local machine?
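Actually I can probably test that myself from the mac with a small socket check (rough sketch; host and port taken from the fileserver URL above):
` import socket

# Rough sketch: try to open a TCP connection to the fileserver port from my laptop.
host, port = "adamastor-office.periploinnovation.com", 8081
try:
    with socket.create_connection((host, port), timeout=5):
        print(f"{host}:{port} is reachable")
except OSError as err:
    print(f"{host}:{port} is NOT reachable: {err}") `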
"realmodelonly.pkl"
should be the full path, or just the file name?
Btw, it seems the docker container runs with network=host.
Looking through history I found this link: None
Tldr: ClearML doesn’t support lightning, but supports pytorch_lightning. Downgrading from the new interface to the old one fixed my issue.
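In code the fix was basically just switching the import back to the old package (sketch; the Trainer arguments are placeholders):
` # Before: the new unified package, which broke the ClearML integration for me
# import lightning as L
# trainer = L.Trainer(...)

# After: the old interface that ClearML supports
import pytorch_lightning as pl

trainer = pl.Trainer(max_epochs=1)  # placeholder args `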
I am doing clearml-agent --docker … --foreground --gpus 1
Upgraded, the issue persists
Sure, will send in a few min when it executes
AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task
Pytorch is configured on the machine that’s running the agent. It’s also in the requirements.