
The task log is here:
the log on my local machine is here:
I mean, if I enter my host machine’s SSH password, it works. But we will disable password auth in the future, so it’s not an option.
The issue was that nvidia-docker2 was not installed on the machine where I was trying to run the agent. Following this guide fixed it:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
CostlyOstrich36 CLEARML-AGENT version 1.3.0
Sure, will send in a few min when it executes
But what should I do? It does not work, it says incorrect password as you can see
AgitatedDove14, made a new one:
https://pastebin.com/LxLFk7py
Despite having manually installed this torch version, the agent still tries to install it during task execution and fails:
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading " " to pip cache
Collecting torchvision==0.13.1+cu116
File was...
Looking through history I found this link: None
TL;DR: ClearML doesn’t support lightning, but supports pytorch_lightning. Downgrading from the new interface to the old one fixed my issue.
I start clearml-session on my mac this way:
clearml-session --queue gpu --docker registry.gitlab.com/periplo-innovation/project-precog/clearml_config
I resolved the issues by making my own docker image and making all envs the same:
- The env that runs clearml-agent
- The docker env for running tasks in
- The env that requests task execution (my client)
I don’t understand. The current cuda version is 11.7. Installed pytorch version is 1.12.1. Torch can access GPUs, all is fine.
Why does it try to install a different torch version?
(base) boris@adamastor:~$ nvidia-smi
Fri Oct 7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name ...
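When comparing the CUDA version reported by nvidia-smi with the installed torch build, it can help to print exactly which build Python sees. The following is a minimal sketch, not something from the thread: the helper names split_torch_version and describe_torch are hypothetical, and it assumes it is run with the same interpreter the task uses. It only reports what is installed; it does not explain how the agent picks a wheel.

```python
from typing import Optional, Tuple


def split_torch_version(version: str) -> Tuple[str, Optional[str]]:
    """Split a version such as '1.12.1+cu116' into ('1.12.1', 'cu116').

    The part after '+' is the local build tag; it is None when absent.
    """
    base, _, local = version.partition("+")
    return base, (local or None)


def describe_torch() -> str:
    """Describe the torch build visible to this interpreter (hypothetical helper)."""
    try:
        import torch  # imported lazily so the sketch also runs without torch
    except ImportError:
        return "torch is not installed in this environment"
    base, build = split_torch_version(torch.__version__)
    return (
        f"torch {base} (build tag: {build}), "
        f"compiled against CUDA {torch.version.cuda}, "
        f"GPU available: {torch.cuda.is_available()}"
    )
```

Calling print(describe_torch()) in both the agent's environment and the client's environment gives two strings that can be compared directly.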
It's too much of a hack :)
Ok, it makes sense. But it’s running in docker mode and it is trying to ssh into the host machine and failing
The agent is started from a non-root user if that matters
When trying it I realized that my local clearml.conf had the old hostnames still (adamastor.gaiavf.local). Now your script returns the proper value of http://adamastor-office.periploinnovation.com:8081 . I will see if it works now!
Yes, the git user is correct. It does not display the password of course. I tested and the config is definitely coming from clearml.conf
Still, the error persists
Is there a way to check if the port is accessible from my local machine?
(base) boris@adamastor:~/clearml_config$ clearml-agent --version
CLEARML-AGENT version 1.4.0
I can telnet the port from my mac:
(base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C
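The same reachability check can be scripted instead of using telnet. This is a minimal Python sketch: the function name port_is_open and the 3-second default timeout are illustrative choices, not part of ClearML. It only confirms that a TCP connection can be opened, not that SSH authentication will succeed.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and unreachable hosts
        return False
```

For the host and port from the telnet session, the call would be port_is_open("192.168.1.55", 10022).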
I also use TB.
I solved the issue by implementing my own ClearML logger
I don’t have a short version.
I am using community clearml. How do I find out my version?
Also manually installing this torch version succeeds:
(base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
Attempting uninstall: torch
...
Upgraded, the issue persists
Yes, I am able to clone locally on the same server the agent is running on. However, I do it using SSH auth.