
I guess I am out of ideas. The config is wrong somewhere. Maybe double check all the configs? It’s taking the value from somewhere!
I start clearml-session on my Mac this way:
```
clearml-session --queue gpu --docker registry.gitlab.com/periplo-innovation/project-precog/clearml_config
```
Is there a way to check if the port is accessible from my local machine?
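For reference, a minimal way to check this from Python (just a sketch; the host and port here are assumptions, matching the telnet test further down):
```python
import socket

# Assumed values: the agent machine's IP and the clearml-session SSH port
HOST, PORT = "192.168.1.55", 10022

try:
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        # An SSH daemon replies with a banner like "SSH-2.0-..."
        banner = sock.recv(64).decode(errors="replace").strip()
        print(f"Port open, server banner: {banner}")
except OSError as exc:
    print(f"Port unreachable: {exc}")
```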
Ok, that makes sense. But it’s running in docker mode, and it is trying to SSH into the host machine and failing.
But what should I do? It does not work; it says “incorrect password”, as you can see.
The task log is here:
The log on my local machine is here:
The agent is started with this command:
```
clearml-agent --debug daemon --queue gpu --gpus 0 --foreground --docker <gitlab org registry>/project-precog/clearml_config
```
All ports are open (both the agent machine and the client machine are on the same VPN).
The agent is started from a non-root user, if that matters.
I mean, if I enter my host machine’s SSH password, it works. But we will disable password auth in the future, so it’s not an option.
Btw, it seems the docker container runs with network=host.
I can telnet the port from my Mac:
```
(base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C
```
Sure, will send in a few min when it executes
AgitatedDove14
made a new one:
https://pastebin.com/LxLFk7py
I also use TensorBoard.
I solved the issue by implementing my own ClearML logger
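Roughly, the logger is a thin wrapper around ClearML’s standard scalar-reporting API; a minimal sketch (class and method names here are illustrative, not the exact code):
```python
from clearml import Task

class MyClearMLLogger:
    """Illustrative wrapper that reports scalar metrics straight to ClearML."""

    def __init__(self, project: str, task_name: str):
        self.task = Task.init(project_name=project, task_name=task_name)
        self.logger = self.task.get_logger()

    def log_scalar(self, title: str, series: str, value: float, step: int):
        # report_scalar is ClearML's standard API for scalar metrics
        self.logger.report_scalar(title=title, series=series, value=value, iteration=step)

# Usage:
# logger = MyClearMLLogger("project-precog", "train")
# logger.log_scalar("loss", "train", 0.42, step=1)
```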
The issue was that nvidia-docker2 was not installed on the machine where I was trying to run the agent. Following this guide fixed it:
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker
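A quick way to verify that fix (a sketch; the CUDA image tag is just an example, any CUDA base image should do):
```python
import subprocess

# If the NVIDIA Container Toolkit is set up correctly, nvidia-smi inside a
# CUDA container should print the same table as on the host.
subprocess.run(
    ["docker", "run", "--rm", "--gpus", "all",
     "nvidia/cuda:11.7.1-base-ubuntu22.04", "nvidia-smi"],
    check=True,
)
```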
@AgitatedDove14 thanks!
I am doing clearml-agent --docker … --foreground --gpus 1
"realmodelonly.pkl"
should be the full path, or just the file name?
I resolved the issues by making my own docker image and making all envs the same:
- the env that runs clearml-agent
- the docker env that tasks run in
- the env that requests task execution (my client)
Despite having manually installed this torch version, during task execution the agent still tries to install it somehow and fails:
```
INFO:clearml_agent.commands.worker:Downloading "…" to pip cache
Collecting torch==1.12.1+cu116
File was already downloaded /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Successfully downloaded torch
INFO:clearml_agent.commands.worker:Downloading "…" to pip cache
Collecting torchvision==0.13.1+cu116
File was...
```
I understand the idea, and it makes sense. But it does not seem to work as intended. Why does it try to install a different PyTorch? And why does the install fail, when it works if I do it manually? The env that’s executing the task has the same PyTorch.
Anyway, what should I do? So far my workers have not executed a single task; every run breaks with these env errors.
I don’t understand. The current CUDA version is 11.7. The installed PyTorch version is 1.12.1. Torch can access the GPUs; all is fine.
Why does it try to install a different torch version?
```
(base) boris@adamastor:~$ nvidia-smi
Fri Oct  7 14:16:24 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name ...
```
Also manually installing this torch version succeeds:
```
(base) boris@adamastor:~$ python3.10 -m pip install /home/boris/.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Processing ./.clearml/pip-download-cache/cu117/torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl
Requirement already satisfied: typing-extensions in ./miniconda3/lib/python3.10/site-packages (from torch==1.12.1+cu116) (4.3.0)
Installing collected packages: torch
  Attempting uninstall: torch
...
```
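A possible workaround (a sketch, not something verified here): pin the exact torch builds in the task’s requirements before Task.init in the client script, so the agent resolves the same wheels instead of picking its own. Project/task names below are placeholders:
```python
from clearml import Task

# Pin the exact builds the agent should install (versions taken from the
# wheels already in the pip download cache above)
Task.add_requirements("torch", "1.12.1+cu116")
Task.add_requirements("torchvision", "0.13.1+cu116")

task = Task.init(project_name="project-precog", task_name="training")
```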
Was I right to put the credentials in clearml.conf on the machine I am starting the agent on? My clearml.conf is like this:
```
...stuff..
agent {
    git_user: "btseytlin"
    git_pass: "gitlab accesstoken"
}
```
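If I remember correctly, clearml-agent can also pick the credentials up from the CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS environment variables, which would avoid keeping the token in the file.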