All ports are open (both the agent machine and the client machine are within the same VPN)
The agent is started by a non-root user, if that matters
Is there a way to check if the port is accessible from my local machine?
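One way to check this without extra tools is a plain TCP connect from the local machine. Below is a minimal sketch using only the Python standard library; the function name `port_is_open` is hypothetical, and the address in the usage comment is the one from the telnet example in this thread.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and unresolvable hosts
        return False


# Usage (replace with your own agent host and port):
# port_is_open("192.168.1.55", 10022)
```

A successful connect only shows that the port is reachable over TCP; it says nothing about whether SSH authentication will succeed afterwards.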
I also use TB.
I solved the issue by implementing my own ClearML logger
Here’s the agent config. It’s basically default
https://justpaste.it/4ozm3
Yes, I created a token and put it into agent.git_pass
The agent is started with this command: `clearml-agent --debug daemon --queue gpu --gpus 0 --foreground --docker <gitlab org registry>/project-precog/clearml_config`
I can telnet the port from my Mac:
` (base) *[main][~/Documents/plant_age]$ telnet 192.168.1.55 10022
Trying 192.168.1.55...
Connected to 192.168.1.55.
Escape character is '^]'.
SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
^C `
Is there some minimal example of a docker env agent I can run, just to see that it works?
So I guess the container can't access the ClearML API because of localhost?
Freezing means that after the pip package installation, shown in the screenshot, nothing happens. This screen hangs forever. There is no other output anywhere, including the web UI
When trying it I realized that my local clearml.conf still had the old hostnames (adamastor.gaiavf.local). Now your script returns the proper value of http://adamastor-office.periploinnovation.com:8081 . I will see if it works now!
Ok, it makes sense. But it’s running in docker mode and it is trying to ssh into the host machine and failing
But what should I do? It does not work, it says incorrect password as you can see
Let me get the exact error for you
CostlyOstrich36, in the installed packages it has:
` # Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:41:22) [Clang 13.0.1 ]
Pillow == 9.2.0
clearml == 1.7.1
minio == 7.1.12
numpy == 1.23.1
pandas == 1.5.0
scikit_learn == 1.1.2
tensorboard == 2.10.1
torch == 1.12.1
torchvision == 0.13.1
tqdm == 4.64.1 `
This is the same as what I have locally and on the server that runs clearml-agent
So the only process is something called /usr/local/bin/python3.10 -u -m clearml_agent execute.
So I guess pip install finished working
But the task is evidently not being executed.
I mean, if I enter my host machine's ssh password it works. But we will disable password auth in the future, so it’s not an option
What I am seeing is that the agent always fails while trying to install some packages, even though I am not asking it to install anything
It's too much of a hack :)
This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂
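The reason this matters is that localhost always resolves to whichever machine or container the code is running on, which is not necessarily the ClearML server. Here is a minimal sketch for spotting such URLs before putting them into clearml.conf; the helper name `points_to_loopback` is hypothetical, and `http://localhost:8008` is only an illustrative value.

```python
import ipaddress
from urllib.parse import urlparse


def points_to_loopback(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname  # expects a full URL including the scheme
    if host is None:
        return False
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # not an IP literal, so it is a regular hostname
        return False


# Usage with an illustrative value and the hostname from this thread:
# points_to_loopback("http://localhost:8008")
# points_to_loopback("http://adamastor-office.periploinnovation.com:8081")
```

Any server URL in the config for which this returns True is worth replacing with a hostname that the agent machine can actually reach.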
Here’s the error I get:
https://justpaste.it/7aom5
It’s trying to downgrade pytorch to 1.12.1 for some reason (why?), using a build for an outdated CUDA (I have 11.7, it tries to use pytorch for CUDA 11.6). Finally it crashes
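PyTorch pip wheels usually encode their CUDA build as a local version suffix such as `+cu116`, which makes this kind of mismatch easy to spot in a log or a requirements list. Below is a minimal sketch for reading that suffix; the helper name `cuda_version_from_wheel` is hypothetical, and the version strings are examples rather than values taken from the error log.

```python
import re
from typing import Optional


def cuda_version_from_wheel(version: str) -> Optional[str]:
    """Extract e.g. '11.6' from a version string such as '1.12.1+cu116'."""
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None  # CPU-only build or no CUDA suffix present
    digits = match.group(1)
    # The last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


# Usage with example version strings:
# cuda_version_from_wheel("1.12.1+cu116")
# cuda_version_from_wheel("1.12.1")
```

On a machine where torch is installed, `torch.__version__` and `torch.version.cuda` report the same information for the installed build.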
Btw it seems the docker runs in network=host
Is there a way to debug what is happening?