AgitatedDove14 thanks!
The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
So the only process is something called /usr/local/bin/python3.10 -u -m clearml_agent execute .
So I guess pip install finished working
But the task is evidently not being executed.
Agent is running in docker mode. The host OS is Ubuntu
Definitely not, the machine has 5 TB and is a recent clean install
(agent) adamastor@adamastor:~/clearml_agent$ python -c "import torch; print(torch.__version__)"
1.12.1
Was I right to put the credentials in clearml.conf on the machine I am starting the agent on?
clearml.conf is like this:
...stuff..
agent {
    git_user: "btseytlin"
    git_pass: "gitlab accesstoken"
}
The failure is that it does not even run
Yes, I created a token and put it into agent.git_pass
I mean if I enter my host machine ssh password it works. But we will disable password auth in future, so it’s not an option
Is there a way to debug what is happening?
(But in venv mode it also hangs the same way)
I guess this pip package installation happens as part of docker build
So I guess the container can't access the ClearML API because of localhost?
AgitatedDove14 With --debug I see that after installing packages there is an endless stream of this:
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnec...
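The repeated `Connection refused` on `/auth.login` suggests that nothing is listening at the API server address as seen from where the task runs. A minimal Python sketch for checking that from inside the container follows. The helper name `can_connect` is hypothetical, and the host and port in the usage line are placeholders to be replaced with the values from your own clearml.conf.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder values (assumption): use the api_server host and port
    # that your clearml.conf actually points to.
    print(can_connect("localhost", 8008))
```

If this prints `False` inside the container but `True` on the host, the address is probably only valid on the host.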
For a hacky way you can do docker ps and see the docker run command. I believe it contains the task id, so you can grep by task id
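A rough Python sketch of that `docker ps` plus grep idea, assuming the task id really does appear in the container's command line (that part is a guess, as said above). The function names are hypothetical; `--no-trunc` is used so that long commands are not cut off in the output.

```python
import subprocess


def lines_matching(output: str, needle: str) -> list[str]:
    """Return the lines of `output` that contain `needle`."""
    return [line for line in output.splitlines() if needle in line]


def find_containers(task_id: str) -> list[str]:
    """List `docker ps` rows whose untruncated text mentions task_id."""
    result = subprocess.run(
        ["docker", "ps", "--no-trunc"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker command itself fails
    )
    return lines_matching(result.stdout, task_id)
```

Calling `find_containers("<task id>")` on the agent machine would return the matching rows, or an empty list if no running container mentions that id.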
I resolved the issues by making my own docker image and making all envs the same:
the env that runs clearml-agent, the docker env that tasks run in, and the env that requests task execution (my client)
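One way to compare those three environments is to print the same small fingerprint in each and check the outputs against each other. This is only a sketch: the function name `env_fingerprint` is hypothetical, and the default package list is an assumption chosen because torch and clearml are the packages discussed here.

```python
import sys
from importlib import metadata


def env_fingerprint(packages=("torch", "clearml")):
    """Return the Python version and installed versions of the given packages."""
    info = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = None  # package is not installed in this environment
    return info


if __name__ == "__main__":
    print(env_fingerprint())
```

Running it on the agent host, inside the task container, and on the client makes version mismatches easy to see.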
All ports are open (both agent machine and client machine are working within same VPN)
The task runs in a docker container if that’s relevant
What I am seeing is that the agent always fails trying to install some packages, even though I am not asking it to install anything
It's too much of a hack :)
Well I don’t want that! My local machine is a Mac with no GPU. But I want to execute my code on a server with GPUs. I don’t want my local environment, I want the one configured for the agent!
The issue disappeared after I switched from docker mode to pip mode
Is there some minimal example of a docker env agent I can run, just to see that it works?
This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂
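Inside a docker container, `localhost` normally refers to the container itself rather than the host machine, which would explain why a localhost server address fails there. A small Python sketch for spotting such URLs before starting the agent is below. The function name `points_to_loopback` and the example URLs are hypothetical, and the check expects a full URL including the scheme.

```python
import ipaddress
from urllib.parse import urlparse


def points_to_loopback(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        # No host could be parsed, e.g. the scheme is missing.
        return False
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is an ordinary hostname.
        return False


if __name__ == "__main__":
    # Example URLs are placeholders, not values from the actual setup.
    for url in ("http://localhost:8008", "http://clearml.example.com:8008"):
        print(url, points_to_loopback(url))
```

Any server URL in clearml.conf for which this returns `True` is a candidate for replacement with a hostname that is reachable from the container.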
The task log is here:
the log on my local machine is here:
Yeah, pytorch is a must. This script is a testing one, but after this I need to train stuff on GPUs
But what should I do? It does not work, it says incorrect password as you can see
CostlyOstrich36 CLEARML-AGENT version 1.3.0