So I guess the container can't access the ClearML API because of localhost?
Is there a way to debug what is happening?
Why does my task execution freeze after pip installation (running agent in foreground mode)?
Hi AdventurousButterfly15
Are you running the agent in docker mode or venv mode?
What do you mean by freeze? Do you see anything in the Task console log in the UI? What's the host OS?
Definitely not, the machine has 5 TB and is a recent clean install
Freezing means that after the pip packages installation, pictured in the screenshot, nothing happens. This screen hangs forever. No other output anywhere, including the web UI
The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
So the only process is something called /usr/local/bin/python3.10 -u -m clearml_agent execute.
So I guess pip install finished working
But the task is evidently not being executed.
If the same happens in venv mode, check whether the pip process actually finished (you can find it with ps -Af | grep pip).
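A rough Python equivalent of the ps -Af | grep pip check, in case it is easier to script than to run by hand. This is only a sketch: the helper names find_processes and pip_still_running are made up for illustration, and the match is a plain substring test, so it is as loose as grep.

```python
import subprocess
from typing import List


def find_processes(ps_output: str, needle: str) -> List[str]:
    """Return the rows of `ps -Af` output whose text contains `needle`."""
    lines = ps_output.splitlines()
    # The first row is the column header (UID PID PPID ...), so skip it.
    return [line for line in lines[1:] if needle in line]


def pip_still_running() -> bool:
    """Run `ps -Af` and report whether any row mentions pip."""
    result = subprocess.run(
        ["ps", "-Af"], capture_output=True, text=True, check=True
    )
    return bool(find_processes(result.stdout, "pip"))
```

Calling pip_still_running() inside the container (or on the host in venv mode) returns False once no process mentioning pip is left, which suggests the hang is somewhere after the install step.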
The agent is running in docker mode. The host OS is Ubuntu
I tried it.
This time the agent was run with the docker image python ( https://hub.docker.com/_/python )
Gets stuck on:
Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
ps aux inside the container reads:
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c30e:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 53528 48212 pts/0 Ss+ 10:41 0:00 /usr/local/bin/python3.10 -u -m clearml_agent execute --disable-monitoring --id 4e1c0da367774a1087505c9d71e3f0a0
root 353 1.5 0.0 6056 3832 pts/1 Ss 10:44 0:00 bash
root 359 0.0 0.0 8652 3216 pts/1 R+ 10:44 0:00 ps aux
Good idea. I can just ssh into the container of task execution, right?
AdventurousButterfly15 The fact that it tries to ping localhost means you are running the ClearML server locally, right? In that case, it is a docker thing: the container cannot access localhost because localhost inside a docker container is not the same as localhost on your machine itself. They're isolated.
That said, adding --network=host to the docker command usually fixes this by connecting the container to the host's network instead of the internal docker one.
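One way to confirm this from inside the container is a plain TCP probe against the API server address. The sketch below uses only the standard library; can_connect is a made-up helper name, and port 8008 in the usage note is assumed to be the API server port of a default ClearML server setup, so adjust it to your deployment.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

For example, running can_connect("localhost", 8008) in a Python shell inside the task container and again on the host should give different answers if the container's localhost really is isolated from the machine running the server.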
You can add a custom argument either in the web UI after you clone the task (see screenshot) or by using set_base_docker: https://clear.ml/docs/latest/docs/references/sdk/task#set_base_docker
So normally task.set_base_docker(docker_arguments='--network=host') should work. Fingers crossed 🤞
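As a small sketch of how that call could be wrapped in the training script: use_host_network is a hypothetical helper, not part of clearml, and it assumes a clearml version whose Task.set_base_docker accepts the docker_arguments keyword, as in the call quoted above.

```python
def use_host_network(task):
    """Ask the agent to start this task's container with --network=host.

    `task` is expected to be a clearml Task; the only method used is
    set_base_docker, with the same keyword argument as in the message above.
    """
    task.set_base_docker(docker_arguments="--network=host")
    return task


# Assumed usage in the script that gets enqueued (project and task names
# are placeholders):
#   from clearml import Task
#   task = Task.init(project_name="examples", task_name="host network test")
#   use_host_network(task)
```

The setting is stored on the task, so it takes effect the next time an agent in docker mode picks the task up.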
(I don't have a local server available now, but if you want I can try to recreate later)
(But in venv mode it also hangs the same way)
Hmm, this is strange. Could it be that you are running out of storage?
AdventurousButterfly15 this one is quite self-contained:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
So I guess pip install finished working
But the task is evidently not being executed.
This is very odd... You can run the agent with --debug --foreground to see all the outputs and logs
AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task
I guess this pip package installation happens as part of docker build
This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂
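For anyone hitting the same thing, here is a rough way to spot the problem in clearml.conf before starting the agent. It is a sketch only: it scans the file text with a regular expression instead of parsing the config properly, loopback_servers is a made-up name, and it assumes the server addresses live in keys ending in _server (api_server, web_server, files_server).

```python
import re
from typing import List, Tuple
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}

# Matches lines such as:  api_server: http://localhost:8008
# (optionally quoted, with ':' or '=' as the separator).
SERVER_LINE = re.compile(
    r'^\s*"?(\w+_server)"?\s*[:=]\s*"?([^"\s,]+)"?', re.MULTILINE
)


def loopback_servers(conf_text: str) -> List[Tuple[str, str]]:
    """Return (key, url) pairs whose URL points at the local machine."""
    hits = []
    for key, url in SERVER_LINE.findall(conf_text):
        if urlparse(url).hostname in LOOPBACK_HOSTS:
            hits.append((key, url))
    return hits
```

Reading the file with open(...).read() and passing the text to loopback_servers returns an empty list once every server entry uses a hostname that is reachable from inside the container.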
AgitatedDove14 With --debug I see that after installing packages there is an endless stream of this:
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842b42b0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
Is there some minimal example of a docker env agent I can run, just to see that it works?