Why does my task execution freeze after pip installation (running agent in foreground mode)?
Hi AdventurousButterfly15
Are you running the agent in docker mode or venv mode?
What do you mean by freeze? Do you see anything in the Task console log in the UI? What's the host OS?
Freezing means that after the pip package installation, pictured in the screenshot, nothing happens. This screen hangs forever. There is no other output anywhere, including the web UI.
(But in venv mode it also hangs the same way)
Hmm this is strange, could it be you are running out of storage ?
I guess this pip package installation happens as part of docker build
The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
Agent is running in docker mode. The host OS is ubuntu
I tried it.
This time agent was run with docker image python ( https://hub.docker.com/_/python )
Gets stuck on:
Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
ps aux inside the container reads:
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c30e:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 53528 48212 pts/0 Ss+ 10:41 0:00 /usr/local/bin/python3.10 -u -m clearml_agent execute --disable-monitoring --id 4e1c0da367774a1087505c9d71e3f0a0
root 353 1.5 0.0 6056 3832 pts/1 Ss 10:44 0:00 bash
root 359 0.0 0.0 8652 3216 pts/1 R+ 10:44 0:00 ps aux
So the only process is something called /usr/local/bin/python3.10 -u -m clearml_agent execute.
So I guess pip install finished working
But the task is evidently not being executed.
If the same happens in venv mode, see if the pip process actually finished (you can find it with ps -Af | grep pip)
Good idea. I can just ssh into the container of task execution, right?
Definitely not, the machine has 5 TB and is a recent clean install
Is there a way to debug what is happening?
AgitatedDove14 With --debug I see that after installing packages there is an endless stream of this:
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842b42b0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
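A repeated "Connection refused" on /auth.login suggests that nothing is listening at the address the process resolves as localhost. One way to confirm this is a plain TCP probe run from the same place the task runs (for example inside the container). This is only a sketch: the function name can_connect is made up for illustration, and the port in the usage comment (8008) is an assumption about where the API server listens, so substitute the value from your own clearml.conf.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


# Example (port is an assumption, adjust to your setup):
#   print(can_connect("localhost", 8008))
```

If the probe fails inside the container but succeeds on the host, the server itself is up and the problem is how the container reaches it.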
AdventurousButterfly15 this one is quite self-contained:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
This is very odd ... you can run the agent with debugging with --debug --foreground to see all the outputs and logs
Is there some minimal example of a docker env agent I can run, just to see that it works?
AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task
So I guess the container can't access the ClearML API because of localhost?
AdventurousButterfly15 The fact that it tries to reach localhost means you are running the ClearML server locally, right? In that case it is a Docker thing: the container cannot reach the server through localhost, because localhost inside a Docker container is not the same as localhost on your machine itself. They're isolated.
That said, adding --network=host to the docker command usually fixes this by connecting the container to the host's network instead of the internal Docker one.
You can add a custom argument either in the web UI after you clone (see screenshot) or by using set_base_docker:
https://clear.ml/docs/latest/docs/references/sdk/task#set_base_docker
So normally task.set_base_docker(docker_arguments='--network=host') should work. Fingers crossed 🤞
(I don't have a local server available now, but if you want I can try to recreate later)
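For reference, the set_base_docker suggestion could be wrapped like this. It is a sketch, assuming a clearml version whose Task.set_base_docker accepts docker_arguments (as in the call quoted above). The helper name use_host_network is made up for illustration, and the project and task names in the usage comment are placeholders.

```python
def use_host_network(task):
    """Ask the agent to start this task's container with --network=host.

    `task` is expected to be a clearml Task. Anything exposing
    set_base_docker(docker_arguments=...) works, which keeps this sketch
    importable without clearml installed.
    """
    # Same call as quoted in the thread; only the docker arguments are set.
    task.set_base_docker(docker_arguments="--network=host")
    return task


# Typical use (needs clearml and a reachable server, so left as a comment):
#   from clearml import Task
#   task = Task.init(project_name="examples", task_name="host network test")
#   use_host_network(task)
```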
This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂
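Since the fix was to stop pointing clearml.conf at localhost, a quick sanity check is to test whether a configured server URL resolves to a loopback address. A minimal sketch follows: the function name points_to_loopback and the URLs in the usage comment are hypothetical, and you would read the real values out of your own clearml.conf.

```python
import ipaddress
from urllib.parse import urlparse


def points_to_loopback(url: str) -> bool:
    """Return True if the URL's host is 'localhost' or a loopback IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host.lower() == "localhost":
        return True
    try:
        # Handles literal addresses such as 127.0.0.1 or ::1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is an ordinary hostname.
        return False


# Example with placeholder URLs:
#   for url in ["http://localhost:8008", "http://clearml.example.com:8008"]:
#       print(url, points_to_loopback(url))
```

Any URL for which this returns True will only work from a container that shares the host's network stack.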