Why Does My Task Execution Freeze After Pip Installation (Running Agent In Foreground Mode)?

Answered

Why does my task execution freeze after pip installation (running agent in foreground mode)?

The task is:
```python
from clearml import Task

task = Task.init(project_name='Adhoc', task_name='GPU test')
task.execute_remotely(queue_name="test")

import torch

if torch.cuda.is_available():
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)

    # .cuda() returns a copy on the GPU, so keep the result
    a = a.cuda()
    b = b.cuda()
    print(a + b)

else:
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)
    print(a + b)
```

  				
Posted 2 years ago by AdventurousButterfly15


Answers 22

Is there some minimal example of a docker env agent I can run, just to see that it works?

  				
Posted 2 years ago by AdventurousButterfly15

Yep 🙂

  				
Posted 2 years ago by AgitatedDove14

Is there a way to debug what is happening?

  				
Posted 2 years ago by AdventurousButterfly15

AdventurousButterfly15 The fact that it tries to ping localhost means you are running the ClearML server locally, right? In that case, it is a Docker thing: the container cannot reach the server through localhost, because localhost inside a Docker container is not the same as localhost on your machine. They're isolated.

That said, adding --network=host to the docker command usually fixes this: the container then shares the host's network stack instead of using Docker's internal network.

You can add a custom argument either in the web UI after you clone the task (see screenshot) or by using set_base_docker: https://clear.ml/docs/latest/docs/references/sdk/task#set_base_docker

So normally task.set_base_docker(docker_arguments='--network=host') should work. Fingers crossed 🤞

(I don't have a local server available now, but if you want I can try to recreate this later.)

  				
Posted 2 years ago by ExasperatedCrab78
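The suggestion above can be sketched roughly as follows. This is only a sketch under assumptions, not a verified fix: it assumes a ClearML SDK version whose `Task.set_base_docker` accepts `docker_arguments` (as in the call quoted in the answer), and `with_host_network` and `run_remotely_with_host_network` are hypothetical helper names introduced here, not part of ClearML. The project, task, and queue names are the ones from the question.

```python
import shlex


def with_host_network(docker_arguments: str = "") -> str:
    """Return the docker argument string with --network=host added once.

    Hypothetical helper (not a ClearML API): it only edits the string.
    """
    args = shlex.split(docker_arguments)
    has_joined_form = "--network=host" in args
    has_split_form = any(
        arg == "--network" and args[i + 1 : i + 2] == ["host"]
        for i, arg in enumerate(args)
    )
    if not (has_joined_form or has_split_form):
        args.append("--network=host")
    return shlex.join(args)


def run_remotely_with_host_network(queue_name: str = "test") -> None:
    """Needs clearml installed and configured; not called automatically."""
    # Imported lazily so the helper above can be used without clearml.
    from clearml import Task

    task = Task.init(project_name="Adhoc", task_name="GPU test")
    # Assumption: this SDK version accepts docker_arguments, as quoted above.
    task.set_base_docker(docker_arguments=with_host_network())
    # Set the docker arguments before handing the task over to the agent.
    task.execute_remotely(queue_name=queue_name)
```

Calling `run_remotely_with_host_network()` from the training script would replace the `Task.init` and `execute_remotely` lines in the question. The same flag can instead be typed into the container arguments field in the web UI, as the answer mentions.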

AdventurousButterfly15 this one is quite self-contained:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py

So I guess pip install finished working
But the task is evidently not being executed.

This is very odd... You can run the agent with --debug --foreground to see all the output and logs.

  				
Posted 2 years ago by AgitatedDove14

I guess this pip package installation happens as part of the docker build.

  				
Posted 2 years ago by AdventurousButterfly15

AgitatedDove14 With --debug I see that after installing packages there is an endless stream of this:
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842b42b0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost

  				
Posted 2 years ago by AdventurousButterfly15
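The `Connection refused` retries on `/auth.login` suggest that some process cannot open a TCP connection to the server address it was given. One way to narrow this down is to run a small probe from inside the container (for example after `docker exec` into it) and again on the host, then compare the results. The sketch below uses only the standard library; `can_connect` is a hypothetical helper name, and the URL in the usage note is an assumption that should be replaced with the `api_server` value from your own clearml.conf.

```python
import socket
from urllib.parse import urlparse


def can_connect(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds.

    This checks reachability only; it does not send an HTTP request.
    """
    parsed = urlparse(url)
    if parsed.hostname is None:
        raise ValueError(f"no host found in URL: {url!r}")
    # Fall back to the scheme's usual port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_connect("http://localhost:8008")` (assuming the API server listens on port 8008) returning `True` on the host but `False` inside the container would be consistent with the localhost isolation discussed in this thread.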

I tried it.
This time the agent was run with the python Docker image ( https://hub.docker.com/_/python )

It gets stuck on:
Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
ps aux inside the container reads:
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c30e:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 53528 48212 pts/0 Ss+ 10:41 0:00 /usr/local/bin/python3.10 -u -m clearml_agent execute --disable-monitoring --id 4e1c0da367774a1087505c9d71e3f0a0
root 353 1.5 0.0 6056 3832 pts/1 Ss 10:44 0:00 bash
root 359 0.0 0.0 8652 3216 pts/1 R+ 10:44 0:00 ps aux

  				
Posted 2 years ago by AdventurousButterfly15

So I guess the container can't access the ClearML API because of localhost?

  				
Posted 2 years ago by AdventurousButterfly15

So the only process is /usr/local/bin/python3.10 -u -m clearml_agent execute.
So I guess pip install finished working
But the task is evidently not being executed.

  				
Posted 2 years ago by AdventurousButterfly15

This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂

  				
Posted 2 years ago by AdventurousButterfly15
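Since the fix was to stop pointing clearml.conf at localhost, a quick scan of the file for loopback addresses can catch the same problem early. The following is a minimal sketch, assuming the server URLs appear as plain `http://` or `https://` values in the file; it uses a regular expression rather than a real HOCON parser, and `loopback_urls`, the sample text, and the hostname `clearml.example.com` are all hypothetical.

```python
import re
from typing import List
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}
# Matches a URL up to the next whitespace, quote, comma, or closing brace.
URL_PATTERN = re.compile(r"""https?://[^\s"',}]+""")


def loopback_urls(conf_text: str) -> List[str]:
    """Return every URL in the config text whose host is a loopback address."""
    found = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Skip whole-line comments.
        if stripped.startswith("#") or stripped.startswith("//"):
            continue
        for url in URL_PATTERN.findall(stripped):
            if urlparse(url).hostname in LOOPBACK_HOSTS:
                found.append(url)
    return found


# Hypothetical sample resembling the api section of a clearml.conf.
SAMPLE_CONF = """
api {
    # api_server: http://localhost:9999
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: "http://clearml.example.com:8081"
}
"""

print(loopback_urls(SAMPLE_CONF))
```

Any URL this reports is one that a container without host networking would resolve to itself rather than to the machine running the server.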

If the same happens in venv mode, check whether the pip process actually finished (you can find it with ps -Af | grep pip).

  				
Posted 2 years ago by AgitatedDove14
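The `ps -Af | grep pip` check can also be scripted, which is handy when polling a remote machine. This is a minimal sketch, assuming a `ps` that accepts `-Af` as in the command above; `pip_lines` and `running_pip_processes` are hypothetical helper names. Note that it matches `pip` as a whole word (optionally followed by a version suffix such as `pip3`), so it is slightly stricter than a plain `grep pip`.

```python
import re
import subprocess
from typing import List

# "pip", "pip3", "pip3.10", but not "pipenv" or "pipeline".
PIP_WORD = re.compile(r"\bpip[0-9.]*\b")


def pip_lines(ps_output: str) -> List[str]:
    """Return the lines of ps output that mention pip as a separate word."""
    return [line for line in ps_output.splitlines() if PIP_WORD.search(line)]


def running_pip_processes() -> List[str]:
    """Run ps -Af and return the lines that look like pip processes."""
    result = subprocess.run(
        ["ps", "-Af"], capture_output=True, text=True, check=True
    )
    return pip_lines(result.stdout)
```

An empty list from `running_pip_processes()` while the agent still appears stuck would suggest, as in this thread, that the installation finished and the hang is somewhere else.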

(But in venv mode it also hangs the same way)

Hmm, this is strange. Could it be that you are running out of storage?

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task.

  				
Posted 2 years ago by AdventurousButterfly15

Good idea. I can just ssh into the container that executes the task, right?

  				
Posted 2 years ago by AdventurousButterfly15

Definitely not, the machine has 5 TB and is a recent clean install.

  				
Posted 2 years ago by AdventurousButterfly15

Freezing means that after the pip package installation, pictured in the screenshot, nothing happens. This screen hangs forever. There is no other output anywhere, including the web UI.

  				
Posted 2 years ago by AdventurousButterfly15

The agent is running in docker mode. The host OS is Ubuntu.

  				
Posted 2 years ago by AdventurousButterfly15

Why does my task execution freeze after pip installation (running agent in foreground mode)?

Hi AdventurousButterfly15
Are you running the agent in docker mode or venv mode?
What do you mean by freeze? Do you see anything in the Task console log in the UI? What's the host OS?

  				
Posted 2 years ago by AgitatedDove14

The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel.

  				
Posted 2 years ago by AdventurousButterfly15

(But in venv mode it also hangs the same way)

  				
Posted 2 years ago by AdventurousButterfly15

Yes it seems so 😞

  				
Posted 2 years ago by AgitatedDove14
