Answered
Why Does My Task Execution Freeze After Pip Installation (Running Agent In Foreground Mode)?

Why does my task execution freeze after pip installation (running agent in foreground mode)?

The task is:
` from clearml import Task
task = Task.init(project_name='Adhoc', task_name='GPU test')
task.execute_remotely(queue_name="test")

import torch

if torch.cuda.is_available():
    # .cuda() returns a copy rather than moving the tensor in place, so keep the result
    a = torch.randn(3, 5).cuda()
    b = torch.randn(3, 5).cuda()
    print(a + b)
else:
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)
    print(a + b) `

  
  
Posted 2 years ago

Answers 22



Hi AdventurousButterfly15
Are you running the agent in docker mode or venv mode?
What do you mean by freeze? Do you see anything in the Task console log in the UI? What's the host OS?

  
  
Posted 2 years ago

Freezing means that after the pip package installation shown in the screenshot, nothing happens. The screen hangs forever, with no other output anywhere, including the web UI.

  
  
Posted 2 years ago

(But in venv mode it also hangs the same way)

Hmm, this is strange. Could it be that you are running out of storage?

  
  
Posted 2 years ago

I guess this pip package installation happens as part of docker build

  
  
Posted 2 years ago

The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel

  
  
Posted 2 years ago

The agent is running in docker mode. The host OS is Ubuntu.

  
  
Posted 2 years ago

(But in venv mode it also hangs the same way)

  
  
Posted 2 years ago

I tried it.
This time the agent was run with the docker image python ( https://hub.docker.com/_/python )

Gets stuck on
Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
ps aux inside the container reads
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c30e:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 53528 48212 pts/0 Ss+ 10:41 0:00 /usr/local/bin/python3.10 -u -m clearml_agent execute --disable-monitoring --id 4e1c0da367774a1087505c9d71e3f0a0
root 353 1.5 0.0 6056 3832 pts/1 Ss 10:44 0:00 bash
root 359 0.0 0.0 8652 3216 pts/1 R+ 10:44 0:00 ps aux

  
  
Posted 2 years ago

So the only process is something called /usr/local/bin/python3.10 -u -m clearml_agent execute.
So I guess pip install finished working
But the task is evidently not being executed.

  
  
Posted 2 years ago

If the same happens in venv mode, check whether the pip process actually finished (you can find it with ps -Af | grep pip).

  
  
Posted 2 years ago

Good idea. I can just ssh into the container of task execution, right?

  
  
Posted 2 years ago

Definitely not, the machine has 5 TB and is a recent clean install.

  
  
Posted 2 years ago

Is there a way to debug what is happening?

  
  
Posted 2 years ago

Yep 🙂

  
  
Posted 2 years ago

AgitatedDove14 With --debug I see that after installing packages there is an endless stream of this:
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842b42b0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
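
A repeated "[Errno 111] Connection refused" on /auth.login suggests that nothing is listening at the address the failing process is trying to reach. One way to confirm this is a plain TCP probe, run from the same place as that process (for example inside the task container). The following is only a minimal sketch using the Python standard library; the function name can_connect is made up for illustration and is not part of ClearML.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False
```

For example, can_connect("localhost", 8008) probes the port that a default ClearML server setup uses for its API (an assumption here; use whatever host and port api_server points to in your clearml.conf). If it returns False inside the container but True on the host, the server address is simply not reachable from the container.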

  
  
Posted 2 years ago

AdventurousButterfly15 this one is quite self-contained:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py

So I guess pip install finished working
But the task is evidently not being executed.

This is very odd... You can run the agent in debug mode with --debug --foreground to see all the outputs and logs.

  
  
Posted 2 years ago

Is there some minimal example of a docker env agent I can run, just to see that it works?

  
  
Posted 2 years ago

AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task

  
  
Posted 2 years ago

So I guess the container can't access the ClearML API because of localhost?

  
  
Posted 2 years ago

AdventurousButterfly15 The fact that it tries to reach localhost means you are running the ClearML server locally, right? In that case, it is a docker thing: the container cannot reach the server via localhost, because localhost inside a docker container is not the same as localhost on your machine. They're isolated.

That said, adding --network=host to the docker command usually fixes this by letting the container share the host's network stack instead of using the internal docker network.

You can add a custom argument either in the webui after you clone (see screenshot) or by using set_base_docker https://clear.ml/docs/latest/docs/references/sdk/task#set_base_docker

So normally task.set_base_docker(docker_arguments='--network=host') should work. Fingers crossed 🤞

(I don't have a local server available now, but if you want I can try to recreate later)

  
  
Posted 2 years ago

This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂
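
For anyone hitting the same thing, here is a rough sketch of how one might spot this misconfiguration up front: it checks whether a server URL points at the loopback interface, which a container on the default docker network cannot use to reach the host. The function name points_to_loopback and the example URLs are hypothetical, and the URL is expected to include a scheme such as http://.

```python
import ipaddress
from urllib.parse import urlparse


def points_to_loopback(url: str) -> bool:
    """Return True if the URL's host is localhost or a loopback IP address."""
    host = urlparse(url).hostname
    if host is None:
        # No host could be parsed (e.g. the scheme is missing)
        return False
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so treat it as a regular hostname
        return False
```

With this, points_to_loopback("http://localhost:8008") is True while points_to_loopback("http://clearml.example.com:8008") is False. Any server URL in clearml.conf that comes back True is a candidate for replacement with a hostname the container can actually resolve and reach.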

  
  
Posted 2 years ago

Yes it seems so 😞

  
  
Posted 2 years ago