Answered
Why Does My Task Execution Freeze After Pip Installation (Running Agent In Foreground Mode)?

Why does my task execution freeze after pip installation (running agent in foreground mode)?

The task is:
```
from clearml import Task

task = Task.init(project_name='Adhoc', task_name='GPU test')
# Enqueue the task for an agent listening on the "test" queue
task.execute_remotely(queue_name="test")

import torch

if torch.cuda.is_available():
    # Tensor.cuda() returns a GPU copy, so the result has to be assigned
    a = torch.randn(3, 5).cuda()
    b = torch.randn(3, 5).cuda()
    print(a + b)
else:
    a = torch.randn(3, 5)
    b = torch.randn(3, 5)
    print(a + b)
```

  
  
Posted 2 years ago

Answers 22


So I guess the container can't access the ClearML API because of localhost?

  
  
Posted 2 years ago

Is there a way to debug what is happening?

  
  
Posted 2 years ago

Yep 🙂

  
  
Posted 2 years ago

Why does my task execution freeze after pip installation (running agent in foreground mode)?

Hi AdventurousButterfly15
Are you running the agent in docker mode or venv mode?
What do you mean by freeze? Do you see anything in the Task console log in the UI? What's the host OS?

  
  
Posted 2 years ago

Definitely not, the machine has 5 TB and is a recent clean install

  
  
Posted 2 years ago

Freezing means that after the pip package installation shown in the screenshot, nothing happens. The screen hangs forever, with no other output anywhere, including the web UI

  
  
Posted 2 years ago

The image I am using is pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel

  
  
Posted 2 years ago

So the only process is something called /usr/local/bin/python3.10 -u -m clearml_agent execute.
So I guess pip install finished working
But the task is evidently not being executed.

  
  
Posted 2 years ago

If the same happens in venv mode, check whether the pip process actually finished (you can look for it with ps -Af | grep pip)

  
  
Posted 2 years ago

Yes it seems so 😞

  
  
Posted 2 years ago

The agent is running in docker mode. The host OS is Ubuntu

  
  
Posted 2 years ago

I tried it.
This time agent was run with docker image python ( https://hub.docker.com/_/python )

Gets stuck on
Installing collected packages: six, python-dateutil, pathlib2, psutil, attrs, pyrsistent, jsonschema, idna, chardet, certifi, urllib3, requests, PyYAML, pyparsing, pyjwt, pyhocon, orderedmultidict, furl, future, platformdirs, filelock, distlib, virtualenv, clearml-agent
ps aux inside the container reads
(base) boris@adamastor:~$ docker exec -it angry_edison bash
root@041c0736c30e:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.3 0.0 53528 48212 pts/0 Ss+ 10:41 0:00 /usr/local/bin/python3.10 -u -m clearml_agent execute --disable-monitoring --id 4e1c0da367774a1087505c9d71e3f0a0
root 353 1.5 0.0 6056 3832 pts/1 Ss 10:44 0:00 bash
root 359 0.0 0.0 8652 3216 pts/1 R+ 10:44 0:00 ps aux

  
  
Posted 2 years ago

Good idea. I can just ssh into the container of task execution, right?

  
  
Posted 2 years ago

AdventurousButterfly15 The fact that it tries to reach localhost means you are running the ClearML server locally, right? In that case, it is a Docker thing: the container cannot reach the server because localhost inside a Docker container refers to the container itself, not to your machine. They're isolated.

That said, adding --network=host to the docker command usually fixes this by making the container share the host's network stack instead of using the internal Docker network.

You can add a custom argument either in the web UI after you clone the task (see screenshot) or by using set_base_docker https://clear.ml/docs/latest/docs/references/sdk/task#set_base_docker

So normally task.set_base_docker(docker_arguments='--network=host') should work. Fingers crossed 🤞
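For reference, here is a rough sketch of where that call could sit in the script from the question. The helper name use_host_network is made up for illustration, and it assumes set_base_docker accepts docker_arguments on its own, as in the call quoted above. The call presumably has to come before execute_remotely, since that is the point where the task is handed off to the agent.

```python
def use_host_network(task):
    """Ask the agent to start this task's container on the host network.

    `task` is expected to be a clearml Task; the helper name itself is
    hypothetical and only wraps the call suggested in this thread.
    """
    # Same call as suggested above: pass --network=host to `docker run`
    task.set_base_docker(docker_arguments="--network=host")
    return task


# Intended use in the original script (needs clearml and a reachable server):
#   from clearml import Task
#   task = Task.init(project_name='Adhoc', task_name='GPU test')
#   use_host_network(task)                     # before handing off to the agent
#   task.execute_remotely(queue_name="test")
```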

(I don't have a local server available now, but if you want I can try to recreate later)

  
  
Posted 2 years ago

(But in venv mode it also hangs the same way)

Hmm, this is strange. Could it be that you are running out of storage?

  
  
Posted 2 years ago

AdventurousButterfly15 this one is quite self-contained:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py

So I guess pip install finished working
But the task is evidently not being executed.

This is very odd ... you can run the agent with debugging with --debug --foreground to see all the outputs and logs

  
  
Posted 2 years ago

(But in venv mode it also hangs the same way)

  
  
Posted 2 years ago

AgitatedDove14 This example does not specify how to start a clearml-agent with docker such that it actually executes the task

  
  
Posted 2 years ago

I guess this pip package installation happens as part of docker build

  
  
Posted 2 years ago

This issue was resolved by setting the correct clearml.conf (replacing localhost with a public hostname for the server) 🙂
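For anyone hitting the same thing, a small sanity check along these lines might help spot the problem before starting the agent. It is only a sketch using the Python standard library: the function name points_at_loopback is made up, and the URLs in the comments are placeholders, not values from this thread. It simply reports whether a server URL resolves to the loopback interface, which a container on Docker's default network would not be able to use to reach the host.

```python
import ipaddress
from urllib.parse import urlsplit


def points_at_loopback(url):
    """Return True if the URL's host is localhost or a loopback IP address.

    The URL must include a scheme (for example http://), otherwise
    urlsplit cannot identify the host and the function returns False.
    """
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host.lower() == "localhost":
        return True
    try:
        # Covers 127.0.0.0/8 and ::1
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is some other hostname
        return False


# Placeholder URLs, for illustration only:
#   points_at_loopback("http://localhost:8008")            -> True
#   points_at_loopback("http://clearml.example.com:8008")  -> False
```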

  
  
Posted 2 years ago

AgitatedDove14 With --debug I see that after installing packages there is an endless stream of this:
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842e8be0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fac842b42b0>: Failed to establish a new connection: [Errno 111] Connection refused',)': /auth.login
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "POST /events.add_batch HTTP/1.1" 200 315
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
DEBUG:urllib3.connectionpool: "GET /v2.14/tasks.get_all HTTP/1.1" 200 344
DEBUG:urllib3.connectionpool:Resetting dropped connection: localhost
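The Connection refused retries on /auth.login suggest that whatever is making those requests cannot open a TCP connection to the configured server address. One way to confirm this from inside the container might be a quick check like the one below. It is a sketch using only the standard library; can_connect is a made-up helper name, and the host and port you pass in are whatever your own clearml.conf points at.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures
        return False
```

Running it once on the host and once inside the task container (for example through docker exec) with the same host and port should show whether the two see the server differently.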

  
  
Posted 2 years ago

Is there some minimal example of a docker env agent I can run, just to see that it works?

  
  
Posted 2 years ago