Hi All

Answered

Hi All 🙂
I am trying to run a Hyperparameter Optimization Task, where the controller task is submitted to the services queue (and picked up by the default worker, clearml-agent-services, which runs on the same machine as the ClearML Server). The training tasks should then be sent to other queues, where standard clearml-agent workers will pick them up for execution.

The weird thing is: the controller Task starts and is visible in the UI. It installs a few packages and then stays stuck in the Running status forever (the last log can be seen in the screenshot below). I also tried submitting dummy Python tasks that just print("something") and got the same result.
It looks like it gets stuck at the initial Executing: ['docker', 'run', '-t', '-l', 'clearml-worker-id=clearml-services:service:(....long command....).

Interestingly, if I submit the controller task to any other queue with a standard clearml-agent (not in services mode), it is processed normally.

Any idea what I am missing? Thanks 🙏

  				
Posted one year ago by StoutElephant16

Answers 16

After about 8hrs running I finally got clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server None ?

  				
Posted one year ago by StoutElephant16

Good idea!
So, my api server is CLEARML_API_HOST= None and I ran telnet apiserver 8008 and received:

Trying 172.18.0.6...
Connected to apiserver.
Escape character is '^]'.

It seems the container is able to resolve the address and connect.
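The same reachability check can be scripted from inside the container when telnet is not installed. This is only a minimal sketch: the helper name can_connect is made up for illustration, and the host apiserver and port 8008 are simply the values mentioned in this thread. A successful TCP connection only shows that the port is reachable; it does not prove the API server answers requests correctly.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens a TCP socket
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers DNS failures, refused connections and timeouts
        return False


if __name__ == "__main__":
    # "apiserver" and 8008 are the service name and port used in this thread
    print(can_connect("apiserver", 8008))
```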

  				
Posted one year ago by StoutElephant16

hm, you should be able to hit None if docker networking is working properly. it shouldn't need to go through the internet to get back to your machine.

  				
Posted one year ago by SoreSparrow36

Currently I have the environment variable CLEARML_API_HOST= None set and CLEARML_HOST_IP is empty. I assume that the latter is not needed when the CLEARML_API_HOST is defined.

  				
Posted one year ago by StoutElephant16

I ran into something similar during deployment. Hopefully this helps with your debugging: if the agent was launched separately from the rest of the stack, it may not have proper docker-DNS resolution to None . (e.g. if it's in the same docker-compose, perhaps you didn't add the backend network field, or it was launched separately through docker run without an explicit external network defined)

if the agent's on the same machine, try docker network connect to add the agent to the same backend network used by the server stack.

if your backend is publicly accessible, you can have your agent's clearml.conf file point to a publicly reachable endpoint instead of None ; this is what enables my remote workers to talk to my instance hosting clearml.

  				
Posted one year ago by SoreSparrow36

Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉

  				
Posted one year ago by StoutElephant16

UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set as my public api address. So in other words, the traffic is going through the internet, back to the server (same machine) and now it seems to be working. Thanks SoreSparrow36 and SuccessfulKoala55
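For readers unfamiliar with the ${VAR:-default} form used above: docker-compose takes the value from the environment it is started in when the variable is set and non-empty, and falls back to the default otherwise. The sketch below only imitates that rule in Python to make the behaviour concrete. The function name and both URLs are hypothetical placeholders, not the values from this setup.

```python
def compose_default(env: dict, name: str, default: str) -> str:
    """Imitate ${NAME:-default}: use env[NAME] unless it is unset or empty."""
    value = env.get(name)
    return value if value else default


if __name__ == "__main__":
    # Hypothetical values, for illustration only
    internal = "http://internal-default.example:8008"
    public = "https://api.public.example"

    # Variable not set in the shell: the default is used
    print(compose_default({}, "CLEARML_API_HOST", internal))
    # Variable set in the shell: its value overrides the default
    print(compose_default({"CLEARML_API_HOST": public}, "CLEARML_API_HOST", internal))
```

With a hard-coded value in the compose file, the variable exported in the shell is never consulted, which matches the difference described in the update above.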

  				
Posted one year ago by StoutElephant16

OK, from the log it actually seems like it might be failing to connect to the ClearML server - can you try to exec into the container and ping the apiserver component? (it's on port 8008)

  				
Posted one year ago by SuccessfulKoala55

oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.

  				
Posted one year ago by SoreSparrow36

Hi StoutElephant16 , is it possible there's some issue getting network access to the internet from that container?

  				
Posted one year ago by SuccessfulKoala55

In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST , CLEARML_FILES_HOST , CLEARML_API_ACCESS_KEY , CLEARML_API_SECRET_KEY , CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS .
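A quick way to confirm that these variables actually reach the agent-services container is to run a small check inside it. This is a rough sketch: the list just repeats the names from this post, the helper name missing_vars is invented for illustration, and treating an empty value as a problem is an assumption (not every variable is necessarily required by the agent).

```python
import os

# Variable names taken from the post above
LISTED_VARS = [
    "CLEARML_API_HOST",
    "CLEARML_WEB_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
    "CLEARML_AGENT_GIT_USER",
    "CLEARML_AGENT_GIT_PASS",
]


def missing_vars(env, names=LISTED_VARS):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    # Only names are printed, so secrets never end up in the log
    print("unset or empty:", missing_vars(os.environ))
```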

  				
Posted one year ago by StoutElephant16

Hi SuccessfulKoala55 Thanks! it seems the container is able to download packages, I attached the full log here 😉

  				
Posted one year ago by StoutElephant16

Hi SoreSparrow36 , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response:
Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry

  agent-services:
    networks:
      - backend

I can also resolve and curl None from the clearml-agent-services container.

I managed to access my public backend with other external workers not running in services mode.

  				
Posted one year ago by StoutElephant16

SoreSparrow36 thanks a lot, I'll try it out 😉 Did I get it right? You have the public DNS names for CLEARML_WEB_HOST and CLEARML_FILES_HOST (both without http:// or https://)?

  				
Posted one year ago by StoutElephant16

Also can you attach a more complete log?

  				
Posted one year ago by SuccessfulKoala55

SuccessfulKoala55 but the problem still persists. Any other ideas?

  				
Posted one year ago by StoutElephant16
