Hi All

Hi All 🙂
I am trying to run a Hyperparameter Optimization Task, where the controller task is submitted to the services queue (and picked up by the default worker clearml-agent-services, which runs on the same machine as the ClearML Server). The training tasks should then be sent to other queues, where standard clearml-agent workers will pick them up for execution.

The weird thing is: the controller Task starts and is visible in the UI. It installs a few packages and then stays stuck in the Running status forever (the last log can be seen in the screenshot below). I also tried submitting dummy Python tasks that just call print("something"), and got the same result.
It's like it gets stuck at the initial Executing: ['docker', 'run', '-t', '-l', 'clearml-worker-id=clearml-services:service:(....long command....)

Interestingly, if I submit the controller task to any other queue with a standard clearml-agent (not in services mode), it is processed normally.

Any idea what I am missing? Thanks 🙏

  				
Posted 2 years ago by StoutElephant16

Answers 16

@<1593051292383580160:profile|SoreSparrow36> thanks a lot, I'll try it out 😉 Did I get it right? You have the public DNSs for CLEARML_WEB_HOST and CLEARML_FILES_HOST (both without http:// or https://)?

  				
Posted 2 years ago by StoutElephant16

@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?

  				
Posted 2 years ago by StoutElephant16

Good idea!
So, my api server is CLEARML_API_HOST= None and I ran telnet apiserver 8008 and received:

Trying 172.18.0.6...
Connected to apiserver.
Escape character is '^]'.

It seems the container is able to resolve the address and connect.
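For reference, the telnet check above can also be scripted. This is only a minimal sketch, assuming Python is available inside the agent-services container; `can_connect` is a hypothetical helper name, while the `apiserver` hostname and port 8008 are the ones used in this thread.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name and opens a TCP socket, much like telnet does
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Example call from inside the container (hostname and port as used in this thread):
# can_connect("apiserver", 8008)
```

A True result only shows that the port is reachable over TCP; it does not confirm that the agent's credentials or API settings are correct.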

  				
Posted 2 years ago by StoutElephant16

oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.

  				
Posted 2 years ago by SoreSparrow36

In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST, CLEARML_FILES_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS.

  				
Posted 2 years ago by StoutElephant16

Hi @<1523702018001080320:profile|StoutElephant16> , is it possible there's some issue getting network access to the internet from that container?

  				
Posted 2 years ago by SuccessfulKoala55

I ran into something similar during deployment. Hopefully this helps with your debugging: if the agent was launched separately from the rest of the stack, it may not have proper docker-DNS resolution to None (e.g. if it's in the same docker-compose, perhaps you didn't add the backend network field, or it was launched separately through docker run without an explicit external network defined).

if the agent's on the same machine, try docker network connect to add the agent to the same backend network used by the server stack.

if your backend is publicly accessible, you can have your agent's clearml.conf file point to a publicly reachable endpoint instead of None. this is what enables my remote workers to talk to my instance hosting clearml.
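To check whether Docker's DNS resolves a service name from inside a container, something like the following could help. It is a minimal sketch, assuming Python is available in the container; `resolve_host` is a hypothetical helper, and the service name to pass in (for example the `apiserver` name mentioned in this thread) depends on your compose file.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, de-duplicated IP addresses hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Name resolution failed, e.g. the container is not on the stack's network
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})
```

An empty list from inside the agent container would suggest it is not attached to the same Docker network as the server stack.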

  				
Posted 2 years ago by SoreSparrow36

Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉

  				
Posted 2 years ago by StoutElephant16

Currently I have the environment variable CLEARML_API_HOST= None set and CLEARML_HOST_IP is empty. I assume that the latter is not needed when the CLEARML_API_HOST is defined.

  				
Posted 2 years ago by StoutElephant16

UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set as my public api address. So in other words, the traffic is going through the internet, back to the server (same machine) and now it seems to be working. Thanks @<1593051292383580160:profile|SoreSparrow36> and @<1523701087100473344:profile|SuccessfulKoala55>
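For anyone puzzled by the `${CLEARML_API_HOST:-...}` syntax: Compose substitutes the value of the environment variable when it is set and non-empty, and falls back to the default written after `:-` otherwise. Here is a small Python sketch of that rule. It is not Compose's actual implementation, and `compose_default` plus the example values are hypothetical.

```python
import os


def compose_default(name, default, env=None):
    """Mimic Compose's ${NAME:-default}: use default when NAME is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    return value if value else default


# Hypothetical values for illustration only:
# with the variable set, the public address wins over the compose-file default
public = compose_default(
    "CLEARML_API_HOST",
    "internal-default",
    env={"CLEARML_API_HOST": "https://api.example.com"},
)
# with the variable unset (or empty), the compose-file default is used
fallback = compose_default("CLEARML_API_HOST", "internal-default", env={})
```

So with CLEARML_API_HOST exported on the host before the stack is brought up, the agent-services container receives the public address; without it, the container gets whatever default is written in the compose file.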

  				
Posted 2 years ago by StoutElephant16

After about 8hrs running I finally got clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server None ?

  				
Posted 2 years ago by StoutElephant16

OK, from the log it actually seems like it might be failing to connect to the ClearML server - can you try to exec into the container and ping the apiserver component? (it's on port 8008)

  				
Posted 2 years ago by SuccessfulKoala55

Also can you attach a more complete log?

  				
Posted 2 years ago by SuccessfulKoala55

Hi @<1593051292383580160:profile|SoreSparrow36> , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response:
Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry

  agent-services:
    networks:
      - backend

I can also resolve and curl None from the clearml-agent-services container.

I managed to access my public backend with other external workers not running in services mode.

  				
Posted 2 years ago by StoutElephant16

Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! it seems the container is able to download packages, I attached the full log here 😉

  				
Posted 2 years ago by StoutElephant16

hm, you should be able to hit None if docker networking is working properly. it shouldn't need to go through the internet to get back to your machine.

  				
Posted 2 years ago by SoreSparrow36
