Answered
Hi All

Hi All 🙂
I am trying to run a Hyperparameter Optimization Task where the controller task is submitted to the services queue (and picked up by the default worker clearml-agent-services, which runs on the same machine as the ClearML Server). The training tasks should then be sent to other queues, where a standard clearml-agent will pick them up for execution.

The weird thing is: the controller Task starts and is visible in the UI. It installs a few packages and then stays stuck in the Running status forever (the last log can be seen in the screenshot below). I also tried submitting dummy Python tasks that just print("something"), and got the same result.
It's like it got stuck in the initial Executing: ['docker', 'run', '-t', '-l', 'clearml-worker-id=clearml-services:service:(....long command....) .

Interestingly, if I submit the controller task to any other queue with a standard clearml-agent (not in services mode), it is processed normally.

Any idea what I am missing? Thanks 🙏
image

  
  
Posted one year ago

Answers 16


In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST, CLEARML_FILES_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS.
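One way to confirm which of these variables the agent container actually sees is to run a small check inside it. This is only a sketch: the variable list is copied from the post above, and the helper name missing_vars is made up for illustration.

```python
import os

# Variable names taken from the post above; adjust to your own setup.
EXPECTED_VARS = [
    "CLEARML_API_HOST",
    "CLEARML_WEB_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
    "CLEARML_AGENT_GIT_USER",
    "CLEARML_AGENT_GIT_PASS",
]


def missing_vars(env, names=EXPECTED_VARS):
    """Return the names that are unset or empty in the given mapping."""
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Unset or empty:", ", ".join(missing))
    else:
        print("All expected variables are set.")
```

Running it with docker exec inside the services container (rather than on the host) shows the values the agent itself receives, which may differ from the shell that launched docker-compose.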

  
  
Posted one year ago

After running for about 8 hours, I finally got clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server None ?

  
  
Posted one year ago

Hi SoreSparrow36 , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response:
Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry

  agent-services:
    networks:
      - backend 

I can also resolve and curl None from the clearml-agent-services container.

I managed to access my public backend with other external workers not running in services mode.

  
  
Posted one year ago

I ran into something similar during deployment. Hopefully this helps with your debugging: if the agent was launched separately from the rest of the stack, it may not have proper docker-DNS resolution to None . (e.g. if it's in the same docker-compose, perhaps you didn't add the backend network field, or it was launched separately through docker run without an explicit external network defined)

if the agent's on the same machine, try docker network connect to add the agent to the same backend network used by the server stack.

if your backend is publicly accessible, you can have your agent's clearml.conf file point to a publicly reachable endpoint instead of None ; this is what enables my remote workers to talk to my instance hosting clearml.
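To check the docker-DNS part of this from inside the agent container, a short resolution test can help. This is a rough sketch using only the Python standard library; the function name resolves is hypothetical, and the hostname apiserver is the service name mentioned elsewhere in this thread, so substitute your own.

```python
import socket


def resolves(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # Name lookup failed, e.g. the container is not on the backend network.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "apiserver" is assumed to be the compose service name of the API server.
    print(resolves("apiserver"))
```

An empty list suggests the container is not attached to the network where the server stack lives; a non-empty list only shows that the name resolves, not that the port is reachable.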

  
  
Posted one year ago

SoreSparrow36 thanks a lot, I'll try it out 😉 Did I get it right? You have the public DNSs for CLEARML_WEB_HOST and CLEARML_FILES_HOST (both without http:// or https://)?

  
  
Posted one year ago

hm, you should be able to hit None if docker networking is working properly. it shouldn't need to go through the internet to get back to your machine.

  
  
Posted one year ago

Hi StoutElephant16 , is it possible there's some issue getting network access to the internet from that container?

  
  
Posted one year ago

oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.
image

  
  
Posted one year ago

Currently I have the environment variable CLEARML_API_HOST= None set and CLEARML_HOST_IP is empty. I assume that the latter is not needed when the CLEARML_API_HOST is defined.

  
  
Posted one year ago

Hi SuccessfulKoala55 Thanks! It seems the container is able to download packages. I attached the full log here 😉

  
  
Posted one year ago

Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉

  
  
Posted one year ago

SuccessfulKoala55 but the problem still persists. Any other ideas?

  
  
Posted one year ago

OK, from the log it actually seems like it might be failing to connect to the ClearML server - can you try to exec into the container and ping the apiserver component? (it's in port 8008)

  
  
Posted one year ago

Good idea!
So, my api server is CLEARML_API_HOST= None and I ran telnet apiserver 8008 and received:

Trying 172.18.0.6...
Connected to apiserver.
Escape character is '^]'.

It seems the container is able to resolve the address and connect.
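If telnet is not available in the container, roughly the same check can be done with Python's standard library. This is a minimal sketch; can_connect is a made-up helper name, and the host apiserver and port 8008 are the ones quoted in this thread.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


if __name__ == "__main__":
    print(can_connect("apiserver", 8008))
```

Like telnet, this only shows that the port accepts TCP connections. It does not prove that the agent is configured to use that address.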

  
  
Posted one year ago

UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set as my public api address. So in other words, the traffic is going through the internet, back to the server (same machine) and now it seems to be working. Thanks SoreSparrow36 and SuccessfulKoala55
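For readers unfamiliar with the ${VAR:-default} form: docker-compose substitutes the value of VAR from the launching environment and falls back to the default only when VAR is unset or empty. The sketch below mimics that rule in Python for illustration; compose_default is a hypothetical helper, and both URLs are placeholders, not the addresses used in this thread.

```python
def compose_default(env, name, default):
    """Mimic ${NAME:-default}: use default when NAME is unset or empty."""
    value = env.get(name)
    return value if value else default


if __name__ == "__main__":
    # Placeholder values, assumed for illustration only.
    internal_default = "http://internal-default.example:8008"
    public_api = "https://api.example.com"

    # Variable not set in the shell that runs docker-compose: the default wins.
    print(compose_default({}, "CLEARML_API_HOST", internal_default))

    # Variable exported before running docker-compose: its value wins.
    print(compose_default({"CLEARML_API_HOST": public_api},
                          "CLEARML_API_HOST", internal_default))
```

With a hard-coded value in the compose file, the exported CLEARML_API_HOST is never consulted, which matches the behaviour described in the update.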

  
  
Posted one year ago

Also can you attach a more complete log?

  
  
Posted one year ago