Answered
Hi All

Hi All 🙂
I am trying to run a Hyperparameter Optimization Task, where the controller task is submitted to the services queue (and picked up by the default worker clearml-agent-services, which runs on the same machine as the ClearML Server). The training tasks should then be sent to other queues, where a standard clearml-agent will pick them up for execution.

The weird thing is: the controller Task starts and is visible in the UI. It will install a few packages, and then it will be stuck in the Running status forever (the last log can be seen in the screenshot below). I also tried submitting dummy Python tasks, that just print("something") , and get the same result.
It's like it got stuck in the initial Executing: ['docker', 'run', '-t', '-l', 'clearml-worker-id=clearml-services:service:(....long command....) .

Interestingly, if I submit the controller task to any other queue with a standard clearml-agent (not in services mode), it is processed normally.

Any idea what I am missing? Thanks 🙏

  
  
Posted 10 months ago

Answers 16


Also can you attach a more complete log?

  
  
Posted 10 months ago

Good idea!
So, my api server is CLEARML_API_HOST= None and I ran telnet apiserver 8008 and received:

Trying 172.18.0.6...
Connected to apiserver.
Escape character is '^]'.

It seems the container is able to resolve the address and connect.

  
  
Posted 10 months ago

OK, from the log it actually seems like it might be failing to connect to the ClearML server - can you try to exec into the container and ping the apiserver component? (it's in port 8008)
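A minimal sketch of that connectivity check in Python, for containers where ping or telnet is not installed. The helper name `can_connect` is hypothetical, and the host and port you pass in are whatever your own setup uses. It only shows that a TCP connection opens; it does not confirm that the API server answers requests correctly.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        # create_connection resolves the hostname and attempts the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS resolution failures, refused connections, and timeouts.
        return False
```

From inside the container, something like `can_connect("apiserver", 8008)` should return True if the name resolves and the port is open.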

  
  
Posted 10 months ago

After about 8hrs running I finally got clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server None ?

  
  
Posted 10 months ago

@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?

  
  
Posted 10 months ago

Currently I have the environment variable CLEARML_API_HOST= None set and CLEARML_HOST_IP is empty. I assume that the latter is not needed when the CLEARML_API_HOST is defined.

  
  
Posted 10 months ago

I ran into something similar during deployment. Hopefully this helps with your debugging: if the agent was launched separately from the rest of the stack, it may not have proper Docker DNS resolution to None (e.g. if it's in the same docker-compose, perhaps you didn't add the backend network field, or it was launched separately through docker run without an explicit external network defined).

if the agent's on the same machine, try docker network connect to add the agent to the same backend network used by the server stack.

if your backend is publicly accessible, you can have your agent's clearml.conf file point to a publicly reachable endpoint instead of None; this is what enables my remote workers to talk to my instance hosting clearml.

  
  
Posted 10 months ago

Hi @<1593051292383580160:profile|SoreSparrow36> , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response:
Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry

  agent-services:
    networks:
      - backend 

I can also resolve and curl None from the clearml-agent-services container.

I managed to access my public backend with other external workers not running in services mode.

  
  
Posted 10 months ago

oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.

  
  
Posted 10 months ago

Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉

  
  
Posted 10 months ago

In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST , CLEARML_FILES_HOST , CLEARML_API_ACCESS_KEY , CLEARML_API_SECRET_KEY , CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS .

  
  
Posted 10 months ago

Hi @<1523702018001080320:profile|StoutElephant16> , is it possible there's some issue getting network access to the internet from that container?

  
  
Posted 10 months ago

Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! it seems the container is able to download packages, I attached the full log here 😉

  
  
Posted 10 months ago

@<1593051292383580160:profile|SoreSparrow36> thanks a lot, I'll try it out 😉 Did I get it right? You have the public DNS names for CLEARML_WEB_HOST and CLEARML_FILES_HOST (both without http:// or https://)?

  
  
Posted 10 months ago

UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set as my public api address. So in other words, the traffic is going through the internet, back to the server (same machine) and now it seems to be working. Thanks @<1593051292383580160:profile|SoreSparrow36> and @<1523701087100473344:profile|SuccessfulKoala55>
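For anyone unsure what the `${CLEARML_API_HOST:-...}` form changes: in Compose files it falls back to the default only when the variable is unset or empty, otherwise the value from the environment wins. A rough Python sketch of that rule follows; `resolve_default` is a hypothetical helper, and the URLs in the usage note are made-up placeholders, not the addresses from this thread.

```python
import os


def resolve_default(name: str, default: str, env=None) -> str:
    """Mimic ${NAME:-default}: use the default if NAME is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    # An empty string counts as "not set" for the ':-' form.
    return value if value else default
```

With a placeholder default of `http://internal.example:8008`, an environment containing `CLEARML_API_HOST=https://api.example.com` resolves to the public address, while an empty or missing variable resolves to the default.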

  
  
Posted 10 months ago

hm, you should be able to hit None if docker networking is working properly. it shouldn't need to go through the internet to get back to your machine.

  
  
Posted 10 months ago