Also can you attach a more complete log?
I ran into something similar during deployment. Hopefully this helps with your debugging: if the agent was launched separately from the rest of the stack, it may not have proper Docker DNS resolution to None (e.g. if it's in the same docker-compose, perhaps you didn't add the backend network field, or it was launched separately through docker run without an explicit external network defined).
If the agent's on the same machine, try docker network connect to add the agent to the same backend network used by the server stack.
If your backend is publicly accessible, you can have your agent's clearml.conf file point to a publicly reachable endpoint instead of None ; this is what enables my remote workers to talk to my instance hosting ClearML.
OK, from the log it actually seems like it might be failing to connect to the ClearML server - can you try to exec into the container and ping the apiserver component? (it's on port 8008)
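If ping or telnet isn't installed in the container, a minimal Python sketch like the one below can stand in for that check. It assumes the apiserver hostname and port 8008 mentioned in this thread, and it only verifies that a TCP connection can be opened, not that the API answers correctly. The helper name can_connect is just illustrative.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the hostname and then connects, so a
        # DNS failure, a refused connection, and a timeout all raise OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example call from inside the agent container (names taken from this thread):
# can_connect("apiserver", 8008)
```

A False result doesn't tell you whether name resolution or the connection itself failed, so it's only a first pass before digging into the docker network setup.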
hm, you should be able to hit None if docker networking is working properly. it shouldn't need to go through the internet to get back to your machine.
After about 8hrs running I finally got clearml_agent: ERROR: Connection Error: it seems *api_server* is misconfigured. Is this the ClearML API server None ?
Good idea!
So, my API server is CLEARML_API_HOST= None and I ran telnet apiserver 8008 and received:
Trying 172.18.0.6...
Connected to apiserver.
Escape character is '^]'.
It seems the container is able to resolve the address and connect.
Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! It seems the container is able to download packages. I attached the full log here.
Here's my docker-compose, maybe I'm missing something. And thanks again for the support!
UPDATE: Now the agent-services is working. I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set to my public API address. In other words, the traffic is going through the internet and back to the server (same machine), and now it seems to be working. Thanks @<1593051292383580160:profile|SoreSparrow36> and @<1523701087100473344:profile|SuccessfulKoala55>
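To make the difference concrete: in Compose, ${VAR:-default} takes the value of VAR from the environment when it is set and non-empty, and falls back to the default otherwise. Here is a rough Python sketch of that rule. The function name resolve_with_default and the URLs in the usage comment are hypothetical placeholders, not values from my setup.

```python
def resolve_with_default(env: dict, name: str, default: str) -> str:
    """Mimic Compose's ${NAME:-default}: use default when NAME is unset or empty."""
    value = env.get(name)
    return value if value else default


# Hypothetical example values, for illustration only:
# resolve_with_default({"CLEARML_API_HOST": "https://api.example.com"},
#                      "CLEARML_API_HOST", "http://internal.example:8008")
```

So with CLEARML_API_HOST exported on the host, the agent-services container gets the public address rather than the hard-coded one.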
@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?
Hi @<1593051292383580160:profile|SoreSparrow36> , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response: Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry
agent-services:
  networks:
    - backend
I can also resolve and curl None from the clearml-agent-services container.
I managed to access my public backend with other external workers not running in services mode.
In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST, CLEARML_FILES_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS.
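In case it helps to double-check what the container actually sees, here is a small sketch that reports which of those variables are set while hiding the key and password values. The variable names are the ones listed above; the summarize_env helper and the masking rule are my own assumptions, not part of ClearML.

```python
import os

# Variable names as listed in this thread.
CLEARML_VARS = [
    "CLEARML_API_HOST",
    "CLEARML_WEB_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
    "CLEARML_AGENT_GIT_USER",
    "CLEARML_AGENT_GIT_PASS",
]


def summarize_env(env) -> dict:
    """Map each variable to its value, '<unset>', or a masked marker for secrets."""
    summary = {}
    for name in CLEARML_VARS:
        value = env.get(name, "")
        if not value:
            summary[name] = "<unset>"
        elif "KEY" in name or "PASS" in name:
            # Never print credentials; only report that they are present.
            summary[name] = "<set, hidden>"
        else:
            summary[name] = value
    return summary


if __name__ == "__main__":
    for name, shown in summarize_env(os.environ).items():
        print(f"{name}={shown}")
```

Running it inside the agent-services container shows quickly whether a value from docker-compose made it into the process environment.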
oh i see. you're talking about the agent-services, not a separate agent in a container.
yup, I've got the same thing going there.
fwiw...
for me, HOST_IP is 0.0.0.0 and the other "HOSTS" env vars don't contain "http" in them.
and my server is publicly reachable, not sure if that matters either.
Currently I have the environment variable CLEARML_API_HOST= None set and CLEARML_HOST_IP is empty. I assume that the latter is not needed when CLEARML_API_HOST is defined.
Hi @<1523702018001080320:profile|StoutElephant16> , is it possible there's some issue getting network access to the internet from that container?