I left the environment variables out to keep things short, but there is one, SHUTDOWN_IF_NO_ACCESS_KEY: 1. Maybe some authentication is failing and the container is stopping.
I have this block in my docker compose:
agent-services:
  networks:
    - backend
  container_name: clearml-agent-services
  image: allegroai/clearml-agent-services:latest
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  environment:
    <....>
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
  entrypoint: >
    bash -c "curl --retry 10 --retr...
Ok, I think that's been very helpful 🙂 I'll experiment a little, now that I know a Health Check that must work. I'll write here if I find something! Thanks a lot for the awesome support!
And it's still unhealthy. I am starting to suspect that the autoscaling part between the ALB and the ClearML server could somehow be causing the problem.
These are the settings for the health check now.
Currently I'm "cheating" and counting a 405 as the success code for the healthcheck.
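To see which status code a path actually returns before deciding what the health check should accept, a small shell sketch like the one below may help. The function name code_accepted and the localhost URL are assumptions for illustration only, and the sketch handles just a comma-separated list of codes, not ranges.

```shell
# Return 0 if status code $1 appears in the comma-separated list $2 (e.g. "200,405").
# code_accepted is a hypothetical helper, not part of ClearML or the AWS tooling.
code_accepted() {
  case ",$2," in
    *",$1,"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (assumed URL; needs network access to the server):
# code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8081/)
# if code_accepted "$code" "200,405"; then echo healthy; else echo unhealthy; fi
```

Wrapping the list in commas on both sides keeps a partial code such as 20 from matching 200.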
UPDATE: setting SHUTDOWN_IF_NO_ACCESS_KEY: 1 allowed me to see the agent-services container, and then a docker inspect clearml-agent-services showed me that the environment variables needed for the agent in the docker-compose.yml were empty. So the problem was in my bootstrap script. Because SHUTDOWN_IF_NO_ACCESS_KEY was set to 0 before, the container would disappear 🙂
Thanks a lot for helping me figure this out!
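For anyone hitting the same thing, here is a rough sketch of how the empty variables could be spotted quickly. It assumes Docker is installed and the container is named clearml-agent-services as in the compose block; find_empty_env is a hypothetical helper name, not a ClearML or Docker command.

```shell
# Read KEY=VALUE lines on stdin and print the names whose value is empty.
# find_empty_env is a hypothetical helper for illustration.
find_empty_env() {
  while IFS= read -r line; do
    case "$line" in
      *=*)
        # ${line#*=} is everything after the first "=", ${line%%=*} is the name.
        if [ -z "${line#*=}" ]; then
          printf '%s\n' "${line%%=*}"
        fi
        ;;
    esac
  done
  return 0
}

# Usage (requires a running or stopped container with that name):
# docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' clearml-agent-services | find_empty_env
```

Any name printed here was passed to the container with an empty value, which points back at whatever was supposed to set it on the host.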
This gives me a 200 🙂
Thanks a lot for the help debugging!
JuicyFox94 I think I found the problem. To my absolute shame, the security group of the ALB had no outbound rules, i.e. no traffic was allowed out of the ALB 🙈. Now I can access the ClearML Webserver!
Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! It seems the container is able to download packages. I attached the full log here 😉
Yes exactly, like a cron Job. Thanks a lot!
Ok, thanks a lot for the info! For now (as simple error handling): is there any way I can tell the ClearML Server from the shell that the experiment should be cancelled?
But I still have one thing I'd like to fix: the health check for the file server on port 8081 gives me unhealthy for path "/". Is there a valid path you know I can use there for health checks? A curl gives me
And I could access the web server even if the health check was failing. So that was not a problem in the end.
Thanks a lot! Yes, I don't see such a worker in the UI. docker ps returns the containers below. I suppose the clearml-apiserver is the relevant one.
@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?
Hi @<1593051292383580160:profile|SoreSparrow36> , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response: Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry
agent-services:
  networks:
    - backend
I can also resolve and curl None from the clearml-agent-services container.
I managed...
UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set to my public API address. So in other words, the traffic is going through the internet, back to the server (same machine) and now it seems to be working. Thanks @<1593051292383580160:...
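The ${VAR:-default} form in Compose follows the same rule as in the shell: the default is used when the variable is unset or empty. A minimal shell sketch of that behaviour, assuming a hypothetical fallback URL (the actual default in the message above was lost and is not reconstructed here):

```shell
# Print the API host that the ${CLEARML_API_HOST:-default} substitution would yield.
# $1 is a hypothetical fallback value, not the one from the original compose file.
resolve_api_host() {
  printf '%s\n' "${CLEARML_API_HOST:-$1}"
}

# Usage: with CLEARML_API_HOST exported and non-empty, that value is printed;
# with it unset or empty, the fallback passed as $1 is printed instead.
# resolve_api_host http://fallback.example.invalid
```

So if the bootstrap script exports an empty CLEARML_API_HOST, the container silently gets the fallback rather than the public address.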
In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST, CLEARML_FILES_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS.