Reputation
Badges 1
32 × Eureka!Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉
But I still have one thing I'd like to fix: the health check for the file server on port 8081 gives me unhealthy for path "/". Is there a valid path you know I can use there for health checks? A curl gives me
Ok, I think that's been very helpful 🙂 I'll experiment a little, now that I know a Health Check that must work. I'll write here if I find something! Thanks a lot for the awesome support!
@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?
JuicyFox94 I think I found the problem. To my absolute shame, the security group of the ALB had no Outbound rules, i.e. no traffic was allowed out of the ALB 🙈 . Now I can access the ClearML Webserver!
This gives me a 200 🙂
Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! it seems the container is able to download packages, I attached the full log here 😉
In my environment I have defined CLEARML_API_HOST (hard coded in docker-compose), CLEARML_WEB_HOST , CLEARML_FILES_HOST , CLEARML_API_ACCESS_KEY , CLEARML_API_SECRET_KEY , CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS .
These are the seetings for health check now
Thanks a lot for the help debugging!
Hi @<1593051292383580160:profile|SoreSparrow36> , thanks a lot! I ran docker network connect backend clearml-agent-services and got the response:Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry
agent-services:
networks:
- backend
I can also resolve and curl None from the clearml-agent-services container.
I managed...
Hey SuccessfulKoala55 . I use my own custom Daemon that in turn runs clearml-agent execute for some complicated reasons (other correlated processes) I want to be able to fetch and execute only certain task id, instead of pulling one from the queue.
Yes exactly, like a cron Job. Thanks a lot!
And it's still unhealthy. I am starting to suspect that somehow the Autoscaling Part in between the ALB and the ClearML server could be causing the problem.
I have this block in my docker compose:
agent-services:
networks:
- backend
container_name: clearml-agent-services
image: allegroai/clearml-agent-services:latest
deploy:
restart_policy:
condition: on-failure
privileged: true
environment:
<....>
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- /opt/clearml/agent:/root/.clearml
depends_on:
- apiserver
entrypoint: >
bash -c "curl --retry 10 --retr...
UPDATE: setting SHUTDOWN_IF_NO_ACCESS_KEY: 1 allowed me to see the agent-services container, and then a docker inspect clearml-agent-services showed me that the environment variables needed for the agent in the docker-compose.yml were empty. So the problem was in my bootstrap script.
Because SHUTDOWN_IF_NO_ACCESS_KEY was set to 0 before, the container would disappear 🙂
Thanks a lot for helping me figure this out!
Currently I'm "cheating" and counting a 405 as the success code for the healthcheck.
Ok thanks a lot for the Info! For now (as a simple error handling): is there any way I can tell the ClearML Server that the experiment should be cancelled using the shell?
And I could access the web server even if the health check was failing. So that was not a problem in the end.
Thanks a lot! Yes, I don't see such a worker in the UI. docker ps returns the containers below. I suppose the clearml-apiserver is the relevant one.