Hey SuccessfulKoala55. I use my own custom Daemon that in turn runs clearml-agent execute. For some complicated reasons (other correlated processes), I want to be able to fetch and execute only a certain task id, instead of pulling one from the queue.
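(For reference, a minimal sketch of what the daemon ends up calling, assuming the standard clearml-agent CLI; the task id is a placeholder:)

    # Fetch a specific task by id and execute it in place, instead of pulling from a queue.
    # <TASK_ID> is a placeholder; add --docker if the task should run inside its docker image.
    clearml-agent execute --id <TASK_ID>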
Ok, I think that's been very helpful 🙂 I'll experiment a little, now that I know of a health check that must work. I'll write here if I find something! Thanks a lot for the awesome support!
And it's still unhealthy. I am starting to suspect that the autoscaling layer between the ALB and the ClearML server could somehow be causing the problem.
UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None, where the environment variable CLEARML_API_HOST was set to my public API address. So in other words, the traffic is going through the internet, back to the server (same machine), and now it seems to be working. Thanks @<1593051292383580160:...
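(For anyone hitting the same thing: docker-compose uses the same ${VAR:-default} substitution rules as the shell, so you can sanity-check what will actually be injected before starting the stack. The grep below is just an example; the rendered value depends on what the host environment exports:)

    # Render the compose file with all ${...} substitutions resolved and check the API host.
    docker-compose config | grep CLEARML_API_HOST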
Currently I'm "cheating" and counting a 405 as the success code for the healthcheck.
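(In case it helps anyone, the workaround itself is a one-liner on the ALB target group; the ARN is a placeholder, and treating 405 as healthy is of course a stopgap, not a fix:)

    # Accept 200 as well as 405 as a "healthy" response from the target.
    aws elbv2 modify-target-group \
      --target-group-arn arn:aws:elasticloadbalancing:<region>:<account>:targetgroup/<name>/<id> \
      --matcher '{"HttpCode":"200,405"}'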
@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?
In my environment I have defined CLEARML_API_HOST (hard-coded in docker-compose), CLEARML_WEB_HOST, CLEARML_FILES_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS.
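(Roughly what the bootstrap script is supposed to do before bringing the stack up, so docker-compose has something to substitute; all values below are placeholders:)

    # Export the values that docker-compose substitutes into the agent-services environment.
    export CLEARML_API_HOST="https://api.example.com"      # placeholder
    export CLEARML_WEB_HOST="https://app.example.com"      # placeholder
    export CLEARML_FILES_HOST="https://files.example.com"  # placeholder
    export CLEARML_API_ACCESS_KEY="<access-key>"           # placeholder
    export CLEARML_API_SECRET_KEY="<secret-key>"           # placeholder
    export CLEARML_AGENT_GIT_USER="<git-user>"             # placeholder
    export CLEARML_AGENT_GIT_PASS="<git-token>"            # placeholder
    docker-compose up -d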
Ok thanks a lot for the Info! For now (as a simple error handling): is there any way I can tell the ClearML Server that the experiment should be cancelled using the shell?
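(For future readers, one way that should work from the shell, assuming the clearml Python SDK is installed there; the task id is a placeholder:)

    # Mark the experiment as stopped/aborted on the ClearML server.
    python -c "from clearml import Task; Task.get_task(task_id='<TASK_ID>').mark_stopped()"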
UPDATE: setting SHUTDOWN_IF_NO_ACCESS_KEY: 1 allowed me to see the agent-services container, and then a docker inspect clearml-agent-services showed me that the environment variables needed for the agent in the docker-compose.yml were empty. So the problem was in my bootstrap script. Because SHUTDOWN_IF_NO_ACCESS_KEY was set to 0 before, the container would disappear 🙂 Thanks a lot for helping me figure this out!
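(For anyone debugging something similar, the inspect can be narrowed down to just the environment the container was actually started with:)

    # Print only the container's environment variables.
    docker inspect -f '{{range .Config.Env}}{{println .}}{{end}}' clearml-agent-services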
Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! It seems the container is able to download packages; I attached the full log here 😉
I have this block in my docker compose:
agent-services:
  networks:
    - backend
  container_name: clearml-agent-services
  image: allegroai/clearml-agent-services:latest
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  environment:
    <....>
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
  entrypoint: >
    bash -c "curl --retry 10 --retr...
Thanks a lot! Yes, I don't see such a worker in the UI. docker ps returns the containers below. I suppose the clearml-apiserver is the relevant one.
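(A quick way to list just the ClearML containers and their state, for reference:)

    # Show name and status of every container whose name contains "clearml".
    docker ps --filter name=clearml --format 'table {{.Names}}\t{{.Status}}'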
Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉
Hi @<1593051292383580160:profile|SoreSparrow36>, thanks a lot! I ran docker network connect backend clearml-agent-services and got the response: Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected, because my docker-compose had the entry
agent-services:
  networks:
    - backend
I can also resolve and curl None from the clearml-agent-services container.
I managed...
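(A similar in-network check, for reference; the internal apiserver address and the debug.ping endpoint are taken from the default ClearML docker-compose, so adjust if your setup differs:)

    # From inside the agent-services container, check that the apiserver resolves and responds.
    docker exec clearml-agent-services curl -sf http://apiserver:8008/debug.ping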
I left the environment variables out to keep things short, but there is one SHUTDOWN_IF_NO_ACCESS_KEY: 1. Maybe some authentication is failing and the container is stopping. Thanks a lot for the help debugging!
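(The quickest way to confirm an authentication failure like that is tailing the container logs and looking for credential errors, e.g.:)

    # Follow the agent-services logs; login/credential errors show up here.
    docker logs -f --tail 100 clearml-agent-services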
JuicyFox94 I think I found the problem. To my absolute shame, the security group of the ALB had no Outbound rules, i.e. no traffic was allowed out of the ALB 🙈 . Now I can access the ClearML Webserver!
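(For completeness, the missing piece was just an egress rule on the ALB's security group; the group id is a placeholder, and allowing all outbound traffic mirrors the default for a new security group, so tighten it as needed:)

    # Allow all outbound traffic from the ALB's security group.
    aws ec2 authorize-security-group-egress \
      --group-id <sg-id> \
      --ip-permissions 'IpProtocol=-1,IpRanges=[{CidrIp=0.0.0.0/0}]'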
This gives me a 200 🙂
These are the settings for the health check now
Yes exactly, like a cron job. Thanks a lot!
And I could access the web server even if the health check was failing. So that was not a problem in the end.