
Hi @<1523701087100473344:profile|SuccessfulKoala55> Thanks! It seems the container is able to download packages; I attached the full log here 😉
In my environment I have defined CLEARML_API_HOST (hard-coded in docker-compose), CLEARML_WEB_HOST, CLEARML_FILES_HOST, CLEARML_API_ACCESS_KEY, CLEARML_API_SECRET_KEY, CLEARML_AGENT_GIT_USER and CLEARML_AGENT_GIT_PASS.
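A quick way to confirm that these variables actually reach the environment is a small check like the one below. This is only a sketch: `check_clearml_env` is a hypothetical helper name, not part of ClearML, and it only tests that each variable is non-empty, not that its value is valid.

```shell
# Hypothetical helper: report which of the variables listed above are
# unset or empty. Returns 0 if all are set, 1 otherwise.
check_clearml_env() {
  missing=0
  for name in CLEARML_API_HOST CLEARML_WEB_HOST CLEARML_FILES_HOST \
              CLEARML_API_ACCESS_KEY CLEARML_API_SECRET_KEY \
              CLEARML_AGENT_GIT_USER CLEARML_AGENT_GIT_PASS; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_clearml_env` inside the container (or in the shell that launches docker-compose) prints one line per missing variable.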
@<1523701087100473344:profile|SuccessfulKoala55> but the problem still persists. Any other ideas?
And it's still unhealthy. I am starting to suspect that the autoscaling part between the ALB and the ClearML server could somehow be causing the problem.
Currently I'm "cheating" and counting a 405 as the success code for the healthcheck.
Hey SuccessfulKoala55. I use my own custom daemon that in turn runs clearml-agent execute. For some complicated reasons (other correlated processes), I want to be able to fetch and execute only a certain task ID instead of pulling one from the queue.
JuicyFox94 I think I found the problem. To my absolute shame, the security group of the ALB had no Outbound rules, i.e. no traffic was allowed out of the ALB 🙈 . Now I can access the ClearML Webserver!
Ok thanks a lot for the Info! For now (as a simple error handling): is there any way I can tell the ClearML Server that the experiment should be cancelled using the shell?
I left the environment variables out to keep things short, but there is one: SHUTDOWN_IF_NO_ACCESS_KEY: 1. Maybe some authentication is failing and the container is stopping.
Thanks a lot! Yes, I don't see such a worker in the UI. docker ps returns the containers below. I suppose the clearml-apiserver is the relevant one.
Thanks a lot for the help debugging!
UPDATE: Now the agent-services is working 🙂 I was able to solve it by providing CLEARML_API_HOST: ${CLEARML_API_HOST:- None } in my docker-compose instead of CLEARML_API_HOST: None , where the environment variable CLEARML_API_HOST was set as my public API address. So in other words, the traffic is going through the internet, back to the server (same machine), and now it seems to be working. Thanks @<1593051292383580160:...
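For anyone unfamiliar with the `${VAR:-default}` form: docker-compose interpolation follows the same rule as shell parameter expansion, using the default only when the variable is unset or empty. The sketch below illustrates that rule in plain shell; `resolve_api_host` is a hypothetical helper and the fallback URL is a placeholder, not the value from the original compose file.

```shell
# Hypothetical helper showing ${VAR:-default} semantics.
# The fallback URL is a placeholder chosen for illustration only.
resolve_api_host() {
  echo "${CLEARML_API_HOST:-http://fallback.example:8008}"
}
```

With CLEARML_API_HOST exported as the public API address, the function prints that address; with it unset or empty, it prints the fallback.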
This gives me a 200 🙂
And I could access the web server even if the health check was failing. So that was not a problem in the end.
UPDATE: setting SHUTDOWN_IF_NO_ACCESS_KEY: 1 allowed me to see the agent-services container, and then a docker inspect clearml-agent-services showed me that the environment variables needed for the agent in the docker-compose.yml were empty. So the problem was in my bootstrap script.
Because SHUTDOWN_IF_NO_ACCESS_KEY was set to 0 before, the container would disappear 🙂
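In case it helps someone debugging the same thing, here is a rough sketch of how the empty variables can be spotted without reading the whole docker inspect output. `list_empty_env` is a hypothetical helper; it reads KEY=VALUE lines on stdin and prints the keys whose value is empty.

```shell
# Hypothetical helper: print the names of variables that have an empty value.
# Input: one KEY=VALUE entry per line on stdin.
list_empty_env() {
  while IFS= read -r line; do
    # Skip lines that are not KEY=VALUE entries.
    case "$line" in
      *=*) ;;
      *) continue ;;
    esac
    key=${line%%=*}    # text before the first '='
    value=${line#*=}   # text after the first '='
    if [ -z "$value" ]; then
      echo "$key"
    fi
  done
}

# Usage (requires Docker and the existing container):
# docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' \
#   clearml-agent-services | list_empty_env
```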
Thanks a lot for helping me figure this out!
Yes exactly, like a cron Job. Thanks a lot!
These are the settings for the health check now
Ok, I think that's been very helpful 🙂 I'll experiment a little, now that I know a Health Check that must work. I'll write here if I find something! Thanks a lot for the awesome support!
Here's my docker-compose, maybe I'm missing something 😄 And thanks again for the support 😉
Hi @<1593051292383580160:profile|SoreSparrow36>, thanks a lot! I ran docker network connect backend clearml-agent-services and got the response: Error response from daemon: endpoint with name clearml-agent-services already exists in network clearml_backend
It was expected because my docker-compose had the entry
agent-services:
  networks:
    - backend
I can also resolve and curl None from the clearml-agent-services container.
I managed...
I have this block in my docker compose:
agent-services:
  networks:
    - backend
  container_name: clearml-agent-services
  image: allegroai/clearml-agent-services:latest
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  environment:
    <....>
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
  entrypoint: >
    bash -c "curl --retry 10 --retr...