I think you should investigate what happens during docker-compose up to see why the services agent container isn't running
Hi @<1523702018001080320:profile|StoutElephant16> , it looks like the services agent isn't running for some reason if your task stays in pending. You should check the docker status and logs to see what's going on
I suggest starting with docker ps and then checking the logs of the relevant container
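A minimal sketch of that check, assuming the container is named clearml-agent-services (as in the compose block quoted in this thread) and that a stopped container is still listed by docker ps -a. The helper name not_running is made up for illustration; it only filters the name/status text that docker ps prints.

```shell
# Hypothetical helper: reads "NAME|STATUS" lines on stdin and prints the
# names of containers whose status does not start with "Up".
not_running() {
  awk -F'|' '$2 !~ /^Up/ { print $1 }'
}

# Usage (needs a running Docker daemon, so it is left commented out):
#   docker ps -a --format '{{.Names}}|{{.Status}}' | not_running
#   docker logs --tail 100 clearml-agent-services
```

If the services agent shows up in that list, its logs are the next place to look for the reason it stopped.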
Yes thanks a lot 🙂 This already helped me a lot 😉 I'll investigate!
Thanks a lot! Yes, I don't see such a worker in the UI. docker ps returns the containers below. I suppose the clearml-apiserver is the relevant one.
In the web UI, in the queue/worker tab, you should see a service queue and a worker available in that queue. Otherwise the service agent is not running. Refer to John c above
UPDATE: setting SHUTDOWN_IF_NO_ACCESS_KEY: 1 allowed me to see the agent-services container, and then a docker inspect clearml-agent-services showed me that the environment variables needed for the agent in the docker-compose.yml were empty. So the problem was in my bootstrap script.
Because SHUTDOWN_IF_NO_ACCESS_KEY was set to 0 before, the container would disappear 🙂
Thanks a lot for helping me figure this out!
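For anyone hitting the same thing, here is a rough sketch of how the empty variables could be spotted, assuming the container still exists and is named clearml-agent-services. The helper name empty_env_vars is hypothetical; it just filters KEY=VALUE lines such as the ones docker inspect can print from .Config.Env.

```shell
# Hypothetical helper: reads KEY=VALUE lines on stdin and prints the keys
# whose value is empty (a line that ends right after its first "=").
empty_env_vars() {
  awk '{ i = index($0, "=") } i > 0 && i == length($0) { print substr($0, 1, i - 1) }'
}

# Usage (needs Docker and an existing container, so it is left commented out):
#   docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' clearml-agent-services | empty_env_vars
```

Any key it prints was passed to the container with an empty value, which points back at whatever was supposed to populate it before docker-compose up.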
I left the environment variables out to keep things short, but there is one SHUTDOWN_IF_NO_ACCESS_KEY: 1. Maybe some authentication is failing and the container is stopping.
I have this block in my docker compose:
agent-services:
  networks:
    - backend
  container_name: clearml-agent-services
  image: allegroai/clearml-agent-services:latest
  deploy:
    restart_policy:
      condition: on-failure
  privileged: true
  environment:
    <....>
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - /opt/clearml/agent:/root/.clearml
  depends_on:
    - apiserver
  entrypoint: >
    bash -c "curl --retry 10 --retry-delay 10 --retry-connrefused '
    ' && /usr/agent/entrypoint.sh"
Is the services agent part of the docker compose?