Thanks @<1523701087100473344:profile|SuccessfulKoala55> - Yeah I found that allegroai/clearml-agent-services:latest
was running clearml-agent==1.1.1
. Tried plugging various other images into docker-compose.yml
& restarting to see if versions clearml-agent==1.6.1
or clearml-agent==1.7.0
would fix the issue but no luck unfortunately 😕
Hi @<1534706830800850944:profile|ZealousCoyote89> ! Do you have any info under STATUS REASON
? See the screenshot for an example:
Just user abort by the looks of things:
Does this help at all? (I can go a lil further back, just scanning through for any potential sensitive info!)
Hi @<1534706830800850944:profile|ZealousCoyote89> , can you please add the full log?
To me it looks as if somebody were going in to the UI and hitting abort on the task but that's definitely not the case
Any time I run the agent locally via:
clearml-agent daemon --queue services --services-mode --cpu-only --docker --foreground
It works without fail so I've tried removing the clearml
mount from agent-services
in docker-compose.yml
:
CLEARML_WORKER_ID: "clearml-services"
# CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
SHUTDOWN_IF_NO_ACCESS_KEY: 1
volumes:
- /var/run/docker.sock:/var/run/docker.sock
# - /opt/clearml/agent:/root/.clearml
I know there's some downfalls to doing this but it seems to prevent the Process terminated by user
issue I was seeing. Like I said, the issue appeared randomly so this could just be a coincidence.
Maybe some of the cached files could have been leading to the issue?
Hi @<1534706830800850944:profile|ZealousCoyote89> , I must admit I've not seen this behavior before occurring randomly, but I don't think the cache can be the result
Hi @<1523701070390366208:profile|CostlyOstrich36>
We've got quite a bit of sensitive info in the logs - I'll see what I can grab
Hi @<1534706830800850944:profile|ZealousCoyote89> , make sure you update the agent inside the services docker, as this image is probably running a very old version