Hi @<1534706830800850944:profile|ZealousCoyote89> , can you please add the full log?
Does this help at all? (I can go a lil further back, just scanning through for any potential sensitive info!)
Hi @<1523701070390366208:profile|CostlyOstrich36>
We've got quite a bit of sensitive info in the logs - I'll see what I can grab
Hi @<1534706830800850944:profile|ZealousCoyote89> , I must admit I've not seen this behavior occur randomly before, but I don't think the cache can be the cause
Hi @<1534706830800850944:profile|ZealousCoyote89> ! Do you have any info under STATUS REASON? See the screenshot for an example:
To me it looks as if somebody were going into the UI and hitting abort on the task, but that's definitely not the case
Thanks @<1523701087100473344:profile|SuccessfulKoala55> - Yeah, I found that allegroai/clearml-agent-services:latest was running clearml-agent==1.1.1. Tried plugging various other images into docker-compose.yml and restarting to see if clearml-agent==1.6.1 or clearml-agent==1.7.0 would fix the issue, but no luck unfortunately 😕
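For reference, a quick way to confirm which agent version a given image is actually running (a rough sketch; the container name clearml-agent-services is an assumption based on the default docker-compose, so adjust it to your setup):

# Assumed container name from the default ClearML docker-compose
docker exec clearml-agent-services clearml-agent --version
# or query pip inside the container for the installed package
docker exec clearml-agent-services pip show clearml-agent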
Any time I run the agent locally via:
clearml-agent daemon --queue services --services-mode --cpu-only --docker --foreground
It works without fail, so I've tried removing the clearml mount from agent-services in docker-compose.yml:
environment:
  CLEARML_WORKER_ID: "clearml-services"
  # CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
  SHUTDOWN_IF_NO_ACCESS_KEY: 1
volumes:
  - /var/run/docker.sock:/var/run/docker.sock
  # - /opt/clearml/agent:/root/.clearml
I know there are some downsides to doing this, but it seems to prevent the Process terminated by user issue I was seeing. Like I said, the issue appeared randomly, so this could just be a coincidence.
Maybe some of the cached files could have been leading to the issue?
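If it is the cache, one way to test without losing anything is to back it up rather than delete it (a rough sketch, assuming the default /opt/clearml paths from the standard server install; adjust to your layout):

# Back up the host-side cache that the mount maps into /root/.clearml,
# then let the agent rebuild it from scratch
sudo mv /opt/clearml/agent /opt/clearml/agent.bak
sudo mkdir -p /opt/clearml/agent
docker compose -f /opt/clearml/docker-compose.yml restart agent-services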
Hi @<1534706830800850944:profile|ZealousCoyote89> , make sure you update the agent inside the services docker, as this image is probably running a very old version
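For example, something like this (a sketch, assuming pip is on the PATH inside the image and the default container name; note an in-place upgrade won't survive the container being recreated, so pinning the version in the image or compose file is the durable fix):

# Assumed container name; upgrade the agent inside the running container
docker exec clearml-agent-services pip install --upgrade clearml-agent
docker restart clearml-agent-services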
Just user abort by the looks of things: