Hi @<1534706830800850944:profile|ZealousCoyote89> , can you please add the full log?
Hi @<1534706830800850944:profile|ZealousCoyote89> , make sure you update the agent inside the services docker, as this image is probably running a very old version
Any time I run the agent locally via:
clearml-agent daemon --queue services --services-mode --cpu-only --docker --foreground
It works without fail so I've tried removing the clearml mount from agent-services in docker-compose.yml :
CLEARML_WORKER_ID: "clearml-services"
# CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
SHUTDOWN_IF_NO_ACCESS_KEY: 1
volumes:
- /var/run/docker.sock:/var/run/docker.sock
# - /opt/clearml/agent:/root/.clearml
I know there's some downfalls to doing this but it seems to prevent the Process terminated by user issue I was seeing. Like I said, the issue appeared randomly so this could just be a coincidence.
Maybe some of the cached files could have been leading to the issue?
Just user abort by the looks of things:
Hi @<1523701070390366208:profile|CostlyOstrich36>
We've got quite a bit of sensitive info in the logs - I'll see what I can grab
Hi @<1534706830800850944:profile|ZealousCoyote89> , I must admit I've not seen this behavior before occurring randomly, but I don't think the cache can be the result
Hi @<1534706830800850944:profile|ZealousCoyote89> ! Do you have any info under STATUS REASON ? See the screenshot for an example:
Does this help at all? (I can go a lil further back, just scanning through for any potential sensitive info!)
To me it looks as if somebody were going in to the UI and hitting abort on the task but that's definitely not the case
Thanks @<1523701087100473344:profile|SuccessfulKoala55> - Yeah I found that allegroai/clearml-agent-services:latest was running clearml-agent==1.1.1 . Tried plugging various other images into docker-compose.yml & restarting to see if versions clearml-agent==1.6.1 or clearml-agent==1.7.0 would fix the issue but no luck unfortunately 😕