the default queue is served by (containerized + custom entrypoint) venv workers (agent services just wasn't working well for me, so I gave up on that)
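roughly, the entrypoint just does something like this (simplified sketch with assumed paths, not the exact script):

```bash
#!/usr/bin/env bash
# simplified sketch of the container entrypoint: bring up the ssh tunnel to the
# clearml server, then run a venv-mode agent against the default queue
set -euo pipefail
./connect.sh                    # ssh tunnel to the server (sketched further down)
exec clearml-agent daemon --queue default --foreground
```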
Hi @<1689446563463565312:profile|SmallTurkey79> , when this happens, do you see anything in the API server logs? How is the agent running, on top of K8s or bare metal? Docker mode or venv?
let me downgrade my install of clearml and try again.
hoping this really is a 1.16.2 issue. fingers crossed. at this point more pipes are failing than not.
Do you have the logs from the agent that's supposed to run your pipeline? Maybe there is a clue there. I would also suggest enqueuing the pipeline to some other queue, and maybe even running the agent on your own machine (if you don't already) to see what happens
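For a quick local test, something like this should be enough (the queue name here is just an example):

```bash
# spin up a throwaway agent on your own machine, watching a separate test queue,
# in the foreground so its logs stay in your terminal
pip install -U clearml-agent
clearml-agent daemon --queue debug --foreground
```

Then enqueue the pipeline to that queue from the UI and watch what the agent prints when it picks it up.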
it's happening pretty reliably, but the logs are just not informative. it just stops midway
ugh. again. it launched all these tasks and then just died. logs go silent.
yeah, locally it did run. I then ran another one via the UI, spawned from the successful run; it showed the cached steps and then refused to run the bottom one, disappearing again. No status message, no status reason. (not running... actually dead)
it's odd... I really don't see any tasks dying except the controller one
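a rough way to double-check the controller's status_reason straight from the apiserver through the tunnel (sketch only; it assumes the standard CLEARML_API_* credential env vars are set and the task id is a placeholder):

```bash
# sketch: fetch a bearer token with the api credentials, then ask the apiserver
# for the controller task's status and status_reason directly
TOKEN=$(curl -s -u "$CLEARML_API_ACCESS_KEY:$CLEARML_API_SECRET_KEY" \
  http://localhost:8008/auth.login | jq -r .data.token)
curl -s -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
  -d '{"task": "<controller-task-id>"}' \
  http://localhost:8008/tasks.get_by_id | jq '.data.task.status, .data.task.status_reason'
```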
the workers connect to the clearml server via ssh-tunnels, so they all talk to "localhost" despite being deployed in different places. each task creates artifacts and metrics that are used downstream
when I run the pipe locally, I'm using the same connect.sh script as the workers do, in order to poll the apiserver via the ssh tunnel.
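for context, connect.sh is basically just port forwarding, something along these lines (host, user, and key path are placeholders, not the real values):

```bash
#!/usr/bin/env bash
# forward the default clearml ports (8008 apiserver, 8080 webserver, 8081 fileserver)
# over ssh so the sdk/agent can keep pointing at localhost
ssh -fN -i ~/.ssh/clearml_key \
  -L 8008:localhost:8008 \
  -L 8080:localhost:8080 \
  -L 8081:localhost:8081 \
  user@clearml-server-host
```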
that's the final screenshot. it just shows a bunch of normal "launching ..." steps, and then stops all of a sudden.