Hi @<1689446563463565312:profile|SmallTurkey79> ! I will take a look at this and try to replicate the issue. In the meantime, I suggest you look into other dependencies you are using. Maybe some dependency got upgraded and the upgrade now triggers this behaviour in clearml.
Here's how I'm establishing worker-server (and client-server) comms, FWIW.
Did you take a look at my connect.sh script? I don't think that's the cause, since only the controller task is affected.
Is there some sort of culling procedure that kills tasks, by any chance? The lack of logs makes me think it's something like that.
I can also try different agent versions.
N/A (still shows as running despite Abort being sent)
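(I can force it to stopped from the SDK if that helps; a minimal sketch below, with the task ID as a placeholder and assuming the force/status_message arguments to mark_stopped() are available in my clearml version:)
```
from clearml import Task

# placeholder ID for the stuck controller task
task = Task.get_task(task_id="<controller-task-id>")

# ask the server to move the task to "stopped" even though the
# normal Abort didn't seem to take effect (assumed signature)
task.mark_stopped(force=True, status_message="force-stopped stuck controller")
```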
It's happening pretty reliably, but the logs just aren't informative. It just stops midway.
Are you running this locally, or are you enqueueing the task (the controller)?
I have tried other queues; they're all running the same container.
So far the only reliable thing is pipe.start_locally().
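For reference, this is roughly the shape of what works: a minimal sketch, not my actual pipeline (project/step names are placeholders):
```
from clearml import PipelineController

def step_one():
    # stand-in for a real pipeline step
    print("step one ran")

pipe = PipelineController(
    name="debug-pipeline",        # placeholder names, not my real project
    project="pipeline-debugging",
    version="0.0.1",
)
pipe.add_function_step(name="step_one", function=step_one)

# running the controller (and its steps) in-process takes the
# agent/queue out of the equation entirely
pipe.start_locally(run_pipeline_steps_locally=True)
```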
Trying to run the experiment that kept failing right now and watching the logs (they go by fast)... will try to spot anything anomalous.
Do you have any STATUS REASON under the INFO section of the controller task?
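If the UI is hard to catch, you can also pull it from the SDK; a quick sketch (the task ID is a placeholder, and the exact status_reason/status_message field names on the backend task object are my assumption):
```
from clearml import Task

task = Task.get_task(task_id="<controller-task-id>")
print(task.get_status())           # e.g. "stopped", "failed"
# these backend fields should mirror the STATUS REASON shown
# in the UI (assumed field names)
print(task.data.status_reason)
print(task.data.status_message)
```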
Damn, I can't believe it. It disappeared again, despite the task's clearml version being 1.15.1.
I'm going to try running the pipeline locally.
Would it be on the pipeline task itself then, since that's what's disappearing?
I will do some experiment comparisons and see if there are package diffs. Thanks for the tip.
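Sketch of what I'll try for the diff: pull the installed-package lists of a good run and a bad run and compare them. Task IDs are placeholders, and the script.requirements.pip path inside export_task() is my assumption about where the freeze-style blob lives:
```
from clearml import Task

def pip_set(task_id):
    # export_task() returns the task as a dict; requirements are stored
    # as a newline-separated, pip-freeze-style string (assumed layout)
    exported = Task.get_task(task_id=task_id).export_task()
    reqs = ((exported.get("script") or {}).get("requirements") or {}).get("pip", "")
    return {line.strip() for line in reqs.splitlines()
            if line.strip() and not line.startswith("#")}

good = pip_set("<good-run-task-id>")
bad = pip_set("<bad-run-task-id>")
print("only in good run:", sorted(good - bad))
print("only in bad run:", sorted(bad - good))
```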