Hi all,

What would cause a ClearML autoscaler instance that runs enqueued tasks one after another to eventually get stuck in a 'Running' state while emitting no logs?

For context, the issue shows up earlier (after fewer tasks in the loop) on a g4dn.xlarge instance than on a g4dn.2xlarge instance. The 2xlarge completes roughly twice as many tasks successfully before it eventually hangs. This is the final message I get before it gets stuck:

ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start

Has anyone seen this issue before?

Posted 20 days ago
