Unanswered
Hi all,
What would cause a ClearML autoscaler instance which is running enqueued tasks one after another to eventually get stuck in a 'Running' state while emitting no logs?
For context, the issue shows up earlier (fewer tasks into the loop) on a g4dn.xlarge instance than on a g4dn.2xlarge instance: the 2xlarge completes roughly twice as many tasks successfully before it eventually hangs. This is the final message I get before it gets stuck:
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
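Since the 2xlarge has twice the RAM of the xlarge (32 GiB vs 16 GiB), one guess, and it is only a guess, is that something accumulates on the instance from task to task until it runs out of memory or disk. To check that, here is a minimal sketch of what I could call at the start of each task to print the free memory and disk space. It uses only the Python standard library and no ClearML API; the names `parse_meminfo` and `resource_snapshot` are made up for this example, and reading `/proc/meminfo` assumes a Linux instance.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only well-formed "Name:   <integer> [kB]" lines.
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def resource_snapshot(meminfo_path="/proc/meminfo", disk_path="/"):
    """Return available memory (MiB) and free disk space (GiB).

    Hypothetical helper for this post. Memory is reported as None when
    the meminfo file cannot be read (for example on a non-Linux host).
    """
    try:
        with open(meminfo_path) as f:
            mem = parse_meminfo(f.read())
    except OSError:
        mem = {}
    mem_available_kib = mem.get("MemAvailable")
    disk = shutil.disk_usage(disk_path)
    return {
        "mem_available_mib": (
            None if mem_available_kib is None else mem_available_kib // 1024
        ),
        "disk_free_gib": round(disk.free / 2**30, 2),
    }


if __name__ == "__main__":
    # Printed output ends up in the task's console log, so the values
    # can be compared from one task to the next.
    print(resource_snapshot())
```

If the numbers drop steadily from one task to the next and the hang happens near zero, that would point to a leak on the instance rather than to the monitor message itself.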
Has anyone seen this issue before?