Unanswered
Hello Everyone,
We’Re Encountering A Persistent Issue With Our Autoscaler Setup And Could Really Use Some Help.
Despite Having The Autoscaler Running And The Queue (Default_Cpu) Properly Populated (87 Jobs Pending), The Tasks Are Never Picked Up And Exe
Hello
Sorry for my late reply.
I’m running into an issue with my default_gpu queue: the ClearML auto-scaler detects the job and puts it into the “Pending” state, but it never actually runs. From the auto-scaler logs (see screenshot 1), this seems expected since it only checks the queue every 5 minutes. I’ve also attached the relevant log file.
However, I don’t see anything in the logs that clearly explains the problem. Looking at AWS, I can see that the instance starts, stays in “Initializing” for a while, and then terminates.
Sorry if this isn’t very clear, but I hope this info is still helpful!
39 Views
0
Answers
one month ago
one month ago