CourageousCoyote72

1 Question, 4 Answers

Active since 15 July 2025

Last activity 4 months ago

Reputation

Badges 1

4 × Eureka!

0 Votes

16 Answers

595 Views

Hello everyone, we’re encountering a persistent issue with our autoscaler setup and could really use some help. Despite having the autoscaler running and the queue (default_cpu) properly populated (87 jobs pending), the tasks are never picked up and exe

Hello everyone, We’re encountering a persistent issue with our autoscaler setup and could really use some help. Despite having the autoscaler running and the...

mlops

4 months ago

0 Hello everyone, we’re encountering a persistent issue with our autoscaler setup and could really use some help. Despite having the autoscaler running and the queue (default_cpu) properly populated (87 jobs pending), the tasks are never picked up and exe

Our jobs are now running on the online app 👏
Thank you

4 months ago

Unfortunately, the issue is only partially resolved: while some jobs are running on one instance, on another instance (default_gpu), our jobs are still pending… 😢

4 months ago

I do not see any artifacts linked to the jobs in the default_gpu queue. We have not changed the configuration; as a debugging step, we simply restarted the instance.

4 months ago

Hello
Sorry for my late reply.

I’m running into an issue with my default_gpu queue: the ClearML auto-scaler detects the job and puts it into the “Pending” state, but it never actually runs. From the auto-scaler logs (see screenshot 1), this seems expected since it only checks the queue every 5 minutes. I’ve also attached the relevant log file.

However, I don’t see anything in the logs that clearly explains the problem. Looking at AWS, I can see that the instance starts, stays in “Initializi...

4 months ago