
Hello everyone,

We’re encountering a persistent issue with our autoscaler setup and could really use some help.

Despite having the autoscaler running and the queue (default_cpu) properly populated (87 jobs pending), the tasks are never picked up and executed. We’ve tried the usual troubleshooting steps—restarting the app, shutting down and relaunching the instances—but nothing resolves the blockage.

From the logs, the autoscaler spins up c5.4xlarge instances, and they do appear to attach correctly to the queue. However, no jobs are ever executed before all the workers get marked idle and are spun down again (see screenshot and logs).
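
For anyone who wants to double-check our setup, this is roughly how we inspect the queue and the attached workers from the SDK — a minimal sketch using the ClearML APIClient; the queue name is ours, and field names may vary by server version:

from clearml.backend_api.session.client import APIClient

client = APIClient()

# Workers the server currently knows about, and the queues they serve
for worker in client.workers.get_all():
    print(worker.id, getattr(worker, "queues", None))

# The queue itself; jobs piling up here while workers sit attached but idle
# matches what we observe
for queue in client.queues.get_all(name="default_cpu"):
    print(queue.id, queue.name)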

Does anyone know what might be preventing task execution in this context, or where we should look next?

Thanks a million in advance for any guidance! 🙏
— Emilie
[screenshot]

  
  
Posted one month ago

Answers 16


It looks like the task.execute_remotely() method is somehow broken. Previously, when I used it, the task would run in the queue with the same parameters I had set locally. Now the parameters are not being passed correctly, and I end up with two tasks: the one I launched locally (which ends up running remotely), and a second one without any parameters at all.
It's strange — how could the agent update have affected this?
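
For reference, this is roughly the pattern we use — project, queue, and parameter names here are just illustrative:

from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
task.connect({"lr": 0.001})  # parameters connected locally used to travel with the task

# With clone=False, the current task itself is reset and enqueued,
# and the local process exits once it is handed off
task.execute_remotely(queue_name="default_cpu", clone=False, exit_process=True)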

  
  
Posted one month ago

Our jobs are now running on the online app 👏
Thank you

  
  
Posted one month ago

Hi all, we're looking into it right now

  
  
Posted one month ago

Unfortunately, the issue is only partially resolved: some jobs are running on one queue, but on the other queue (default_gpu) our jobs are still pending… 😢
[screenshots]

  
  
Posted one month ago

Hello, sorry for my late reply.

I’m running into an issue with my default_gpu queue: the ClearML auto-scaler detects the job and puts it into the “Pending” state, but it never actually runs. From the auto-scaler logs (see screenshot 1), this seems expected since it only checks the queue every 5 minutes. I’ve also attached the relevant log file.

However, I don’t see anything in the logs that clearly explains the problem. Looking at AWS, I can see that the instance starts, stays in “Initializing” for a while, and then terminates.
Sorry if this isn’t very clear, but I hope this info is still helpful!
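
For what it's worth, in the open-source AWS autoscaler example the poll period is just a configuration value, so the 5-minute lag by itself looks normal — the field name below is an assumption taken from that example:

# Assumed from the open-source aws_autoscaler example configuration;
# a newly enqueued task can sit "Pending" for up to one full poll period.
hyper_params = {
    "polling_interval_time_min": 5,  # minutes between queue checks
}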
[screenshots]

  
  
Posted one month ago

Thank you for the detailed explanation. Could you please add a log of the EC2 instance itself? You can find it in the artifacts section of the autoscaler task. Is it the same autoscaler setup that used to work without issue, or were some changes introduced into the configuration?
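
If it helps, you can also pull the artifacts programmatically — a quick sketch, where the task ID is a placeholder you'd replace with your autoscaler task's ID:

from clearml import Task

autoscaler_task = Task.get_task(task_id="<autoscaler-task-id>")
for name, artifact in autoscaler_task.artifacts.items():
    print(name)  # the EC2 instance/console log should be listed here
    # artifact.get_local_copy()  # downloads it for inspection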

  
  
Posted one month ago

I do not see any artifacts linked to the jobs in the default_gpu queue. We have not changed the configuration; as a debugging step, we simply restarted the instance.
[screenshot]

  
  
Posted one month ago

Hi everyone, it seems the

clearml_agent: ERROR: Could not install task requirements!
expected SCALAR, SEQUENCE-START, MAPPING-START, or ALIAS

issue is related to a verbose printing failure which we can't reproduce, but we have a pretty good idea of where it happens. We'll release an RC of the agent soon that should contain a fix for this issue, and we'd appreciate it if anyone who can reproduce the issue could try it out and let us know whether a formal version should be released

  
  
Posted one month ago

Hi all, ClearML Agent v2.0.2rc0 is out - if you can try it out and let us know whether the issue is resolved, we'd appreciate it 🙏

  
  
Posted one month ago

The same happened to us today.
Usually everything works great, but now the autoscaler just starts instances and runs nothing.

  
  
Posted one month ago

Is there a way to override the version of clearml-agent that gets installed on the worker?
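
What we'd like is something along these lines, e.g. pinning the version in the script each new instance runs — the extra_vm_bash_script field name is taken from the open-source AWS autoscaler and may differ in the hosted app:

# Sketch: pin the agent version in the bash script executed on each new instance
extra_vm_bash_script = """
python3 -m pip install --upgrade "clearml-agent==1.9.3"
"""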

  
  
Posted one month ago

@<1855782485208600576:profile|CourageousCoyote72>, do you see crashes or machines going up and down? Or are the EC2 machines simply not being allocated?

  
  
Posted one month ago

@<1855782485208600576:profile|CourageousCoyote72>, @<1837300695921856512:profile|NastyBear13>, @<1855782492460552192:profile|IdealCamel90>, we've just released v2.0.1, which should address this issue

  
  
Posted one month ago

Hi,
It looks like the same issue is happening. It seems to be caused by the recent update of the clearml-agent package to version 2.0.0.
When I start an agent on the queue locally, the agent appears in the workers list but doesn't pick up any tasks. On the agent side, I get the following error:

FATAL ERROR:
Traceback (most recent call last):
  File "***.venv/lib/python3.12/site-packages/clearml_agent/commands/worker.py", line 2128, in daemon
    self.run_tasks_loop(
  File "***.venv/lib/python3.12/site-packages/clearml_agent/commands/worker.py", line 1464, in run_tasks_loop
    self.monitor.setup_daemon_cluster_report(worker_id=self.worker_id, max_workers=1)
  File "***.venv/lib/python3.12/site-packages/clearml_agent/helper/resource_monitor.py", line 283, in setup_daemon_cluster_report
    self._cluster_report.pending = True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'pending'
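
From the traceback, setup_daemon_cluster_report dereferences self._cluster_report without a None check; a guard along the lines of the sketch below would avoid the crash — this just illustrates the failure shape, not the actual clearml-agent patch:

# Illustration of the failing pattern with an assumed defensive guard;
# class and attribute names mirror the traceback, not the real implementation.
class ResourceMonitorSketch:
    def __init__(self, cluster_report=None):
        self._cluster_report = cluster_report  # None when no cluster report is configured

    def setup_daemon_cluster_report(self, worker_id, max_workers=1):
        if self._cluster_report is None:  # guard: nothing to report to
            return
        self._cluster_report.pending = True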

  
  
Posted one month ago

It now actually runs jobs, but @<1523701087100473344:profile|SuccessfulKoala55> I get:

clearml_agent: ERROR: Could not install task requirements!
expected SCALAR, SEQUENCE-START, MAPPING-START, or ALIAS

which I didn't get previously.
The previous version was 1.9.3 - can we still run that clearml-agent version through the autoscaler?

  
  
Posted one month ago

Another problem has surfaced: the same tasks that previously ran normally are now failing with this error log:

"""
clearml_agent: ERROR: Could not install task requirements!
expected SCALAR, SEQUENCE-START, MAPPING-START, or ALIAS
""""

  
  
Posted one month ago