
Hello everyone,

We’re encountering a persistent issue with our autoscaler setup and could really use some help.

Despite having the autoscaler running and the queue (default_cpu) properly populated (87 jobs pending), the tasks are never picked up and executed. We’ve tried the usual troubleshooting steps—restarting the app, shutting down and relaunching the instances—but nothing resolves the blockage.

From the logs, the autoscaler spins up c5.4xlarge instances, and they do appear to attach correctly to the queue. However, no jobs are ever executed before all the workers get marked idle and are spun down again (see screenshot and logs).
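
For anyone who wants to double-check our setup, this is roughly how we inspect the queue and the attached workers from the SDK — a minimal sketch using the ClearML APIClient; the queue name is ours, and field names may vary by server version:

from clearml.backend_api.session.client import APIClient

client = APIClient()

# Workers the server currently knows about, and the queues they serve
for worker in client.workers.get_all():
    print(worker.id, getattr(worker, "queues", None))

# The queue itself; jobs piling up here while workers sit attached but idle
# matches what we observe
for queue in client.queues.get_all(name="default_cpu"):
    print(queue.id, queue.name)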

Does anyone know what might be preventing task execution in this context, or where we should look next?

Thanks a million in advance for any guidance! 🙏
— Emilie
[screenshot]

  
  
Posted one month ago

Answers 16


It looks like the task.execute_remotely() method is somehow broken. Previously, when I used it, the task would run in the queue with the same parameters I had set locally. Now the parameters are not being passed correctly, and I end up with two tasks: the one I launched locally (which ends up running remotely), and a second one without any parameters at all.
It's strange — how could the agent update have affected this?
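
For reference, this is roughly the pattern we use — project, queue, and parameter names here are just illustrative:

from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
task.connect({"lr": 0.001})  # parameters connected locally used to travel with the task

# With clone=False, the current task itself is reset and enqueued,
# and the local process exits once it is handed off
task.execute_remotely(queue_name="default_cpu", clone=False, exit_process=True)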

  
  
Posted one month ago

Our jobs are now running on the online app 👏
Thank you

  
  
Posted one month ago

Hi all, we're looking into it right now

  
  
Posted one month ago

Unfortunately, the issue is only partially resolved: some jobs are running on one queue, but on the other queue (default_gpu) our jobs are still pending… 😢
[screenshots]

  
  
Posted one month ago

Hello, sorry for my late reply.

I’m running into an issue with my default_gpu queue: the ClearML auto-scaler detects the job and puts it into the “Pending” state, but it never actually runs. From the auto-scaler logs (see screenshot 1), this seems expected since it only checks the queue every 5 minutes. I’ve also attached the relevant log file.

However, I don’t see anything in the logs that clearly explains the problem. Looking at AWS, I can see that the instance starts, stays in “Initializing” for a while, and then terminates.
Sorry if this isn’t very clear, but I hope this info is still helpful!
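
For what it's worth, in the open-source AWS autoscaler example the poll period is just a configuration value, so the 5-minute lag by itself looks normal — the field name below is an assumption taken from that example:

# Assumed from the open-source aws_autoscaler example configuration;
# a newly enqueued task can sit "Pending" for up to one full poll period.
hyper_params = {
    "polling_interval_time_min": 5,  # minutes between queue checks
}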
[screenshots]

  
  
Posted one month ago

Thank you for the detailed explanation. Could you please add a log of the EC2 instance itself? You can find it in the artifacts section of the autoscaler task. Is it the same autoscaler setup that used to work without issue, or were some changes introduced into the configuration?
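
If it helps, you can also pull the artifacts programmatically — a quick sketch, where the task ID is a placeholder you'd replace with your autoscaler task's ID:

from clearml import Task

autoscaler_task = Task.get_task(task_id="<autoscaler-task-id>")
for name, artifact in autoscaler_task.artifacts.items():
    print(name)  # the EC2 instance/console log should be listed here
    # artifact.get_local_copy()  # downloads it for inspection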

  
  
Posted one month ago

I do not see any artifacts linked to the jobs in the default_gpu queue. We have not changed the configuration; as a debugging step, we simply restarted the instance.
[screenshot]

  
  
Posted one month ago

Hi everyone, it seems the

clearml_agent: ERROR: Could not install task requirements!
expected SCALAR, SEQUENCE-START, MAPPING-START, or ALIAS

issue is related to a verbose printing failure which we can't reproduce, but we have a pretty good idea of where it happens. We'll release an RC of the agent soon that should contain a fix for this issue, and we'd appreciate it if anyone who can reproduce the issue could try it out and let us know whether a formal version should be released

  
  
Posted one month ago

Hi all, ClearML Agent v2.0.2rc0 is out - if you can try it out and let us know whether the issue is resolved, we'd appreciate it 🙏

  
  
Posted one month ago

The same happened to us today.
Usually everything works great, but now the autoscaler just starts instances and runs nothing.

  
  
Posted one month ago

Is there a way to override the version of clearml-agent that gets installed on the worker?
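
What we'd like is something along these lines, e.g. pinning the version in the script each new instance runs — the extra_vm_bash_script field name is taken from the open-source AWS autoscaler and may differ in the hosted app:

# Sketch: pin the agent version in the bash script executed on each new instance
extra_vm_bash_script = """
python3 -m pip install --upgrade "clearml-agent==1.9.3"
"""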

  
  
Posted one month ago

@<1855782485208600576:profile|CourageousCoyote72>, do you see crashes or machines going up and down? Or are the EC2 machines simply not being allocated?

  
  
Posted one month ago

@<1855782485208600576:profile|CourageousCoyote72>, @<1837300695921856512:profile|NastyBear13>, @<1855782492460552192:profile|IdealCamel90>, we've just released v2.0.1, which should address this issue

  
  
Posted one month ago

Hi,
It looks like the same issue is happening. It seems to be caused by the recent update of the clearml-agent package to version 2.0.0.
When I start an agent on the queue locally, the agent appears in the workers list but doesn't pick up any tasks. On the agent side, I get the following error:

FATAL ERROR:
Traceback (most recent call last):
  File "***.venv/lib/python3.12/site-packages/clearml_agent/commands/worker.py", line 2128, in daemon
    self.run_tasks_loop(
  File "***.venv/lib/python3.12/site-packages/clearml_agent/commands/worker.py", line 1464, in run_tasks_loop
    self.monitor.setup_daemon_cluster_report(worker_id=self.worker_id, max_workers=1)
  File "***.venv/lib/python3.12/site-packages/clearml_agent/helper/resource_monitor.py", line 283, in setup_daemon_cluster_report
    self._cluster_report.pending = True
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'pending'
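
From the traceback, setup_daemon_cluster_report dereferences self._cluster_report without a None check; a guard along the lines of the sketch below would avoid the crash — this just illustrates the failure shape, not the actual clearml-agent patch:

# Illustration of the failing pattern with an assumed defensive guard;
# class and attribute names mirror the traceback, not the real implementation.
class ResourceMonitorSketch:
    def __init__(self, cluster_report=None):
        self._cluster_report = cluster_report  # None when no cluster report is configured

    def setup_daemon_cluster_report(self, worker_id, max_workers=1):
        if self._cluster_report is None:  # guard: nothing to report to
            return
        self._cluster_report.pending = True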

  
  
Posted one month ago

It now actually runs jobs, but @<1523701087100473344:profile|SuccessfulKoala55> I get:

clearml_agent: ERROR: Could not install task requirements!
expected SCALAR, SEQUENCE-START, MAPPING-START, or ALIAS

which I didn't get previously.
The previous version was 1.9.3 - can we still run that clearml-agent version through the autoscaler?

  
  
Posted one month ago

Another problem has surfaced: the same tasks that previously ran normally are now failing with this error log:

"""
clearml_agent: ERROR: Could not install task requirements!
expected SCALAR, SEQUENCE-START, MAPPING-START, or ALIAS
""""

  
  
Posted one month ago