Answered

Hello everyone, I’m currently facing an issue while using Cloud ClearML with aws_autoscaler.py. Occasionally, some workers become unusable when an EC2 instance is terminated, either manually or by aws_autoscaler.py. These workers are displayed in the UI with the message “Update Time N minutes ago”. The main problem is that these workers block the entire queue, preventing the start of new tasks. When I enqueue a new task, it remains pending because the autoscaler recognizes the existing worker and doesn’t attempt to start a new EC2 instance. As a result, the only solution is to wait for the timeout of 10 minutes until the worker is removed by app.clear.ml.
Solutions I’ve considered:

  1. I’ve tried removing the worker programmatically using the “workers.unregister” method; however, it only worked within the same session that called workers.register (see the sketch after this list). Note that I last checked this functionality a year ago, so it might have changed since then.
  2. The 10-minute timeout is not configurable and cannot be changed in app.clear.ml.
  3. While I appreciate the convenience of the cloud service, I’m hesitant to deploy an on-premise version of app.clear.ml.
If anyone knows of a workaround for this issue, please let me know. Your assistance would be greatly appreciated. Thank you.
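
For reference, here is a minimal sketch of the programmatic cleanup attempt from point 1, using the ClearML Python APIClient. It assumes a clearml.conf with credentials for app.clear.ml; the worker name below is a hypothetical placeholder, and whether unregister takes effect outside the registering session is exactly the open question above.

    from clearml.backend_api.session.client import APIClient

    client = APIClient()

    # List the currently known workers to find the stale one by its id.
    for worker in client.workers.get_all():
        print(worker.id, getattr(worker, "last_activity_time", None))

    # Attempt to remove the stale worker. As noted in point 1, this only
    # seemed to take effect from the same session that called workers.register.
    client.workers.unregister(worker="aws-autoscaler:dyn-worker-1")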
  
  
Posted 10 months ago

Answers 3


Yes. I’ve done some debugging and discovered that a process started from the user-data script doesn’t receive SIGTERM on instance termination, so the worker is unable to shut down gracefully and unregister.

  
  
Posted 10 months ago

Hi @<1571308079511769088:profile|GentleParrot65> , ideally you shouldn’t be terminating instances manually. However, do you mean that the autoscaler spins down a machine, still recognizes it as running, and refuses to spin up a new one?

  
  
Posted 10 months ago

More investigation showed that there is a problem with cloud-init. When I connect via SSH and start the process with “nohup python … &”, everything works: the process receives SIGTERM on instance termination. A process started by cloud-init (the user-data script) receives no signals on instance termination (although it does receive signals sent with kill <pid>). I’ve tried the following:

  • start with nohup python3 -m clearml-agent … &
  • start the agent with the --detached flag.
Neither works, so it looks like a bug. A minimal signal probe is sketched below.
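
To make the check above concrete, here is a minimal sketch of such a signal probe (not the actual script from this thread; the log path is an arbitrary assumption). Run it once from an SSH shell with nohup … & and once from the user-data script, terminate the instance, and compare the logs.

    # signal_probe.py: hedged sketch of the SIGTERM check described above.
    import signal
    import sys
    import time

    LOG = "/tmp/sigterm_probe.log"  # arbitrary location for the probe log

    def log(message):
        # Append a timestamped line so both runs can be compared afterwards.
        with open(LOG, "a") as f:
            f.write(f"{time.ctime()}: {message}\n")

    def on_sigterm(signum, frame):
        log(f"received signal {signum}, exiting")
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_sigterm)
    log("probe started, waiting for SIGTERM")

    while True:
        time.sleep(5)

If the copy launched by cloud-init never logs the signal while the copy started over SSH does, that points at how the user-data process is launched rather than at the agent itself.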
  
  
Posted 10 months ago