
Hi, I have a long-running experiment that was running on an AWS instance and got killed after ~4 days with the following reason: STATUS REASON: Forced stop (non-responsive)
What happened? Did the clearml-server ask the clearml-agent to stop the task because it didn't get anything from it for a long time? Is this period controlled by a parameter that can be changed?

Posted one year ago

Answers 5


Well, if the task was indeed running, it's strange that it was stopped: each task has a background thread that pings the server so the server knows it is still running. Maybe there was a network issue?
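To illustrate the keep-alive mechanism described above, here is a minimal Python sketch of a background heartbeat thread. It is not ClearML's implementation: the `HeartbeatThread` class, the `ping` callback, and the interval are hypothetical stand-ins for whatever the SDK actually uses to tell the server a task is alive. The point it shows is that a transient ping failure is tolerated, but if pings stop reaching the server for long enough (network outage, frozen VM), the server has no way to distinguish that from a dead task.

```python
import threading
import time
from typing import Callable


class HeartbeatThread:
    """Hypothetical sketch: call `ping` periodically in the background so the
    server can tell the task is still alive. Not ClearML's actual code."""

    def __init__(self, ping: Callable[[], None], interval_sec: float) -> None:
        self._ping = ping                  # hypothetical "I'm alive" call to the server
        self._interval_sec = interval_sec  # hypothetical ping period
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        # Wake the loop immediately and wait for the thread to exit.
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Ping right away, then once per interval until stop() is called.
        while not self._stop.is_set():
            try:
                self._ping()
            except Exception as exc:  # e.g. a network error
                # A single failed ping is not fatal; only a long run of
                # failures would make the server treat the task as dead.
                print(f"heartbeat failed: {exc}")
            self._stop.wait(self._interval_sec)


if __name__ == "__main__":
    sent = []
    hb = HeartbeatThread(lambda: sent.append(time.time()), interval_sec=0.05)
    hb.start()
    time.sleep(0.2)
    hb.stop()
    print(f"sent {len(sent)} heartbeats")
```

Using `threading.Event.wait` instead of `time.sleep` lets `stop()` interrupt the wait right away rather than blocking for a full interval.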


I assume you’re using a self-hosted server?

Yes


Thanks! I will investigate further. I suspect the AWS instance may have gotten stuck (become unhealthy) for some unknown reason.


In any case, the watchdog can be configured with the services.tasks.non_responsive_tasks_watchdog.threshold_sec server configuration setting (default: 7200 seconds, i.e. 2 hours).
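The rule a threshold-based watchdog like this applies can be sketched in a few lines of Python. This is a hypothetical model, not the clearml-server code: the `TaskRecord` type, its fields, the `"in_progress"` status string, and `find_non_responsive` are assumptions for illustration; only the 7200-second default comes from the thread. With that default, a task is only flagged once the server has heard nothing from it for more than two hours, so a task that ran fine for ~4 days must have gone silent for at least that long before being stopped.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

DEFAULT_THRESHOLD_SEC = 7200  # default quoted above for threshold_sec


@dataclass
class TaskRecord:
    # Hypothetical minimal view of a task as a server might track it.
    task_id: str
    status: str            # assumed status string for a running task: "in_progress"
    last_update: datetime  # time of the last ping/update received from the task


def find_non_responsive(
    tasks: List[TaskRecord],
    now: datetime,
    threshold_sec: int = DEFAULT_THRESHOLD_SEC,
) -> List[str]:
    """Return ids of running tasks whose last update is strictly older than
    `threshold_sec` seconds before `now` (hypothetical watchdog rule)."""
    cutoff = now - timedelta(seconds=threshold_sec)
    return [
        t.task_id
        for t in tasks
        if t.status == "in_progress" and t.last_update < cutoff
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    tasks = [
        TaskRecord("a", "in_progress", now - timedelta(hours=3)),     # silent > 2 h
        TaskRecord("b", "in_progress", now - timedelta(minutes=10)),  # recently pinged
        TaskRecord("c", "completed", now - timedelta(days=1)),        # not running
    ]
    print(find_non_responsive(tasks, now))  # ['a']
```

Raising `threshold_sec` only widens the silence window that is tolerated; it does not help if the task has genuinely stopped reporting.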


Hi JitteryCoyote63,

The clearml-server asked the clearml-agent to stop the task because it didn't get anything for a long time?

Seems so: there's a "non-responsive tasks" watchdog on the server in charge of doing exactly that. I assume you're using a self-hosted server?
