Hi, I am using spot instances (launched by auto-scaler PRO service), but I see that sometimes the instance is suddenly terminated and the task status is still 'running' (and not 'failed' or something else)

I am using spot instances (launched by the auto-scaler PRO service), but I see that sometimes an instance is suddenly terminated while the task status stays 'running' (and not 'failed' or something else 😞).

2022-09-12 08:50:10,453 - clearml.Auto-Scaler - WARNING - instance 'i-***********************' crashed

The auto-scaler launched a new instance, but it does not execute the unfinished job:

'clearml.Auto-Scaler - INFO - Found 0 tasks in queue'

I think this is a major issue, since there is no failure message on the job itself (only in the auto-scaler logs), and the job is not rolled over to a new instance.
I would appreciate your help 🙂

Posted one year ago

Answers 5

CostlyOstrich36 actually, as I see from the logs, it spun up a new instance but did not retry the task, maybe because it showed 'Found 0 tasks in queue' (once a task starts running, it is automatically removed from the queue). I also can't see anything new in the task logs (still stuck), so it seems that the problem still exists.
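
Until a fix lands, one possible workaround is to look for tasks that are still marked as running but have gone silent, since those are the ones a terminated spot instance leaves behind. Below is a minimal sketch of that check only. The `TaskInfo` record, the `find_orphaned_tasks` helper, the 15-minute threshold, and the `"in_progress"` status string are assumptions for illustration, not part of the auto-scaler; the actual status and last-update values would have to be fetched from the ClearML server.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TaskInfo:
    # Hypothetical record; real values would come from the ClearML server.
    task_id: str
    status: str            # assumed to be "in_progress" while a task is running
    last_update: datetime  # last time the task reported anything


def find_orphaned_tasks(tasks, now, max_silence=timedelta(minutes=15)):
    """Return IDs of tasks marked as running that have been silent longer than max_silence."""
    return [
        t.task_id
        for t in tasks
        if t.status == "in_progress" and now - t.last_update > max_silence
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        TaskInfo("stuck-task", "in_progress", now - timedelta(minutes=40)),
        TaskInfo("healthy-task", "in_progress", now - timedelta(minutes=2)),
    ]
    # Only the task that has been silent for 40 minutes is reported.
    print(find_orphaned_tasks(sample, now))
```

Each task reported this way could then be reset and put back in the queue by hand, so that the next instance the auto-scaler launches picks it up instead of finding 0 tasks in the queue.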

Posted one year ago

Hi SmugTurtle78, this issue is handled in the upcoming update of ClearML PRO.

Posted one year ago

SmugTurtle78, I think so. Can you verify on your end?

Posted one year ago

ok, thanks 🙂

Posted one year ago

CostlyOstrich36, is it already solved?

Posted one year ago