Answered

hey guys, I am trying to plan what I need to do in order to efficiently use ClearML with spot instances

  1. detecting when spot instance is down and experiment is aborted
  2. extracting S3 address of the latest checkpoint from ClearML API
  3. starting new experiment with this address as an argument
  4. merging aborted and new experiment, so that we can see all graphs and metrics nicely on one page

Steps 1-3 seem more or less straightforward, but what about 4? Does anybody have example code showing how you would go about merging two experiments (aborted and restarted)?

  
  
Posted 3 years ago

Answers 5


Hi DilapidatedDucks58 , I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:

  1. Update the experiment status to stopped (if it is failed, you won't be able to re-enqueue it).
  2. Set a parameter of that task to point to the latest checkpoint and load it. You can also infer it directly: I simply add a "resume" tag to the task and check at runtime whether this tag exists; if it does, I fetch the latest checkpoint of the task.
  3. Use https://clear.ml/docs/latest/docs/references/sdk/task#set_initial_iteration to prevent the task from overwriting the already logged iterations (ClearML should detect and handle this automatically, but that wasn't the case for me).
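
For anyone landing here later, the steps above could be sketched roughly like this. It is only a minimal sketch, not the poster's actual code: the tag name `resume`, the helper names, and the `last iteration + 1` offset are assumptions made for illustration. The offset in particular only makes sense if your training loop restarts counting from 0 after loading the checkpoint.

```python
RESUME_TAG = "resume"  # assumed tag name, pick whatever suits your project


def requeue_for_resume(task, queue_name):
    """Controller side: stop the interrupted task, tag it, and enqueue it again."""
    from clearml import Task  # imported lazily so the helpers below work without clearml

    task.mark_stopped()          # a failed task cannot be re-enqueued, a stopped one can
    task.add_tags([RESUME_TAG])  # the training script checks for this tag at runtime
    Task.enqueue(task, queue_name=queue_name)


def resume_offset_if_tagged(task):
    """Training-script side: if the task is tagged for resume, shift the iteration offset.

    Returns the offset that was applied, or None if the task is not tagged.
    """
    if RESUME_TAG not in (task.get_tags() or []):
        return None
    # Assumption: the loop restarts from 0, so new reports continue after the last one.
    offset = task.get_last_iteration() + 1
    task.set_initial_iteration(offset)
    return offset
```

In the training script this would be called right after `Task.init(...)`, before the first scalar is reported. If your loop restores its own step counter from the checkpoint, skip the offset, otherwise the iterations would be shifted twice.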

  
  
Posted 3 years ago

nice! exactly what I need, thank you!

  
  
Posted 3 years ago

Very Cool!
BTW guys, are you using the task.models[] to continue from the last checkpoint? or is it task.artifacts[] ?

  
  
Posted 3 years ago

we use task.models[] 🙂
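
A rough sketch of what fetching the last checkpoint through `task.models` could look like. The helper name `latest_checkpoint` is made up for illustration, and it assumes the output models are listed in the order they were registered, so the last entry is the most recent checkpoint. Worth verifying on your own tasks.

```python
def latest_checkpoint(task, download=False):
    """Return the most recent output model of a ClearML task.

    With download=False the stored URL (e.g. an S3 address) is returned,
    with download=True the model is fetched and a local path is returned.
    Returns None if the task has no output models yet.
    """
    outputs = list(task.models["output"])
    if not outputs:
        return None
    model = outputs[-1]  # assumption: last registered model == latest checkpoint
    return model.get_local_copy() if download else model.url
```

Typical usage would be `latest_checkpoint(Task.get_task(task_id=...))` from a controller script to get the address, or `latest_checkpoint(Task.current_task(), download=True)` from inside the resumed task.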

  
  
Posted 3 years ago

JitteryCoyote63 how do you detect, from within the ClearML task, that a spot interruption is coming in time to mark it as "resume"?

  
  
Posted 2 years ago