Hi! I have some ClearML agents running on GCP, and sometimes the instance seems to reboot, which makes the experiment fail and all the progress is lost. What is the best way to resume an experiment? 🥲

Posted 2 years ago

Answers 3

Hey GrievingTurkey78,

Please take a look here: https://clear.ml/docs/latest/docs/references/sdk/task#taskinit

I think what you're looking for is this:
Task.init(..., continue_last_task=True)

Just search for this parameter on that page for more info 🙂
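
For reference, here is a rough sketch of how that flag might sit in a training script. Treat it as an illustration under assumptions, not a recipe: `init_resumable_task` and `latest_checkpoint` are hypothetical helper names, the `*.pt` pattern is just an example, and it assumes the checkpoint directory survives the reboot (e.g. a persistent disk or mounted bucket). Only `Task.init` and its `continue_last_task` parameter come from the ClearML SDK.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint(checkpoint_dir: str, pattern: str = "*.pt") -> Optional[Path]:
    """Hypothetical helper: newest checkpoint file by modification time, or None."""
    candidates = sorted(
        Path(checkpoint_dir).glob(pattern),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def init_resumable_task(project_name: str, task_name: str):
    """Hypothetical helper: continue the previous run instead of starting a new task."""
    # Imported lazily so the checkpoint helper above works without clearml installed.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        continue_last_task=True,  # the parameter mentioned above
    )
```

At the top of the script you could then call `init_resumable_task(...)` with your own project and task names, followed by `latest_checkpoint(...)` on your checkpoint folder; if it returns `None`, there is nothing to resume from and training starts from scratch. Loading the weights themselves depends on your framework and is left out here.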

Posted 2 years ago

Hey CostlyOstrich36 sorry to ping you! Let's say I enqueue multiple experiments on a couple of agents and one of them fails. Is it possible to restart the experiment from the UI using the latest checkpoint? What if the experiment gets assigned to the other agent? I am not sure how the continue_last_task flag would help in this case.

Posted 2 years ago

Thanks 🙌

Posted 2 years ago
