Hi, Is It Possible To Resume An Experiment That Stopped Unexpectedly, By Using A Checkpoint Of The Model?

Posted 3 years ago

Answers 5

If you have the checkpoint (see output_uri for uploading it automatically), then you can always load it. Are you asking whether you can change the iteration/step counter? Or are you asking about doing this with trains-agent?
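On the step-counter part, one framework-agnostic option is to store the iteration number inside the checkpoint itself, so a resumed run can continue counting from where it stopped. The sketch below is only an illustration: `save_checkpoint`, `load_checkpoint`, `train`, and the JSON file layout are hypothetical names invented for this example, not part of trains.

```python
import json
import os


def save_checkpoint(path, weights, iteration):
    """Write weights and the iteration counter to one file, atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"iteration": iteration, "weights": weights}, f)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def load_checkpoint(path):
    """Return (weights, iteration), or (None, 0) when no checkpoint exists."""
    if not os.path.exists(path):
        return None, 0
    with open(path) as f:
        state = json.load(f)
    return state["weights"], state["iteration"]


def train(path, total_iterations, stop_after=None):
    """Toy loop: resumes from `path` if present; `stop_after` simulates a crash."""
    weights, start = load_checkpoint(path)
    if weights is None:
        weights = [0.0]
    for iteration in range(start, total_iterations):
        if stop_after is not None and iteration == stop_after:
            return iteration  # simulated unexpected stop
        weights = [w + 1.0 for w in weights]  # stand-in for a real training step
        save_checkpoint(path, weights, iteration + 1)
    return total_iterations
```

Calling `train(path, 10, stop_after=4)` and then `train(path, 10)` with the same path picks up at iteration 4 instead of 0. A real training script would save the framework's own model and optimizer state in place of the toy `weights` list.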

Posted 3 years ago

I would clone the first experiment, then in the cloned experiment change the initial-weights parameter (assuming there is a parameter storing that) to point to the latest checkpoint, i.e. provide the full path/link. Then enqueue it for execution. The downside is that the iteration counter will start from 0 rather than continuing from where the previous run stopped.
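The same clone, edit, enqueue flow can also be scripted. This is a rough sketch, not a verified recipe: it assumes the `Task` class offers `get_task`, `clone`, `set_parameter`, and `enqueue` with the signatures used here, so please check them against your installed version. The parameter name `initial_weights`, the helper `clone_and_resume`, and all IDs and URIs are hypothetical placeholders.

```python
def clone_and_resume(task_cls, source_task_id, checkpoint_uri, queue_name,
                     weights_param="initial_weights"):
    """Clone an experiment, point it at a checkpoint, and enqueue the clone.

    `task_cls` is expected to be the trains `Task` class; it is passed in
    so the helper can be exercised without a server connection.
    `weights_param` is a hypothetical name: use whatever parameter your
    script actually reads its initial weights from.
    """
    source = task_cls.get_task(task_id=source_task_id)
    cloned = task_cls.clone(source_task=source, name=source.name + " (resumed)")
    # Override only the initial-weights parameter on the clone.
    cloned.set_parameter(weights_param, checkpoint_uri)
    # Hand the clone to a queue that a trains-agent is listening on.
    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned


# Intended usage (placeholder values, needs a configured trains setup):
#   from trains import Task
#   clone_and_resume(Task, "<experiment id>", "<checkpoint path or link>", "default")
```

Depending on the version, parameters may be stored under a section prefix, so the exact key to override is best copied from the experiment's configuration tab in the UI.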

Posted 3 years ago

AstonishingSeaturtle47, does that make sense?

Posted 3 years ago

Yes, thanks.

Posted 3 years ago

Is it considered the same experiment? Is it possible to use the trains-agent for this? Can a resume be submitted from the UI?

Posted 3 years ago
