Hi, Is It Possible To Resume An Experiment That Stopped Unexpectedly, By Using A Checkpoint Of The Model?

Answered

Hi, is it possible to resume an experiment that stopped unexpectedly by using a checkpoint of the model?

  				
Posted 4 years ago by AstonishingSeaturtle47


Answers 5

I would clone the first experiment, then in the cloned experiment change the initial weights (assuming there is a parameter storing them) to point to the latest checkpoint, i.e. provide its full path/link, and then enqueue the clone for execution. The downside is that the iteration counter will restart from 0 instead of continuing from the previous run.
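
The clone-and-enqueue steps above might be sketched roughly as follows with the trains Python package. This is only a sketch under assumptions, not a confirmed recipe: the hyperparameter name `initial_weights`, the queue name `default`, and the helper names are hypothetical, and it only works if your training script really reads its starting weights from such a parameter.

```python
def build_overrides(checkpoint_url, param_name="initial_weights"):
    """Return the hyperparameter override that points a clone at a checkpoint.

    `param_name` is hypothetical: use whatever parameter your training
    script actually reads its starting weights from.
    """
    if not checkpoint_url:
        raise ValueError("checkpoint_url must be a non-empty path or link")
    return {param_name: checkpoint_url}


def clone_and_enqueue(source_task_id, checkpoint_url, queue_name="default"):
    """Clone an experiment, repoint its initial weights, and enqueue it."""
    # Imported lazily so build_overrides() is usable without trains installed.
    from trains import Task

    source = Task.get_task(task_id=source_task_id)
    cloned = Task.clone(source_task=source, name=source.name + " (resumed)")
    # Merge the override into the clone's existing hyperparameters.
    params = cloned.get_parameters()
    params.update(build_overrides(checkpoint_url))
    cloned.set_parameters(params)
    # A trains-agent listening on this queue (assumed name) picks up the clone.
    Task.enqueue(task=cloned, queue_name=queue_name)
    return cloned
```

The clone is a new experiment with its own ID, so its reported iterations start again from 0 unless the training code itself restores the counter.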

  				
Posted 4 years ago by AgitatedDove14

Is it considered the same experiment? Is it possible to do this with trains-agent, or to submit a resume from the UI?

  				
Posted 4 years ago by AstonishingSeaturtle47

AstonishingSeaturtle47, does that make sense?

  				
Posted 4 years ago by AgitatedDove14

Yes, thanks.

  				
Posted 4 years ago by AstonishingSeaturtle47

If you have the checkpoint (see output_uri for uploading it automatically), then you can always load it. Do you mean whether you can change the iteration/step counter? Or do you mean resuming with trains-agent?
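
On the iteration/step counter: one common approach, independent of trains, is to store the step inside the checkpoint so that a resumed run keeps counting from where it stopped. The sketch below uses only the Python standard library; the JSON file layout and the `save_checkpoint`, `load_checkpoint`, and toy `train` functions are illustrative assumptions, not part of trains.

```python
import json
import os
import tempfile


def save_checkpoint(path, step, weights):
    """Atomically write the step counter and weights to `path` as JSON."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "weights": weights}, handle)
    # Rename over the old file so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, weights), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path) as handle:
        state = json.load(handle)
    return state["step"], state["weights"]


def train(path, total_steps, stop_after=None):
    """Toy loop: resumes from `path` and checkpoints after every step."""
    step, weights = load_checkpoint(path)
    weights = weights if weights is not None else [0.0]
    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            break  # simulate an unexpected stop
        weights = [w + 1.0 for w in weights]  # stand-in for a real update
        step += 1
        save_checkpoint(path, step, weights)
    return step, weights
```

Calling `train` a second time with the same path picks up the saved step rather than starting from 0; in a real script the restored step would also be the value passed to whatever logger reports iterations.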

  				
Posted 4 years ago by AgitatedDove14
