Hi, Together With

Answered

Hi, Together With

Hi,
Together with ElegantKangaroo44 we found two unexpected behaviors in task.models['output'] :
The input model of the task is included in the list The best model is not included in the listWe log models using Ignite TrainSaver (pytorch_ignite == 0.4rc0.post1), any idea?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Votes Newest

Answers 30

Sure 🙂

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Which commit corresponds to RC version? So far we tested with latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

The experiment finished completely this time again

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

JitteryCoyote63 passed ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Seems to works, I started a last one to confirm!

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

I started a last one to confirm!

You mean a second run, just to make sure ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Alright, I will try with that one

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

To be honest, I'm not sure I have a good explanation on why ... (unless on some scenarios an exception was thrown and caught silently and caused it)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

JitteryCoyote63 fix pushed to master, let me know if it passes...

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

(It would be nice to have all the Pypi releases tagged in github btw)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

JitteryCoyote63 How is it so far ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Not using pytorch distributed, all models are uploaded to s3 yes

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Exactly

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Thanks for the quick responses and support too! 🙂

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					ElegantKangaroo44
				
					0
					 × 1

Okay there now:
https://github.com/allegroai/trains/tree/0.15.1rc0

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I'm happy to hear! 😅

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thanks! Will test now

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

"Updates a few seconds ago"

That just means that the process is not dead.

Yes that seemed to be stuck 😞
Any chance you can verify with the RC version?
I'll try to dig into the commits, maybe I can come up with an explanation ...

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I just tested the master with https://github.com/jkhenning/ignite/blob/fix_trains_checkpoint_n_saved/examples/contrib/mnist/mnist_with_trains_logger.py on the latest ignite master and Trains, it passed, but so did the previous commit...

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

with the RC version

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Alright, experiment finished properly (all models uploaded). I will restart it to check again, but seems like the bug was introduced after that

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Just checked, it did pass, training finished and all 200 models saved 🙂

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					ElegantKangaroo44
				
					0
					 × 1

I was unable to reproduce, but I added a few safety checks. I'll make sure they are available on the master in a few minutes, could maybe rerun after?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

And thanks again, I really appreciate testing it!

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

JitteryCoyote63 while it's running, could you give me a few details on the setup, maybe I can reproduce it.
Is it using pytorch distributed ?
Are all models uploaded to S3 ?
etc.

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Yes, it is supposed to run for 200 epochs

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

using trains RC, trains-agent 0.15.0

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

BTW:
Just making sure, 74 was not supposed to be the last checkpoint (in other words it is not stuck on leaving the training process, but actually in the middle)

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

(It would be nice to have all the Pypi releases tagged in github btw)

I wanted to say, we listen ... and point to the tag , but for some reason it was not pushed LOL.

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

The experiment finished completely this time again

With the RC version or the latest ?

  				
Posted 
	5 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Write your answer

2K Views

30 Answers

5 years ago

2 years ago