I just tested master with https://github.com/jkhenning/ignite/blob/fix_trains_checkpoint_n_saved/examples/contrib/mnist/mnist_with_trains_logger.py on the latest ignite master and Trains. It passed, but so did the previous commit...
To be honest, I'm not sure I have a good explanation as to why ... (unless in some scenarios an exception was thrown, caught silently, and caused it)
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
I was unable to reproduce it, but I added a few safety checks. I'll make sure they are available on master in a few minutes, could you maybe rerun after that?
Seems to work, I started a last one to confirm!
Not using PyTorch distributed, and yes, all models are uploaded to S3
JitteryCoyote63 fix pushed to master, let me know if it passes...
Thanks for the quick responses and support too! 🙂
I started a last one to confirm!
You mean a second run, just to make sure?
Okay there now:
https://github.com/allegroai/trains/tree/0.15.1rc0
The experiment finished completely this time again
Yes, it is supposed to run for 200 epochs
JitteryCoyote63 while it's running, could you give me a few details on the setup, maybe I can reproduce it.
Is it using PyTorch distributed?
Are all models uploaded to S3?
etc.
(It would be nice to have all the PyPI releases tagged on GitHub btw)
I wanted to say, we listen ... and point to the tag, but for some reason it was not pushed LOL.
BTW:
Just making sure, 74 was not supposed to be the last checkpoint? (In other words, it is not stuck while exiting the training process, but actually stuck in the middle.)
Which commit corresponds to the RC version? So far we tested with the latest commit on master (9a7850b23d2b0e1f2098ab051de58ce806143fff)
Just checked, it did pass, training finished and all 200 models saved 🙂
"Updates a few seconds ago"
That just means that the process is not dead.
Yes, that seemed to be stuck 😞
Any chance you can verify with the RC version?
I'll try to dig into the commits, maybe I can come up with an explanation ...
And thanks again, I really appreciate you testing it!
The experiment finished completely this time again
With the RC version or the latest?