BTW, VexedKangaroo32 are you using torch launch ?
Since this fix is all about synchronizing different processes, we wanted to be extra careful with the release. That said I think that what we have now should be quite stable. Plan is to have the RC available right after the weekend.
Hi VexedKangaroo32 , funny enough this is one of the fixes we will be releasing soon. There is a release scheduled for later this week, right after that I'll put here a link to an RC containing a fix to this exact issue.
Thanks VexedKangaroo32 , this is great news :)
So the way it will work, is you will also need to have a Task.init in main process (the one using the launch function) and the same Task.init in the main_func. What it does is it signals the sub processes to use the main process task. This way they all report to the same task. Obviously to test it you will need to wait for the RC (after the weekend :)
Hi VexedKangaroo32 , there is now an RC with a fix:pip install trains==0.13.4rc0
Let me know if it solved the problem
I'm not using torch launch, but the launch function in https://github.com/facebookresearch/detectron2/blob/master/detectron2/engine/launch.py I placed Task.init(...) inside the "main_func" that gets called in mp.spawn.
Meant to get back to you a bit sooner, but I can report that I no longer have duplicate tasks after updating to 0.13.4rc0 and putting Task.init in those two places. The job hasn't run to completion, so I can't report if it ends cleanly or not.
No manual logging attempted Tensorboard, terminal outputs, and argparser log properly No longer need the "-W ignore" arguments