Yeah, I experienced the same issue. Training stops / freezes at the end of the 10th or 15th epoch. I'm using pytorch_lightning as well.
I am using pytorch_lightning. I'll try to create a snippet I can share! Thanks 🙌
GrievingTurkey78, what framework are you working with? Can you provide some more information about your environment (Linux/Windows, pip/conda)? Could you also share a snippet of your code that I can run to reproduce this?
Hey CostlyOstrich36! I am using clearml==1.1.2 and clearml-agent==1.1.0. Stopped is not the right word, more like frozen: it just froze at an epoch. The console on the agent shows epoch 33, first batch, while the one on the server shows epoch 32, last batch. The experiment had been running for ~6 hours.
You can check the run time by switching to 'wall time' axis 🙂
GrievingTurkey78 Hi!
What versions of clearml and clearml-agent are you using? Also, for how long were the experiments running?
It seems like the agent is still reporting iterations and usage for the experiment, so what do you mean by stopped?