@<1719524641879363584:profile|ThankfulClams64> , can you provide a small code snippet that reproduces this behaviour? Can you also test with the latest version of clearml?
@<1719524641879363584:profile|ThankfulClams64> , are logs showing up without issue on the 'problematic' machine?
It seems similar to this None. Is it possible that saving too many model weights causes the metric logging thread to die?
We are running the same code on multiple machines and it just randomly happens. Currently we are having the issue on 1 out of 4 machines.
The same training works sometimes. But I'm not sure how to troubleshoot when it stops logging the metrics
Hi @<1719524641879363584:profile|ThankfulClams64> , the logging is done by a separate process, and I'm pretty sure it's not terminating all of a sudden. Did you manage to get a full log of such an experiment to share?
sometimes I get no scalars, but the console logging always seems to be working
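For reference, a minimal sketch of the kind of reproduction script asked for above. It is not the original training code. It assumes a configured clearml installation, and the project name, task name, and the synthetic "loss" values are all made up for illustration. It only reports one scalar per iteration and then flushes, so it can be run on the 'problematic' machine to check whether scalars arrive at all.

```python
import math


def report_training_scalars(logger, num_iterations=100):
    """Report one synthetic loss value per iteration.

    `logger` is anything with a ClearML-style `report_scalar` method.
    Returns the number of scalars reported.
    """
    reported = 0
    for iteration in range(num_iterations):
        # Synthetic, decaying "loss" (an assumption, stands in for real metrics)
        loss = math.exp(-iteration / 20.0)
        logger.report_scalar(
            title="train", series="loss", value=loss, iteration=iteration
        )
        reported += 1
    return reported


def main():
    # Imported here so the helper above can be exercised without clearml installed
    from clearml import Task

    # Hypothetical project/task names, chosen only for this sketch
    task = Task.init(project_name="debug", task_name="scalar logging repro")
    logger = task.get_logger()

    count = report_training_scalars(logger, num_iterations=100)
    print(f"reported {count} scalars")

    # Wait for pending reports to be sent before the script exits
    task.flush(wait_for_uploads=True)
    task.close()


# To run against a configured ClearML server, call main()
```

If the printed count is 100 but the scalars tab stays empty for this task, the full console log of that run is the thing to share.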