Hi there, I have several experiments hanging/stuck in the middle or at the end of the training, with the last message logged being:
train INFO: Engine run complete. Time taken: 00:16:18
clearml.reporter - WARNING - Event reporting sub-process lost, switching to thread based reporting
What could be the reason? How can I debug this? (I cannot reproduce it locally, and I have no clue where the task could be stuck or why.)
Hi JitteryCoyote63, how are you running the experiments? What's the OS/platform?
Hi SuccessfulKoala55, I was able to find the issue: I was creating a queue and a worker subprocess that were not properly cleaned up.
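For anyone hitting the same hang: the actual code isn't shown in this thread, so the following is only a minimal sketch of what cleaning up a queue and a worker subprocess might look like with Python's `multiprocessing`. The names `run_with_cleanup`, `_worker` and `_SENTINEL` are hypothetical, and the sketch assumes a POSIX system where the `fork` start method is available.

```python
import multiprocessing as mp

_SENTINEL = None  # hypothetical marker telling the worker to exit


def _worker(tasks, results):
    # Consume items until the sentinel arrives, then return so the process exits.
    for item in iter(tasks.get, _SENTINEL):
        results.put(item * 2)


def run_with_cleanup(items, timeout=5.0):
    # Assumption: POSIX with the "fork" start method. With "spawn", the calling
    # code would need an `if __name__ == "__main__":` guard instead.
    ctx = mp.get_context("fork")
    tasks, results = ctx.Queue(), ctx.Queue()
    proc = ctx.Process(target=_worker, args=(tasks, results))
    proc.start()
    try:
        for item in items:
            tasks.put(item)
        # Drain the results before joining the worker: a child that still has
        # buffered queue data cannot exit, and joining it first can deadlock.
        out = [results.get(timeout=timeout) for _ in items]
    finally:
        tasks.put(_SENTINEL)       # ask the worker to stop
        proc.join(timeout)         # wait for a clean exit
        if proc.is_alive():        # fall back to a hard stop if it is stuck
            proc.terminate()
            proc.join()
        for q in (tasks, results):
            q.close()              # no more data will be put from this process
            q.join_thread()        # wait for the queue's feeder thread to finish
    return out, proc.exitcode
```

The point of the `finally` block is that the worker and both queues are shut down even if the main code raises, so no child process or feeder thread is left alive to keep the interpreter from exiting at the end of training.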