Eureka! AgitatedDove14 I managed to reproduce on Ubuntu (but not on Windows):
Not every run gets stuck; sometimes only 1 in 10 runs hangs.
https://github.com/maor121/clearml-bug-reproduction
AgitatedDove14 Also, I found out that adding pool.join() after pool.close() seems to solve the issue in the minimal example.
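For reference, a minimal sketch of that workaround (the worker function, pool size, and ClearML project/task names here are illustrative, not the exact code from the reproduction repo):

```python
from multiprocessing import Pool

from clearml import Task


def work(i):
    # placeholder workload standing in for the real task
    return i * i


if __name__ == "__main__":
    task = Task.init(project_name="debug", task_name="pool-join-workaround")

    pool = Pool(processes=4)
    results = pool.map(work, range(10))

    pool.close()
    # Waiting for the worker processes to actually exit before the script
    # ends is what seems to avoid the hang in the minimal example.
    pool.join()

    print(results)
```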
Hi AgitatedDove14
It appears that /tmp was not cleared, and in addition we upload many large artifacts through clearml.
I am not sure whether /tmp was left uncleared by clearml or by pytorch, since both seem to use the tmp folder for storing files. In any case, my error was generated by PyTorch:
https://discuss.pytorch.org/t/num-workers-in-dataloader-always-gives-this-error/64718
The /tmp was full, and pytorch tried moving from /tmp to a local directory, which in our setup is a network NFS drive, hence the...
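In case it helps anyone hitting the same /tmp-full wall, here is a sketch of how one might check free space in the temp directory and point it at a larger local disk via the standard TMPDIR mechanism (the /data/tmp path is just an example); whether pytorch/clearml pick this up depends on when they resolve their temp dir, so this is an assumption, not a verified fix:

```python
import os
import shutil
import tempfile

# How much space is left where temp files currently go?
tmp_dir = tempfile.gettempdir()
usage = shutil.disk_usage(tmp_dir)
print(f"{tmp_dir}: {usage.free / 1e9:.1f} GB free")

# Redirect temp files to a larger local disk. This must happen before
# libraries create their temp files; tempfile caches the resolved
# directory, so reset it to force re-resolution.
os.environ["TMPDIR"] = "/data/tmp"  # hypothetical larger local disk
tempfile.tempdir = None
print("new temp dir:", tempfile.gettempdir())
```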