Hi SolidSealion72
"/tmp" contained alot of artifacts from ClearML past runs (1.6T in our case).
How did you end up with 1.6TB of artifacts there? What are the workflows on that machine? At least in theory, there should not be any leftovers in the /tmp folder after the process completes.
Hi AgitatedDove14
It appears that /tmp was not cleared, and in addition we upload many large artifacts through ClearML.
I am not sure whether /tmp was left uncleared by ClearML or by PyTorch, since both seem to use the tmp folder for storing files. In any case, my error was generated by PyTorch:
https://discuss.pytorch.org/t/num-workers-in-dataloader-always-gives-this-error/64718
The /tmp was full, and PyTorch tried moving its temp files from /tmp to a local directory, which is a network NFS drive, hence the error (too many connections to something). So the issue was a full /tmp that wasn't cleared, though I am not sure which program failed to clear it, PyTorch or ClearML. Most likely trainings that died prematurely left the leftovers behind.
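For now, the workaround I am considering is pointing the temp directory at a bigger local disk before anything imports torch. Rough sketch of what I mean (the path is just a placeholder, and this only helps for code that respects TMPDIR):
```
# Top of the training script, before importing torch / clearml, so any code
# that asks Python for a temp dir picks up the bigger disk instead of /tmp.
import os
os.environ["TMPDIR"] = "/data/local_tmp"   # placeholder, any disk with space
os.makedirs(os.environ["TMPDIR"], exist_ok=True)

import tempfile
print(tempfile.gettempdir())   # should now print /data/local_tmp

import torch
from torch.utils.data import DataLoader, TensorDataset

# DataLoader workers inherit the parent process environment,
# so their temp files should land on the new location as well.
loader = DataLoader(TensorDataset(torch.zeros(8, 3)), num_workers=2)
for _ in loader:
    pass
```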
SolidSealion72 this makes sense. ClearML deletes artifacts/models after they are uploaded, so I have to assume these are torch internal files.
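If you want to verify, you could list the largest stale entries under /tmp and see whether they look like torch temp files or something else. A quick diagnostic sketch (nothing ClearML-specific, just walking the folder):
```
import time
from pathlib import Path

TMP = Path("/tmp")
cutoff = time.time() - 24 * 3600   # untouched for more than a day

def tree_size(p: Path) -> int:
    # Total size of a file or a directory tree; skip symlinks and
    # anything we can't stat (other users' files, vanished entries).
    try:
        if p.is_symlink():
            return 0
        if p.is_file():
            return p.stat().st_size
        if p.is_dir():
            return sum(tree_size(c) for c in p.iterdir())
    except OSError:
        pass
    return 0

stale = []
for entry in TMP.iterdir():
    try:
        if entry.stat().st_mtime < cutoff:
            stale.append((tree_size(entry), entry))
    except OSError:
        continue

# Print the 20 largest stale entries so you can see who left them there.
for size, entry in sorted(stale, reverse=True)[:20]:
    print(f"{size / 1e9:9.2f} GB  {entry.name}")
```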