Hi AgitatedDove14
It appears that /tmp was not cleared, and in addition we upload many large artifacts through clearml.
I am not sure whether it was clearml or pytorch that failed to clear /tmp, since both seem to use that folder for storing files. In any case, my error was generated by PyTorch:
https://discuss.pytorch.org/t/num-workers-in-dataloader-always-gives-this-error/64718
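In case it helps anyone hitting the same thing, a minimal pre-flight check (standard library only, the 90% threshold is an arbitrary choice for illustration) can catch a nearly full /tmp before the DataLoader workers start failing:

```python
# Hypothetical pre-flight check: warn if /tmp is nearly full before
# launching a training run. shutil.disk_usage is standard library.
import shutil

usage = shutil.disk_usage("/tmp")
percent_used = usage.used / usage.total * 100
if percent_used > 90:
    print(f"/tmp is {percent_used:.0f}% full -- stale files from dead runs may need cleaning")
```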
/tmp was full, and PyTorch then tried falling back to a local directory that is actually a network NFS drive, hence the error (too many connections to something). So the issue was a full /tmp that wasn't cleared, though I am not sure which program failed to clear it, pytorch or clearml. Most likely trainings that died prematurely left leftovers behind.
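A sketch of the workaround I am considering, assuming a local non-NFS scratch directory exists (the /scratch/tmp path below is hypothetical): point Python's temp directory away from the full /tmp before the DataLoader is created, since the worker processes appear to put their temporary files under whatever tempfile.gettempdir() reports:

```python
# Sketch under the assumption that /scratch/tmp is a local (non-NFS) disk
# with free space; redirect temporary files there before spawning workers.
import os
import tempfile

import torch
from torch.utils.data import DataLoader, TensorDataset

scratch = "/scratch/tmp"          # hypothetical local scratch path, not NFS
os.makedirs(scratch, exist_ok=True)
os.environ["TMPDIR"] = scratch    # honored by tempfile.gettempdir()
tempfile.tempdir = scratch        # force it for the current process as well

# Dummy dataset just to show the DataLoader being created after the redirect.
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4)
for batch, labels in loader:
    pass  # training loop would go here
```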