Hi SolidSealion72
"/tmp" contained alot of artifacts from ClearML past runs (1.6T in our case).
How did you end up with 1.6TB of artifacts there? What are the workflows on that machine? At least in theory, there should not be any leftovers in the /tmp folder after the process completes.
Hi AgitatedDove14
It appears that /tmp was not cleared, and in addition we upload many large artifacts through ClearML.
I am not sure whether /tmp was left uncleared by ClearML or by PyTorch, since both seem to use the tmp folder for storing files. In any case, my error was generated by PyTorch:
https://discuss.pytorch.org/t/num-workers-in-dataloader-always-gives-this-error/64718
The /tmp was full, and PyTorch tried moving its temp files from /tmp to a local directory, which is a network NFS drive, hence the error (too many connections to something). So the issue was a full /tmp that wasn't cleared, though I am not sure which program failed to clear it, PyTorch or ClearML. Most likely trainings that died prematurely left the leftovers behind.
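For now, the workaround I am considering is pointing the temp directory at a bigger local disk before anything imports torch. Rough sketch of what I mean (the path is just a placeholder, and this only helps for code that respects TMPDIR):
```
# Top of the training script, before importing torch / clearml, so any code
# that asks Python for a temp dir picks up the bigger disk instead of /tmp.
import os
os.environ["TMPDIR"] = "/data/local_tmp"   # placeholder, any disk with space
os.makedirs(os.environ["TMPDIR"], exist_ok=True)

import tempfile
print(tempfile.gettempdir())   # should now print /data/local_tmp

import torch
from torch.utils.data import DataLoader, TensorDataset

# DataLoader workers inherit the parent process environment,
# so their temp files should land on the new location as well.
loader = DataLoader(TensorDataset(torch.zeros(8, 3)), num_workers=2)
for _ in loader:
    pass
```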
SolidSealion72 this makes sense. ClearML deletes artifacts/models after they are uploaded, so I have to assume these are torch internal files.
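If you want to verify, you could list the largest stale entries under /tmp and see whether they look like torch temp files or something else. A quick diagnostic sketch (nothing ClearML-specific, just walking the folder):
```
import time
from pathlib import Path

TMP = Path("/tmp")
cutoff = time.time() - 24 * 3600   # untouched for more than a day

def tree_size(p: Path) -> int:
    # Total size of a file or a directory tree; skip symlinks and
    # anything we can't stat (other users' files, vanished entries).
    try:
        if p.is_symlink():
            return 0
        if p.is_file():
            return p.stat().st_size
        if p.is_dir():
            return sum(tree_size(c) for c in p.iterdir())
    except OSError:
        pass
    return 0

stale = []
for entry in TMP.iterdir():
    try:
        if entry.stat().st_mtime < cutoff:
            stale.append((tree_size(entry), entry))
    except OSError:
        continue

# Print the 20 largest stale entries so you can see who left them there.
for size, entry in sorted(stale, reverse=True)[:20]:
    print(f"{size / 1e9:9.2f} GB  {entry.name}")
```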