Suggestion: ClearML should clear the "/tmp" directory of its own created files once in a while

Explanation: We just encountered a very uninformative error, "OSError: [Errno 16] Device or resource busy: '.nfs000000009a696f6d0000168f'", raised in Python's multiprocessing library from a ClearML process; training continued, but reporting to the ClearML server stopped. Debugging this, it turned out that the root partition was full: specifically, "/tmp" contained a lot of artifacts from past ClearML runs (1.6 TB in our case). We cleared them out manually, but it was very tiresome to debug and pinpoint the issue. I am writing this post for the community in case someone encounters this in the future. It might also be a useful future feature for ClearML jobs to clear the tmp directory of junk created by past runs that, for some unknown reason, was not deleted, or to remove it at the end of a run. I am just unsure what happens on a crash, and whether there is any cleanup mechanism for /tmp.
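Until something like this exists in ClearML itself, a scheduled sweep of stale files is one possible workaround. Below is a minimal sketch, assuming leftovers can be identified purely by age; the "clearml*" glob is only an assumed naming pattern, not a documented ClearML convention, so adjust it to whatever actually piles up in your /tmp.

import shutil
import time
from pathlib import Path

MAX_AGE_SECONDS = 7 * 24 * 3600  # anything untouched for a week is treated as junk

def sweep_tmp(tmp_dir: str = "/tmp", pattern: str = "clearml*") -> None:
    # Delete old files/directories matching the (assumed) name pattern.
    now = time.time()
    for entry in Path(tmp_dir).glob(pattern):
        try:
            if now - entry.stat().st_mtime < MAX_AGE_SECONDS:
                continue  # still fresh, possibly in use by a live run
            if entry.is_dir():
                shutil.rmtree(entry, ignore_errors=True)
            else:
                entry.unlink()
        except OSError:
            pass  # entry vanished or is held by another process; skip it

if __name__ == "__main__":
    sweep_tmp()

Running this from cron, or at the start of each job, would keep /tmp bounded even when a crashed run never gets the chance to clean up after itself.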

  
  
Posted 2 years ago

3 Answers


SolidSealion72 This makes sense. ClearML deletes artifacts/models after they are uploaded, so I have to assume these are torch internal files.

  
  
Posted 2 years ago

Hi AgitatedDove14
It appears that /tmp was not cleared, and in addition we upload many large artifacts through ClearML.

I am not sure whether it was ClearML or PyTorch that failed to clear /tmp, since both seem to use that folder for storing files. In any case, my error was generated by PyTorch:
https://discuss.pytorch.org/t/num-workers-in-dataloader-always-gives-this-error/64718

The /tmp was full, and PyTorch tried moving its temporary files to a local directory, which in our case is a network NFS drive, hence the error (too many connections to something). So the issue was a full /tmp that wasn't cleared, though I am not sure which program failed to clear it, PyTorch or ClearML. Most likely trainings that died prematurely left the leftovers behind.
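For anyone hitting the same wall: one mitigation, whichever library leaves the files behind, is to point temporary storage at a partition with room before the training process starts. A minimal sketch, assuming the files in question are created through Python's tempfile module (which honors the TMPDIR environment variable); /data/tmp is a hypothetical path, and I am not certain this covers every file PyTorch writes.

import os

# Point Python's tempfile (and anything built on it) at a bigger local disk
# *before* importing libraries that create temp files.
os.environ["TMPDIR"] = "/data/tmp"
os.makedirs("/data/tmp", exist_ok=True)

import tempfile
tempfile.tempdir = None            # force tempfile to re-read TMPDIR
print(tempfile.gettempdir())       # -> /data/tmp

# ... now import torch / clearml and start training as usual

Equivalently, export TMPDIR=/data/tmp in the shell (or the agent's environment) that launches the job.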

  
  
Posted 2 years ago

Hi SolidSealion72

"/tmp" contained alot of artifacts from ClearML past runs (1.6T in our case).

How did you end up with 1.6 TB of artifacts there? What are the workflows on that machine? At least in theory, there should not be any leftovers in the tmp folder after the process completes.
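One way to answer that is to group /tmp usage by entry-name prefix and see which program the space actually belongs to before deleting anything. A rough sketch; prefix grouping is only a heuristic for telling programs apart.

import os
from collections import defaultdict
from pathlib import Path

def dir_size(path: Path) -> int:
    # Recursively sum file sizes, ignoring entries that disappear mid-walk.
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda e: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total

usage = defaultdict(int)
for entry in Path("/tmp").iterdir():
    key = entry.name.split("-")[0][:12] or entry.name
    try:
        usage[key] += dir_size(entry) if entry.is_dir() else entry.lstat().st_size
    except OSError:
        pass

# Print the 15 largest groups in GB, biggest first.
for key, size in sorted(usage.items(), key=lambda kv: kv[1], reverse=True)[:15]:
    print(f"{size / 1e9:8.2f} GB  {key}")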

  
  
Posted 2 years ago