Only one. Will replicate it in detail and see what’s actually up
AgitatedDove14 - this was an interesting one. I think I have found the issue, but verifying the fix as of now.
One of the devs was using shutil.copy2 to copy parts of the dataset to a temporary directory in a with block, something like:
with TemporaryDirectory(dir=temp_dir) as certificates_directory:
    for file in test_paths:
        shutil.copy2(f"{dataset_local}/{file}", f"{certificates_directory}/{file}")
My suspicion is that since copy2 copies with full data and symlinks, it’s messing with the way clearml-data sets up datasets with symlinks to parents etc., and when the temp directories are cleaned up on exit from the with blocks, the local copies are cleared too.
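A minimal sketch for checking the symlink theory locally, assuming a POSIX filesystem. The directory and file names (parent_version, child_version, cert.pem) are made up for illustration and nothing here touches clearml itself. It copies a symlinked file into a TemporaryDirectory with shutil.copy2, once with the default follow_symlinks=True and once with follow_symlinks=False, and reports whether the copy was a symlink and whether the source survived the temp directory cleanup. Note that shutil.copy and shutil.copy2 both default to follow_symlinks=True.

```python
import shutil
import tempfile
from pathlib import Path


def copy_into_temp(src: Path, follow_symlinks: bool) -> tuple[bool, bool]:
    """Copy src into a TemporaryDirectory and return
    (copy_was_symlink, source_survived_cleanup)."""
    with tempfile.TemporaryDirectory() as tmp:
        dst = Path(tmp) / src.name
        shutil.copy2(src, dst, follow_symlinks=follow_symlinks)
        copy_was_symlink = dst.is_symlink()
    # The temporary directory has been removed at this point.
    return copy_was_symlink, src.exists()


def demo() -> dict[str, tuple[bool, bool]]:
    """Build a toy layout where a 'child' entry is a symlink to a 'parent' file."""
    with tempfile.TemporaryDirectory() as root:
        parent = Path(root) / "parent_version"  # hypothetical parent dataset folder
        child = Path(root) / "child_version"    # hypothetical child dataset folder
        parent.mkdir()
        child.mkdir()
        real_file = parent / "cert.pem"
        real_file.write_text("dummy")
        link = child / "cert.pem"
        link.symlink_to(real_file)
        return {
            "follow": copy_into_temp(link, follow_symlinks=True),
            "no_follow": copy_into_temp(link, follow_symlinks=False),
        }
```

Calling demo() on the machine where the agent runs (or adapting copy_into_temp to a real dataset path) is one way to see whether the copied entries are links or regular files there.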
The number of entries in the dataset cache can be controlled via clearml.conf: sdk.storage.cache.default_cache_manager_size
And I think the default is 100 entries, so it should not get cleaned.
and then they are all removed and for a particular task it even happens before my task is done
Is this reproducible? Who is cleaning it and when?
Hmm, notice that it does store symlinks to parent data versions (to save on multiple copies of the same file). If you call get_mutable_local_copy() you will get a standalone copy.
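A rough sketch of the get_mutable_local_copy() route, assuming clearml is installed and configured. The helper name fetch_standalone_copy and both of its arguments are hypothetical; only Dataset.get and get_mutable_local_copy come from the clearml SDK.

```python
from pathlib import Path


def fetch_standalone_copy(dataset_id: str, target_folder: str) -> Path:
    """Fetch a dataset into target_folder as a standalone copy and return its path."""
    # Imported lazily so the sketch can be loaded without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    # get_mutable_local_copy() writes a standalone copy of the dataset into
    # target_folder, rather than returning the shared cache folder.
    local_path = dataset.get_mutable_local_copy(target_folder=target_folder)
    if local_path is None:
        raise RuntimeError(f"Could not fetch dataset {dataset_id}")
    return Path(local_path)


# Example call (placeholder values, not taken from the thread):
# path = fetch_standalone_copy("<dataset-id>", "/tmp/my_dataset_copy")
```

Files under the returned path can then be copied or deleted without going through the shared cache folder.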
Fix - use shutil.copy instead of shutil.copy2 - verifying now.
Also, it’s not happening when running locally, only remotely on an agent.
AgitatedDove14 - worked with mutable copy! So was definitely related to the symlinks in some form
So was definitely related to the symlinks in some form
Could it be that it actually deleted the cache? How many agents are running on the same machine?