Also, it’s not happening when running locally, only when running remotely on an agent
Hmm, notice that it does store symlinks to parent data versions (to avoid keeping multiple copies of the same file). If you call get_mutable_local_copy() you will get a standalone copy
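A minimal sketch of what that could look like, assuming the dataset is fetched by ID. The helper name fetch_standalone_copy, the dataset ID, and the target folder are made up for illustration; Dataset.get and get_mutable_local_copy are the clearml calls:

```python
def fetch_standalone_copy(dataset_id: str, target_dir: str) -> str:
    """Return the path of a standalone copy of a ClearML dataset (hypothetical helper)."""
    # Imported lazily so this module can be loaded without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    # Ask for a copy in target_dir rather than using the shared cache folder.
    return dataset.get_mutable_local_copy(target_folder=target_dir)


# Example call (placeholder ID and path, not taken from this thread):
# local_path = fetch_standalone_copy("<your-dataset-id>", "/tmp/my_dataset_copy")
```

Files under the returned folder can then be copied or removed without touching the cached dataset versions.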
AgitatedDove14 - this was an interesting one. I think I have found the issue, but verifying the fix as of now.
One of the devs was using shutil.copy2 to copy parts of the dataset to a temporary directory in a with block - something like:
    with TemporaryDirectory(dir=temp_dir) as certificates_directory:
        for file in test_paths:
            shutil.copy2(f"{dataset_local}/{file}", f"{certificates_directory}/{file}")
My suspicion is that, since copy2 copies full data and symlinks, it’s messing with the way clearml-data sets up datasets with symlinks to parents etc., and when the temp directories are cleaned up at the end of the with blocks, the local copies are cleared too
AgitatedDove14 - worked with mutable copy! So was definitely related to the symlinks in some form
Fix - use shutil.copy instead of shutil.copy2 - verifying now.
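One way to check the symlink suspicion outside of ClearML is a small stdlib-only experiment. This is a sketch under assumptions: the file names are invented, it needs a platform where os.symlink works without extra privileges, and it says nothing about how clearml-data manages its own cache. It copies a symlinked file into a temporary directory, lets the directory be cleaned up, and reports whether the copy was itself a symlink and whether the original file and link survived:

```python
import os
import shutil
import tempfile
from pathlib import Path


def copy_behaviour(copy_fn, **kwargs):
    """Copy a symlinked file into a TemporaryDirectory and report what happened.

    Returns (copy_was_symlink, parent_file_survived, link_survived).
    """
    with tempfile.TemporaryDirectory() as root:
        root = Path(root)
        parent_file = root / "parent_version.txt"  # stands in for a parent dataset file
        parent_file.write_text("payload")
        link = root / "child_link.txt"  # stands in for the symlink in a child version
        os.symlink(parent_file, link)

        with tempfile.TemporaryDirectory(dir=root) as tmp:
            dest = Path(tmp) / "copied.txt"
            copy_fn(link, dest, **kwargs)
            copy_was_symlink = dest.is_symlink()

        # The inner temporary directory has been removed at this point.
        return copy_was_symlink, parent_file.exists(), link.exists()


if __name__ == "__main__":
    print("copy                         ", copy_behaviour(shutil.copy))
    print("copy2                        ", copy_behaviour(shutil.copy2))
    print("copy2, follow_symlinks=False ", copy_behaviour(shutil.copy2, follow_symlinks=False))
```

Running it against the real dataset folder layout would be the next step, since the behaviour on the agent may depend on details this toy setup does not reproduce.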
And I think the default is 100 entries, so it should not get cleaned.
and then they are all removed and for a particular task it even happens before my task is done
Is this reproducible? Who is cleaning it and when?
The number of entries in the dataset cache can be controlled via clearml.conf: sdk.storage.cache.default_cache_manager_size
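In clearml.conf that key would sit roughly like this. It is a sketch of the nesting only; the value 100 is the default mentioned in this thread, so check it against your own file before relying on it:

```
sdk {
    storage {
        cache {
            # maximum number of entries kept in the cache (value assumed from this thread)
            default_cache_manager_size: 100
        }
    }
}
```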
Could it be that it actually deleted the cache? How many agents are running on the same machine?
Only one. Will replicate it in detail and see what’s actually up