OSX 12.5.1
Python 3.8.1.
Clearml 1.13.1
"clearml-data add --folder ./*" always flattens everything, I have that reproducible 100%.
Hi ZanySealion18
sorry missed that one
The cache doesn't work, it attempts to download the dataset every time.
just making sure the dataset itself contains all the files?
Once I used clearml-data add --folder * CLI everything works correctly (though all files recursively ended up in the root, I had luck all were named differently).
Not sure I follow here, is the problem the creation of the dataset of fetching it? is this a single version or multiple versions?
Single version. The issue seems to be the creation. If I use "clearml-data sync --folder ." it says it uploaded all the files. Running "clearml-data verify --folder ." says it's all good. Metadata on the WebUI reports the expected number of files. However, once I extract the zips (or download the dataset through Python API or CLI) not all the files are there.
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure so I'd have to write a script to do it manually, but that shouldn't be necessary as clearml-data sync should already be doing that as far as I understand but it seems to have a bug there.
Once I used clearml-data add --folder * API everything works correctly (though all files recursively ended up in the root, I had luck all were named differently).
Clearml 1.13.1
Could you try the latest (1.16.2)? I remember there was a fix specific to Datasets
However, once I extract the zips (or download the dataset through Python API or CLI) not all the files are there.
and all the files are registered in the metadata? coulf you add --verbose
to the sync command to see what it is doing
"clearml-data add --folder ./*" seems to fix this issue though it doesn't preserve my directory structure
This is also odd, it should Not flatten the folder structure. What is your OS / Python / clearml version?
Is this reproducible ? if so, how could we reproduce and debug?
I have a dataset of ~24GB and I've tried multiple times uploading it with the sync function.
- The cache doesn't work, it attempts to download the dataset every time.
- It "misses" some files somehow. So once the job runs it fails due to missing files.
- I've ran verify afterwards (from the machine I used to upload the data) and it says it's all good. However, once I inspect the zip files on the server (look for the files in the specific zip the state json says they're in) the files are indeed missing.
AgitatedDove14 Any ideas on this issue? Thanks!