I can add a little piece of context.
- I want to give my users a way to pick a specific batch to get a file they need. Right now there is no way to download just one specific file from an entire dataset.
- I need a way to check whether a file has already been uploaded to some other dataset or not.
Hi @<1584716355783888896:profile|CornyHedgehog13> , you can only see a list of files inside a dataset/version. I'm afraid you can't really pull individual files since everything is compressed and chunked. You can download individual chunks.
Regarding the second point - there is nothing out of the box but you can get a list of files in all datasets and then compare if some file exists in others.
Does that make sense?
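For the second point, the "list files in all datasets and compare" idea could look roughly like the sketch below. The helper names `index_files` and `collect_from_clearml` are made up for illustration. The ClearML calls (`Dataset.list_datasets`, `Dataset.get`, `list_files`) are used as I understand the SDK, so check them against your installed version. Note that this compares relative paths only, so the same content stored under a different name would not match.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_files(files_by_dataset: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Invert {dataset_id: [relative paths]} into {relative path: [dataset_ids]}."""
    index = defaultdict(list)
    for dataset_id, paths in files_by_dataset.items():
        for path in paths:
            index[path].append(dataset_id)
    return dict(index)


def collect_from_clearml(project: str) -> Dict[str, List[str]]:
    """Gather file lists from ClearML (hypothetical helper).

    Needs the clearml package and a configured server.
    """
    from clearml import Dataset  # imported lazily so index_files works offline

    files_by_dataset = {}
    for meta in Dataset.list_datasets(dataset_project=project):
        dataset = Dataset.get(dataset_id=meta["id"])
        files_by_dataset[meta["id"]] = dataset.list_files()
    return files_by_dataset
```

With the index built, `index.get("some/relative/path.csv", [])` gives the ids of every dataset that already lists that path, and an empty list means it has not been uploaded anywhere in the project.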
Hi @<1523701070390366208:profile|CostlyOstrich36>, thank you for your advice, it definitely makes sense. Regarding the first point, each dataset has a file state.json. In this file there is a key artifact_name (e.g., data, data_001, etc.) and the relative path of each file. I thought I could map this key to the chunk number. So, if I pull this file from the S3 bucket, I can determine which chunk I should download to get a specific file. Am I wrong?
I also thought ClearML writes this mapping (state.json) into one of its databases: Mongo, Redis, Elasticsearch.
So, if I pull this file from the S3 bucket, I can determine which chunk I should download to get a specific file. Am I wrong?
I think you're right. Although I'm not sure if you can decompress individual chunks - worth giving it a try!
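A minimal sketch of that lookup, in case it helps. It rests on two assumptions that are not confirmed in this thread: that state.json keeps its entries in a list under a key like `dataset_file_entries`, each with `relative_path` and `artifact_name`, and that a downloaded chunk is a regular zip archive. The function names `find_chunk` and `extract_one` are hypothetical.

```python
import json
import zipfile
from typing import Optional


def find_chunk(state_json_text: str, relative_path: str) -> Optional[str]:
    """Return the artifact_name (chunk) recorded for relative_path, or None."""
    state = json.loads(state_json_text)
    # Assumed layout: a list of file entries under "dataset_file_entries".
    for entry in state.get("dataset_file_entries", []):
        if entry.get("relative_path") == relative_path:
            return entry.get("artifact_name")
    return None


def extract_one(chunk_zip_path: str, relative_path: str, target_dir: str) -> str:
    """Extract a single member from a downloaded chunk, assuming it is a zip."""
    with zipfile.ZipFile(chunk_zip_path) as archive:
        # ZipFile.extract returns the path of the extracted file.
        return archive.extract(relative_path, path=target_dir)
```

If the chunk turns out not to be a zip, `zipfile.ZipFile` raises `BadZipFile`, which at least answers the "can individual chunks be decompressed" question quickly.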
I also thought ClearML writes this mapping (state.json) into one of its databases: Mongo, Redis, Elasticsearch.
I think state.json is saved as an artifact, so its contents aren't really exposed in any of the DBs.
Thank you @<1523701070390366208:profile|CostlyOstrich36> 🤓