Answered
Hi guys! Is there any way to get full state of dataset from somewhere, except my S3 bucket? I need a mapping of files and batches that were uploaded as a dataset. Maybe this information is also available in one of the ClearML databases?

Hi guys!

Is there any way to get the full state of a dataset from somewhere other than my S3 bucket?
I need a mapping of files and batches that were uploaded as a dataset. Maybe this information is also available in one of the ClearML databases?

  
  
Posted one year ago

Answers 6


Hi @<1584716355783888896:profile|CornyHedgehog13> , you can only see a list of files inside a dataset/version. I'm afraid you can't really pull individual files since everything is compressed and chunked. You can download individual chunks.

Regarding the second point - there is nothing out of the box but you can get a list of files in all datasets and then compare if some file exists in others.
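The cross-dataset check could be sketched as below. This is a minimal sketch assuming you have already fetched each dataset's file list, e.g. via `Dataset.get(dataset_id=...).list_files()` in the ClearML SDK; the dataset names and paths here are made up for illustration.

```python
def find_duplicate_files(file_lists):
    """Map each relative file path to the datasets that contain it.

    file_lists: dict of dataset name/id -> iterable of relative file paths
                (e.g. the result of Dataset.list_files() per dataset).
    Returns only the paths that appear in more than one dataset.
    """
    seen = {}
    for dataset_id, files in file_lists.items():
        for path in files:
            seen.setdefault(path, []).append(dataset_id)
    return {path: ids for path, ids in seen.items() if len(ids) > 1}


# Example with hypothetical dataset ids and file paths:
duplicates = find_duplicate_files({
    "dataset_a": ["images/cat.png", "images/dog.png"],
    "dataset_b": ["images/cat.png", "labels/train.csv"],
})
# duplicates -> {"images/cat.png": ["dataset_a", "dataset_b"]}
```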

Does that make sense?

  
  
Posted one year ago

I also thought ClearML writes this mapping ( state.json ) into one of its databases: Mongo, Redis, or Elasticsearch.

  
  
Posted one year ago

So, if I pull this file from the S3 bucket, I can conclude which chunk I should download to get a specific file. Am I wrong?

I think you're right. Although I'm not sure if you can decompress individual chunks - worth giving it a try!
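If the chunks turn out to be standard zip archives (an assumption worth verifying against your own bucket), pulling one file out of a single downloaded chunk could look like this sketch; the file names are hypothetical:

```python
import io
import zipfile


def extract_single_file(chunk_bytes, inner_path):
    """Read one file out of a downloaded chunk without unpacking the rest.

    Assumes the chunk is a standard zip archive -- verify this against a
    real chunk from your bucket before relying on it.
    """
    with zipfile.ZipFile(io.BytesIO(chunk_bytes)) as zf:
        return zf.read(inner_path)


# Build a fake in-memory "chunk" just to demonstrate the call:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("images/cat.png", b"fake-image-bytes")
    zf.writestr("labels/train.csv", b"id,label\n1,cat\n")

content = extract_single_file(buf.getvalue(), "labels/train.csv")
# content -> b"id,label\n1,cat\n"
```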

> I also thought ClearML writes this mapping ( state.json ) into one of its databases: Mongo, Redis, Elasticsearch.

I think the state.json is saved like an artifact, so its contents aren't really exposed in any of the DBs.

  
  
Posted one year ago

Thank you @<1523701070390366208:profile|CostlyOstrich36> 🤓

  
  
Posted one year ago

I can add a little piece of context.

  • I want to give my users a way to pick a specific batch to get a file they need. Right now there is no way to download just one specific file from an entire dataset.
  • I need a way to check whether a file has already been uploaded to some other dataset or not.
  
  
Posted one year ago

Hi @<1523701070390366208:profile|CostlyOstrich36> . Thank you for your advice, it definitely makes sense. Regarding the first point, each dataset has a file state.json . In this file there is a key artifact_name , e.g., data , data_001 , etc., and the relative path of a file. I thought I could map this key to the chunk number. So, if I pull this file from the S3 bucket, I can conclude which chunk I should download to get a specific file. Am I wrong?
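The mapping described above could be sketched roughly as follows. The state.json snippet and its field names ( dataset_file_entries , artifact_name ) are hypothetical, reconstructed from the description in this thread; inspect a real state.json from your bucket for the exact schema.

```python
import json

# Hypothetical state.json fragment matching the structure described above
# (real field names and layout may differ -- check your own state.json).
state_json = """
{
  "dataset_file_entries": {
    "images/cat.png":   {"artifact_name": "data",     "size": 1024},
    "labels/train.csv": {"artifact_name": "data_001", "size": 256}
  }
}
"""

state = json.loads(state_json)


def chunk_for_file(state, relative_path):
    """Return the artifact (chunk) name that holds a given file, or None."""
    entry = state["dataset_file_entries"].get(relative_path)
    return entry["artifact_name"] if entry else None


# chunk_for_file(state, "labels/train.csv") -> "data_001"
```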

  
  
Posted one year ago