Hi guys! Is there any way to get the full state of a dataset from somewhere other than my S3 bucket? I need a mapping of files and batches that were uploaded as a dataset. Maybe this information is also available in one of the ClearML databases?

Answered

Hi guys!

Is there any way to get the full state of a dataset from somewhere other than my S3 bucket?
I need a mapping of files and batches that were uploaded as a dataset. Maybe this information is also available in one of the ClearML databases?

  				
Posted one year ago by CornyHedgehog13


Answers 6

I also thought ClearML writes this mapping (state.json) into one of its databases: Mongo, Redis, Elasticsearch.

  				
Posted one year ago by CornyHedgehog13

Hi CornyHedgehog13 , you can only see a list of files inside a dataset/version. I'm afraid you can't really pull individual files since everything is compressed and chunked. You can download individual chunks.

Regarding the second point - there is nothing out of the box but you can get a list of files in all datasets and then compare if some file exists in others.

Does that make sense?
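The comparison described above could be sketched roughly as follows. This is only an illustration, not an official recipe: `find_shared_files` and `collect_files_from_clearml` are hypothetical helper names, and the sketch assumes that `Dataset.list_datasets()` returns dictionaries with an `id` key and that `Dataset.list_files()` returns relative paths. It compares files by relative path only, not by content hash.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_shared_files(files_by_dataset: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return {relative_path: [dataset ids]} for paths found in more than one dataset."""
    owners = defaultdict(list)
    for dataset_id, paths in files_by_dataset.items():
        for path in set(paths):  # de-duplicate paths within a single dataset
            owners[path].append(dataset_id)
    return {path: sorted(ids) for path, ids in owners.items() if len(ids) > 1}


def collect_files_from_clearml(project: str) -> Dict[str, List[str]]:
    """Hypothetical helper: gather the file list of every dataset in a project.

    Needs the clearml package and a configured server, so the import is
    done lazily and nothing here runs unless the function is called.
    """
    from clearml import Dataset

    files_by_dataset = {}
    for info in Dataset.list_datasets(dataset_project=project):
        dataset = Dataset.get(dataset_id=info["id"])
        files_by_dataset[info["id"]] = dataset.list_files()
    return files_by_dataset
```

With a live server, something like `find_shared_files(collect_files_from_clearml("my_project"))` (the project name is a placeholder) would list the paths that appear in more than one dataset.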

  				
Posted one year ago by CostlyOstrich36

Thank you CostlyOstrich36 🤓

  				
Posted one year ago by CornyHedgehog13

"So, if I pull this file from the S3 bucket, I can conclude which chunk I should download to get a specific file. Am I wrong?"

I think you're right. Although I'm not sure if you can decompress individual chunks - worth giving it a try!
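One way to try this out is sketched below. It assumes that a downloaded chunk is an ordinary ZIP archive, which this thread does not confirm, so treat it as an experiment rather than a documented workflow. `extract_one` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_one(chunk_path: str, relative_path: str, out_dir: str) -> Path:
    """Extract a single member from a chunk, assuming the chunk is a ZIP archive."""
    with zipfile.ZipFile(chunk_path) as archive:
        if relative_path not in archive.namelist():
            raise KeyError(f"{relative_path!r} is not in {chunk_path!r}")
        # extract() writes only the requested member and returns its path on disk
        return Path(archive.extract(relative_path, path=out_dir))
```

If the chunk turns out not to be a ZIP file, `zipfile.ZipFile` raises `zipfile.BadZipFile`, which at least answers the question quickly.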

"I also thought ClearML writes this mapping (state.json) into one of its databases: Mongo, Redis, Elasticsearch."

I think state.json is saved as an artifact, so its contents aren't really exposed in any of the databases.

  				
Posted one year ago by CostlyOstrich36

Hi CostlyOstrich36. Thank you for your advice, it definitely makes sense. Regarding the first point, each dataset has a state.json file. In this file there is a key artifact_name (e.g., data, data_001, etc.) and the relative path of each file. I thought I could map this key to the chunk number. So, if I pull this file from the S3 bucket, I can work out which chunk I should download to get a specific file. Am I wrong?
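The mapping idea above might look something like the following sketch. The `artifact_name` key comes from the description above; the top-level `dataset_file_entries` list and the `relative_path` key are assumptions about the layout of state.json and should be checked against a real file. The function names are hypothetical.

```python
import json
from collections import defaultdict
from typing import Dict, List, Optional


def load_state(path: str) -> dict:
    """Load a state.json file that was pulled from the bucket."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def files_by_chunk(state: dict) -> Dict[str, List[str]]:
    """Group relative file paths by the chunk (artifact_name) that holds them."""
    mapping = defaultdict(list)
    # "dataset_file_entries" and "relative_path" are assumed key names
    for entry in state.get("dataset_file_entries", []):
        mapping[entry["artifact_name"]].append(entry["relative_path"])
    return dict(mapping)


def chunk_for_file(state: dict, relative_path: str) -> Optional[str]:
    """Return the artifact_name of the chunk containing a file, or None if absent."""
    for entry in state.get("dataset_file_entries", []):
        if entry["relative_path"] == relative_path:
            return entry["artifact_name"]
    return None
```

A call such as `chunk_for_file(load_state("state.json"), "images/cat.png")` (both arguments are placeholders) would then name the chunk to download for that file.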

  				
Posted one year ago by CornyHedgehog13

I can add a little piece of context.

I want to give my users a way to pick a specific batch to get a file they need. Right now there is no way to download just one specific file from an entire dataset.
I need a way to check whether a file has already been uploaded to some other dataset or not.

  				
Posted one year ago by CornyHedgehog13
