Hi, We Have A Use Case That We Would Like To Upload A Local Folder Into The Cloud

Answered

Hi,
We have a use case where we would like to upload a local folder into the cloud AS-IS, without compressing it or breaking it up into chunks.
I tried running the upload command as follows:
ds.upload(output_url='gs://<BUCKET>/', compression=0, chunk_size=1)
but the result is that each file is converted into a single folder containing a zip file.
I'm guessing that the solution would require passing a different object to the ParallelZipper (instead of the one at https://github.com/allegroai/clearml/blob/0e283dd514bce2366584435a91c2ffa95340343b/clearml/utilities/parallel.py#L192 ).
Is this the correct approach?

  				
Posted 2 years ago by OutrageousSheep60


Answers 9

OutrageousSheep60 before I can answer, maybe you can explain why "zipping" them does not fit your workflow?

  				
Posted 2 years ago by AgitatedDove14

"I think the main difference is that I can see value in having access to the raw format within the cloud vendor, and not only having it as an archive"

I see, it does make sense.
Two options. One: as you mentioned, use the ClearML StorageManager to upload the files, then register them as external links with Dataset.
Two: I know the enterprise tier has HyperDatasets, which are essentially what you describe, with version control over the "metadata" and "raw storage" on GCP, including the ability to review the files from the web UI. Unfortunately, there is no direct equivalent in the open-source version.
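
The first option can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (remote_url_for, upload_raw_and_register) are hypothetical, the bucket prefix, dataset name, and project are placeholders supplied by the caller, and it assumes that StorageManager.upload_file and Dataset.add_external_files behave as described in the ClearML SDK documentation for the installed version.

```python
from pathlib import Path, PurePosixPath


def remote_url_for(local_root: str, local_file: str, remote_prefix: str) -> str:
    """Map a local file to a URL under remote_prefix, keeping its relative path."""
    relative = Path(local_file).relative_to(local_root)
    return remote_prefix.rstrip("/") + "/" + PurePosixPath(*relative.parts).as_posix()


def upload_raw_and_register(local_root: str, remote_prefix: str,
                            dataset_name: str, dataset_project: str) -> None:
    """Upload every file as-is, then register the remote objects as external links."""
    # Imported lazily so the path helper above works without clearml installed.
    from clearml import Dataset, StorageManager

    # Step 1: copy each raw file to the bucket, preserving the folder layout.
    for path in sorted(Path(local_root).rglob("*")):
        if path.is_file():
            StorageManager.upload_file(
                local_file=str(path),
                remote_url=remote_url_for(local_root, str(path), remote_prefix),
            )

    # Step 2: create a dataset that links to the raw objects instead of zipping them.
    ds = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    ds.add_external_files(source_url=remote_prefix.rstrip("/") + "/")
    # Assumption: with only external links registered, upload() records the
    # dataset state and has no local files to archive.
    ds.upload()
    ds.finalize()
```

With this layout the objects in the bucket stay in their original format, and the dataset only records where they live.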

  				
Posted 2 years ago by AgitatedDove14

OutrageousSheep60 so if this is the case, I think you need to add "external links", i.e. upload the individual files to GCS, then register the links to GCS. Does that make sense?

  				
Posted 2 years ago by AgitatedDove14

Hi OutrageousSheep60

"AS-IS"

"without compressing or breaking it up into chunks."

So for that I would suggest manually archiving it and uploading it as an external link?
Or are you saying you want to control the compression used by the Dataset class?
https://github.com/allegroai/clearml/blob/72d9b22e0d27f317a364acfeacbcf5c70f852e8c/clearml/datasets/dataset.py#L603

  				
Posted 2 years ago by AgitatedDove14

In order to create a WebDataset we need to create tar files, so we would need to unzip the archive and then recreate the tar file.
Additionally, when the files are in GCS in the raw format you can easily review them with the preview (e.g. a wav file can be listened to directly within the GCP console in a web browser).
I think the main difference is that I can see value in having access to the raw format within the cloud vendor, and not only having it as an archive.
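
To illustrate the repacking step, here is a minimal sketch of building a tar shard from raw files using only Python's standard tarfile module. The function names (build_shard, list_members) and the sample file names are hypothetical, and the sketch assumes the WebDataset convention that members sharing a basename (e.g. 0001.wav and 0001.json) belong to one sample and should sit next to each other in the archive.

```python
import io
import tarfile
from typing import Dict, List


def build_shard(samples: Dict[str, bytes]) -> bytes:
    """Pack {member_name: payload} into an uncompressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        # Sorting keeps members that share a basename adjacent in the archive.
        for name in sorted(samples):
            payload = samples[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def list_members(shard: bytes) -> List[str]:
    """Return the member names of a tar archive, in stored order."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r") as tar:
        return tar.getnames()


# Hypothetical raw files, as they might be read from the bucket.
shard = build_shard({"0001.wav": b"RIFF", "0001.json": b"{}", "0000.wav": b"RIFF"})
members = list_members(shard)
```

When the raw files are already available in the bucket, this step can read them directly; starting from a zip archive adds an extract step before it.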

  				
Posted 2 years ago by OutrageousSheep60

OutrageousSheep60 so this should work, no?
ds.upload(output_url='gs://<BUCKET>/', compression=0, chunk_size=100000000000)
Notice the chunk size is the maximum size (in bytes) per chunk, so it should basically be very large.

  				
Posted 2 years ago by AgitatedDove14

This does not work, since all the files are stored as a single ZIP file (which, if unzipped, will contain all the data), but we would like to have access to the raw files in their original format.
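
The distinction can be shown with Python's standard zipfile module. This is a sketch under one assumption: that compression=0 in the upload call corresponds to zipfile.ZIP_STORED, whose numeric value is 0. The function name and the file names are hypothetical. Storing without compression keeps the bytes unchanged, but the files are still wrapped in a single archive object rather than exposed as individual objects.

```python
import io
import zipfile
from typing import Dict


def store_without_compression(files: Dict[str, bytes]) -> bytes:
    """Write files into one zip archive using ZIP_STORED (no compression)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()


# Two hypothetical raw files end up inside a single archive blob.
archive_bytes = store_without_compression({"a.wav": b"RIFF0000", "b.wav": b"RIFF1111"})

with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    infos = archive.infolist()

# Payloads are byte-for-byte unchanged in size, yet only reachable via the archive.
sizes_match = all(info.compress_size == info.file_size for info in infos)
```

So a preview tool in the cloud console sees one zip object, not the individual wav files.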

  				
Posted 2 years ago by OutrageousSheep60

That is a workaround, but surely not optimal.
If we want to generate a dataset from a set of files that are on a local computer (e.g. a local GPU workstation that ran some media transformation), then instead of creating the Dataset directly we need to first upload them and only then use the ClearML SDK.
Do you see any option for integrating this kind of workflow into ClearML?

  				
Posted 2 years ago by OutrageousSheep60

We want to use the dataset output_uri as common ground for creating additional dataset formats such as https://webdataset.github.io/webdataset/

  				
Posted 2 years ago by OutrageousSheep60
