Hi, I Am Trying To Use Clearml-Data To Upload My Data To S3, Which Is Password Protected. How Should I Indicate The Credentials After I Set --Storage S3://.... ?

Answered

Hi, I am trying to use clearml-data to upload my data to S3, which is password protected. How should I indicate the credentials after I set --storage s3://.... ?

  				
Posted 4 years ago by SubstantialElk6


Answers 7

Hi SubstantialElk6 ,

You can configure S3 credentials in your ~/clearml.conf file, or with environment variables:
import os
os.environ['AWS_ACCESS_KEY_ID'] = "***"
os.environ['AWS_SECRET_ACCESS_KEY'] = "***"
os.environ['AWS_DEFAULT_REGION'] = "***"
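
As a rough sketch of the environment-variable route, the snippet below sets the three variables from Python before any ClearML call and then uploads a folder through the `Dataset` SDK. The helper names, the project and dataset names, and the bucket URL are placeholders for illustration and are not from the original answer; it assumes the `clearml` package is installed and that `Dataset.create`, `add_files`, `upload(output_url=...)` and `finalize` behave as in the current SDK.

```python
import os


def set_s3_credentials(key: str, secret: str, region: str) -> None:
    """Export the AWS environment variables named in the answer above."""
    os.environ["AWS_ACCESS_KEY_ID"] = key
    os.environ["AWS_SECRET_ACCESS_KEY"] = secret
    os.environ["AWS_DEFAULT_REGION"] = region


def upload_folder(folder: str, output_url: str, project: str, name: str) -> str:
    """Hypothetical helper: create a dataset, add a folder, and upload it."""
    # Imported lazily so the credentials helper works without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=folder)
    # output_url is a placeholder such as "s3://my-bucket/datasets".
    dataset.upload(output_url=output_url)
    dataset.finalize()
    return dataset.id
```

Calling `set_s3_credentials(...)` first and `upload_folder(...)` second keeps the secrets out of the command line; the `~/clearml.conf` route avoids putting them in code at all.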

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

I see, so it's a path. Another question: as far as I can tell, clearml-data will download the entire dataset before training starts. This isn't ideal when we are dealing with billions of datasets (e.g. we might want to download a subset at a time, send it to the GPU for training, and use the CPU to concurrently pull another subset). Any comments on this?

  				
Posted 4 years ago by SubstantialElk6

Let me check if I can think of something else (I know the enterprise edition has full support for this, and for unstructured data too).

BTW, ClearML always uses a cache, so the big download is done only once.

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Got that, thanks. Just to better understand: when clearml-data uploads my recursive folder of image data, it converts it into a compressed form with a different folder structure than the original dataset.

When my software pulls the data, I'm returned a str. How would we manipulate the data from there?

  				
Posted 4 years ago by SubstantialElk6

SubstantialElk6 you can try:

from clearml import Dataset
# dataset_task holds the ID of the dataset to fetch
dataset_upload_task = Dataset.get(dataset_id=dataset_task)
path_with_data = dataset_upload_task.get_local_copy()
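
To illustrate what can be done with the returned string, here is a minimal sketch that treats it as the path to a local folder and walks it for image files. The function names and the set of image suffixes are assumptions made for the example; only `Dataset.get` and `get_local_copy` come from the answer above.

```python
from pathlib import Path
from typing import List

# Assumed image extensions; adjust to match the actual data.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(root: str) -> List[Path]:
    """Recursively collect image files under a local dataset folder."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def local_images(dataset_id: str) -> List[Path]:
    """Hypothetical helper: fetch a dataset copy and list the images in it."""
    # Imported lazily; requires the clearml package and a configured server.
    from clearml import Dataset

    path_with_data = Dataset.get(dataset_id=dataset_id).get_local_copy()
    return list_images(path_with_data)
```

The resulting list of paths can then be handed to whatever loader the training code already uses.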

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

Like creating multiple datasets?
create parent (all) - upload to S3
create child1 (first 100k)
create child2 (second 100k), and so on

Then only pull indices from the children. Technically workable, but I'm not sure it's the best approach since different people have different batch sizes in mind.

  				
Posted 4 years ago by SubstantialElk6

get_local_copy() will return the entire dataset, but you can divide the dataset into parts and have the same parent for all of them. Can this work?
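
If the data is split into several smaller datasets, one possible way to overlap download and training is to fetch the next part in a background thread while the current part is being used. The sketch below is only an illustration of that idea: `prefetch_chunks` and `fetch_clearml_dataset` are made-up names, the list of part IDs is assumed to be known in advance, and nothing here is a built-in clearml-data feature.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator


def prefetch_chunks(chunk_ids: Iterable[str],
                    fetch: Callable[[str], str]) -> Iterator[str]:
    """Yield each part's local path while the next part downloads in the background."""
    ids = list(chunk_ids)
    if not ids:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, ids[0])
        for next_id in ids[1:]:
            path = pending.result()                # wait for the current part
            pending = pool.submit(fetch, next_id)  # start fetching the next part
            yield path                             # caller trains on this part meanwhile
        yield pending.result()


def fetch_clearml_dataset(dataset_id: str) -> str:
    """Assumed fetch function: download one dataset part and return its local path."""
    # Imported lazily; requires the clearml package and a configured server.
    from clearml import Dataset

    return Dataset.get(dataset_id=dataset_id).get_local_copy()
```

A training loop would then look like `for path in prefetch_chunks(part_ids, fetch_clearml_dataset): train_on(path)`, where `part_ids` and `train_on` are placeholders. Because each part is cached after the first download, later epochs should not hit S3 again.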

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

