Hello! I Have A Small Question Regarding Storage Data Retrieval With ClearML

Answered

Hello! I have a small question regarding storage data retrieval with ClearML 😉

Context:
My team uploads thousands of data samples for training as a single ClearML dataset. Currently, to train our models, we spin up a ClearML GPU instance and download all the data into its local cache (using the ClearML Dataset SDK and the get_local_copy function). From there, we can read the data and interact with it. However, the download takes forever, as we have dozens to hundreds of GB to fetch.

Question:
I am looking for a way to avoid downloading the ClearML dataset locally (inside the AWS GPU instance), and instead mount a directory directly onto the Azure storage location where our data is stored. I dug into the documentation and found storage direct_access. Is this the way to interact with the stored data from the ClearML GPU instance without downloading it? If not, what is the solution to this issue? We could also mount a directory on the AWS EC2 instance to the Azure location where the data is, but I am not sure that is possible with the AWS autoscaler provided by ClearML.

Many thanks

  				
Posted one year ago by SuccessfulRaven86


Answers 3

Another possible solution I can see is moving the data to an S3 bucket to improve download performance, since it would then be with the same cloud provider as the GPU instance. That should cut the transfer latency.

  				
Posted one year ago by SuccessfulRaven86

Hi @SuccessfulRaven86, using an S3 bucket in the same region will surely improve performance (it also avoids transfer fees, so that's a big plus 🙂).
Regarding mounting external storage into a directory: you do not actually need to define any direct storage access for that. Simply point the ClearML SDK setting storage.cache.default_base_dir (sdk.storage.cache.default_base_dir in clearml.conf) at that folder, and the ClearML caching should take care of the rest.
BTW, a faster cache (usually faster than mounting an object storage bucket) could be a cloud instance (an EC2 instance in AWS, for example) with attached storage (EBS, in this example) running an NFS service. You would then mount this storage over NFS on each machine you spin up.
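In case it helps, here is a minimal Python sketch of the cache-folder idea above. It assumes the shared folder is already mounted on the instance, and that the CLEARML_CACHE_DIR environment variable overrides storage.cache.default_base_dir when it is set before clearml is imported (please verify this against your ClearML version; editing clearml.conf is the route described above). The helper names, the /mnt/shared_cache path and the dataset ID are hypothetical placeholders.

```python
import os
from pathlib import Path


def point_clearml_cache_at(mount_dir: str) -> Path:
    """Use an already-mounted folder (e.g. an NFS mount) as the ClearML cache."""
    cache_dir = Path(mount_dir).expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: CLEARML_CACHE_DIR takes precedence over
    # sdk.storage.cache.default_base_dir if set before clearml is imported.
    os.environ["CLEARML_CACHE_DIR"] = str(cache_dir)
    return cache_dir


def get_dataset_folder(dataset_id: str) -> str:
    """Return a local read-only copy of the dataset, served from the cache."""
    # Imported lazily so the cache location is configured first.
    from clearml import Dataset

    return Dataset.get(dataset_id=dataset_id).get_local_copy()


# Example usage (placeholders, not real values):
# point_clearml_cache_at("/mnt/shared_cache")
# data_dir = get_dataset_folder("<your-dataset-id>")
```

The first machine to request the dataset still pays for the download, but later machines that mount the same folder should find the files already cached.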

  				
Posted one year ago by SuccessfulKoala55

Just keep in mind that your bottleneck will be the transfer rate. Mounting will not save you anything, as you still need to transfer the whole dataset to your GPU instance sooner or later.
One solution is what Jake suggests. Another is to pre-download the data to your instance while it runs as a cheap CPU-only instance type, then restart the instance as a GPU type.
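To get a feel for the transfer-rate bottleneck, here is a rough back-of-the-envelope sketch in Python. The dataset size and throughput figures are hypothetical examples, not measurements from this setup, and real throughput to object storage is usually below the nominal link speed.

```python
def transfer_time_hours(dataset_gb: float, throughput_gbit_s: float) -> float:
    """Hours needed to move dataset_gb gigabytes at throughput_gbit_s gigabits/s."""
    if throughput_gbit_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = dataset_gb * 8 / throughput_gbit_s  # 8 bits per byte
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical numbers: a 200 GB dataset at a few sustained throughputs.
    for gbit_s in (0.5, 1.0, 10.0):
        hours = transfer_time_hours(200, gbit_s)
        print(f"{gbit_s:>4} Gbit/s -> {hours:.2f} h")
```

Whatever the mechanism (mount, cache, or pre-download), this transfer time has to be paid somewhere; the pre-download trick only moves it onto cheaper CPU-only hours.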

  				
Posted one year ago by ManiacalLizard2
