Answered
Hello! I Have A Small Question Regarding Storage Data Retrieval With ClearML

Hello! I have a small question regarding storage data retrieval with ClearML 😉

Context:
My team uploads thousands of data samples for training as one ClearML dataset. Currently, during training of our models, we spin up a ClearML GPU instance and download all the data into its local cache (using the ClearML Dataset SDK and the get_local_copy function). From there, we are able to read and interact with the data. However, the download takes forever, as we have dozens to hundreds of GB to pull down.
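For reference, the current flow looks roughly like this (a minimal sketch; the project and dataset names below are placeholders):

```python
from clearml import Dataset

# Sketch of the current flow: fetch the dataset and download a full local copy
# into the ClearML cache. "my_project" / "training_samples" are placeholder names.
dataset = Dataset.get(dataset_project="my_project", dataset_name="training_samples")
local_path = dataset.get_local_copy()  # pulls every file down before training starts
print(local_path)
```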

Question:
I am looking for a way not to download the ClearML dataset locally (inside the AWS GPU instance), but instead to mount a directory directly onto the Azure storage location where our data is stored. I dug into the documentation and found storage direct_access. Is that the way to interact with the stored data from the ClearML GPU instance without downloading it? What is the solution to this issue? We could also mount an AWS EC2 instance directory onto the Azure location where the data is, but I am not sure that is possible using the AWS autoscaler provided by ClearML?

Many thanks

Posted 6 months ago

3 Answers


One possible solution I could see as well is moving the data storage to an S3 bucket to improve download performance, since it would be on the same cloud provider and there would be no cross-cloud transfer latency.
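For example, a sketch of re-uploading the dataset so its files live on an S3 bucket in the same region (bucket name, project, and dataset names are placeholders):

```python
from clearml import Dataset

# Sketch: create a new dataset version whose files are stored on S3 in the same
# region as the training instances. Names and bucket are placeholders.
dataset = Dataset.create(dataset_project="my_project", dataset_name="training_samples")
dataset.add_files(path="/path/to/local/data")
dataset.upload(output_url="s3://my-training-bucket/datasets")  # store the files on S3
dataset.finalize()
```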

Posted 6 months ago

Hi @<1556812486840160256:profile|SuccessfulRaven86> , using an S3 bucket in the same region will surely improve performance (it also avoids transfer fees, so that's a big plus 🙂 ).
Regarding mounting external storage into a directory: you do not need to define any direct access for that, simply make sure you point the ClearML SDK storage.cache.default_base_dir to that folder - the ClearML caching should take care of the rest.
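A minimal sketch of the relevant clearml.conf section, assuming the external storage is already mounted at /mnt/azure-data-cache (a placeholder path):

```
# clearml.conf (sketch; /mnt/azure-data-cache is a placeholder mount point)
sdk {
    storage {
        cache {
            default_base_dir: "/mnt/azure-data-cache"
        }
    }
}
```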
BTW, a faster cache (usually faster than mounting an object-storage bucket) could be a cloud instance (an EC2 instance in AWS, for example) with attached storage (EBS, in this example) running an NFS service, with that storage mounted on each machine you spin up via an NFS mount.
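A rough sketch of that NFS mount on each training machine, assuming the NFS server already exports /export/clearml-data and is reachable at 10.0.0.5 (both placeholders):

```
# On each training machine (sketch; server address and paths are placeholders)
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/clearml-cache
sudo mount -t nfs 10.0.0.5:/export/clearml-data /mnt/clearml-cache
```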

Posted 6 months ago

Just keep in mind that your bottleneck will be the transfer rate, so mounting will not save you anything: you still need to transfer the whole dataset to your GPU instance sooner or later.
One solution is as Jake suggests. The other is to pre-download the data to your instance using a cheap CPU-only instance type, then restart the instance with a GPU.

Posted 5 months ago