Answered
Hi, I am trying to use clearml-data to upload my data to S3, which is password protected. How should I indicate the credentials after I set --storage s3://.... ?


  
  
Posted 3 years ago

Answers 7


Got that, thanks. Just to better understand: when clearml-data uploads my recursive folder of image data, it converts it into a compressed form with a different folder structure than the original dataset.

When my software pulls the data, I get back a str. How would we manipulate the data from there?

  
  
Posted 3 years ago

Let me check if I can think of something else (I know the enterprise edition has full support for this, and for unstructured data too).

BTW, ClearML always uses a cache, so the big download is done only once.

  
  
Posted 3 years ago

SubstantialElk6 you can try:

from clearml import Dataset

# dataset_task holds the ID of the dataset to fetch
dataset_upload_task = Dataset.get(dataset_id=dataset_task)
path_with_data = dataset_upload_task.get_local_copy()
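
Since get_local_copy() hands back a path string pointing at a local folder, one way to work with it is to wrap it in pathlib and walk the files. A minimal sketch, assuming the dataset holds image files; the helper name list_files and the suffix set are illustrative and not part of ClearML:

```python
from pathlib import Path


def list_files(root, suffixes=(".jpg", ".jpeg", ".png")):
    """Return paths of files under root whose suffix matches (case-insensitive)."""
    root_path = Path(root)
    return sorted(
        p for p in root_path.rglob("*")
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Usage (assumes path_with_data came from Dataset.get(...).get_local_copy()):
# for image_path in list_files(path_with_data):
#     ...  # load the file with the library of your choice
```

Because the search is recursive, it does not matter how deeply the files are nested under the returned folder.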

  
  
Posted 3 years ago

Hi SubstantialElk6 ,

You can configure S3 credentials in your ~/clearml.conf file, or with environment variables:
import os

os.environ['AWS_ACCESS_KEY_ID'] = "***"
os.environ['AWS_SECRET_ACCESS_KEY'] = "***"
os.environ['AWS_DEFAULT_REGION'] = "***"
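
If you go the environment-variable route, it can help to check that all three variables are actually set before starting the upload. A small sketch; the helper name missing_aws_vars is made up for illustration and is not a ClearML function:

```python
import os

REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_vars(env=None):
    """Return the names of required AWS variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Missing S3 credentials:", ", ".join(missing))
```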

  
  
Posted 3 years ago

I see, so it's a path. Another question: as far as I can tell, clearml-data will download the entire dataset before starting training. This isn't ideal when we are dealing with billions of datasets (e.g. we might want to download a subset at a time, send it to the GPU for training, and then use the CPU to concurrently pull another subset). Any comments on this?
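
To make the pattern I have in mind concrete, here is a rough sketch of overlapping the download of the next subset with training on the current one. This is a generic Python pattern, not a clearml-data feature; fetch is a hypothetical callable (for example, one that wraps Dataset.get(dataset_id=key).get_local_copy()), and the keys are whatever identifies each subset:

```python
from concurrent.futures import ThreadPoolExecutor


def prefetching_iter(keys, fetch):
    """Yield fetch(key) for each key, fetching the next one in a background thread."""
    keys = list(keys)
    if not keys:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, keys[0])
        for next_key in keys[1:]:
            current = pending.result()              # wait for the subset in flight
            pending = pool.submit(fetch, next_key)  # start the next download
            yield current                           # caller trains while it runs
        yield pending.result()


# Usage (train_on is a placeholder for the training step):
# for local_path in prefetching_iter(subset_ids, fetch):
#     train_on(local_path)
```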

  
  
Posted 3 years ago

Like creating multiple datasets?
create parent (all) - upload to S3
create child1 (first 100k)
create child2 (second 100k), and so on

Then only pull indices from the children. Technically workable, but I'm not sure it's the best approach, since different people have different batch sizes in mind.
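
For the splitting step itself, a minimal sketch of cutting a file list into consecutive groups of a fixed size (100k in the example above). The function name split_into_chunks is illustrative; how each group is then registered as its own dataset is left out, since that depends on the setup:

```python
def split_into_chunks(items, chunk_size):
    """Split items into consecutive lists of at most chunk_size elements."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    items = list(items)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Usage: one group of file paths per child dataset
# groups = split_into_chunks(all_file_paths, 100_000)
```

The chunk size is fixed at split time, which is the drawback mentioned above: anyone wanting a different batch size would need a different split.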

  
  
Posted 3 years ago

get_local_copy() will return the entire dataset, but you can divide the dataset into parts and have the same parent for all of them. Can this work?

  
  
Posted 3 years ago