Answered
Hi! How To Add Files Locally To

Hi! How do I add files locally to a dataset and then upload them to a custom S3 location? The location should be specified within Python code, NOT in clearml.conf. default_output_uri is not an option! The most intuitive way seems to be add_files followed by upload, as in this example: https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py

BUT it doesn't work this way: add_files not only adds files, but also uploads them to a default location! Why? How to just add files locally?

  
  
Posted 3 years ago

Answers 18


AgitatedDove14 yeah, that makes sense, thank you. That means I need to pass a single zip file to the path argument in add_files, right?

The files themselves are not on S3 yet, they are stored locally. That's what I want: register a new dataset and upload the data itself to S3

  
  
Posted 3 years ago

So now it works smoothly

  
  
Posted 3 years ago

AgitatedDove14 Yes, this is exactly what I was looking for. I was running into a 413 Request Entity Too Large error during add_files. This is what helped solve it:
https://github.com/allegroai/clearml/issues/257#issuecomment-736020929

  
  
Posted 3 years ago

MelancholyElk85

How do I add files without uploading them anywhere?

The files themselves need to be packaged into a zip file (so we have an immutable copy of the dataset). This means you cannot "register" existing files (in your example, files on your S3 bucket?!). The idea is to make sure your dataset is protected against changes on the one hand, but on the other to allow you to change it, and only store the changeset.
Does that make sense ?

  
  
Posted 3 years ago

That means I need to pass a single zip file to path argument in add_files, right?

Actually the opposite: you pass a folder (of files) to add_files. add_files then remembers the files' locations (and precalculates the hash of each file's content). When you call upload, it compresses the files that changed into a zip file (or several, depending on the chunk size) and uploads them to the destination specified in the upload call.
If you pass the s3://bucket/folder as output destination for the upload call, clearml will automatically create a subfolder for the dataset and upload the compressed Zip file there.
Is this what you are looking for ?
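
The flow described above can be sketched roughly as follows. This is a minimal sketch, not a definitive recipe: it assumes the clearml package is installed and configured, and that Dataset.create, add_files(path=...), upload(output_url=...) and finalize() behave as described in this thread. The function name register_and_upload and the dataset_cls argument are hypothetical helpers added for illustration; the name, project and bucket values are placeholders.

```python
from pathlib import Path


def register_and_upload(local_folder, output_url, name, project, dataset_cls=None):
    """Register a local folder as a new dataset, then upload it to output_url."""
    if dataset_cls is None:
        # Deferred import (assumption: clearml is installed); a stand-in class
        # can be passed instead to try the call order without a server.
        from clearml import Dataset as dataset_cls

    folder = Path(local_folder)
    if not folder.is_dir():
        raise ValueError(f"{folder} is not a directory")

    dataset = dataset_cls.create(dataset_name=name, dataset_project=project)
    # Pass the folder itself, not a zip file: the files are registered here.
    dataset.add_files(path=str(folder))
    # The destination is chosen per call, in code, not in clearml.conf.
    dataset.upload(output_url=output_url)
    # Close the dataset version once the upload has completed.
    dataset.finalize()
    return dataset
```

Something like register_and_upload("./data", "s3://bucket/folder", "my_dataset", "my_project") would then place each dataset under its own destination, so different datasets can use different S3 addresses without touching the config file.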

  
  
Posted 3 years ago

CostlyOstrich36 thank you for the quick answer! I tried it, but there is still a 413 Request Entity Too Large error, as if it still uses the default fileserver

  
  
Posted 3 years ago

CostlyOstrich36 hi! yes, as I expected, it doesn't see any files unless I call add_files first

But add_files has no output_url parameter and tries to upload to the default place. This returns 413 Request Entity Too Large error because there are too many files, so using the default location is not an option. Could you please help with this?

  
  
Posted 3 years ago

MelancholyElk85, I think the upload() function has the parameter you need: output_url

https://github.com/allegroai/clearml/blob/a68f832a8a12665f7705cfbf14c5fe195f6d7469/clearml/datasets/dataset.py#L323

  
  
Posted 3 years ago

MelancholyElk85 , it looks like add_files has the following parameter: dataset_path
Try with it 🙂

  
  
Posted 3 years ago

From the looks of it, yes. But give it a try to see how it behaves without

  
  
Posted 3 years ago

CostlyOstrich36 there is an old similar thread, but they recommend changing the config

https://clearml.slack.com/archives/CTK20V944/p1626722835308600?thread_ts=1626600358.282400&cid=CTK20V944

  
  
Posted 3 years ago

There seems to be no way to change default_output_uri from the code.

Dataset.create calls Task.create, which in turn accepts the add_task_init_call flag. Task.init accepts output_uri, but we cannot add arguments with add_task_init_call, so we cannot change output_uri from Dataset.create, right?

  
  
Posted 3 years ago

Changing sdk.development.default_output_uri in clearml.conf seems to be a bad idea, because different datasets will likely have different addresses on S3

  
  
Posted 3 years ago

I'm afraid that would be the best method. You could probably hack something into clearml sdk yourself since it's open source

  
  
Posted 3 years ago

At add_files. There is no upload call, because add_files uploads the files by itself, if I understood correctly

  
  
Posted 3 years ago

AgitatedDove14 SuccessfulKoala55 maybe you know. How do I add files without uploading them anywhere?

  
  
Posted 3 years ago

Yeah, but do I need to call add_files first?

  
  
Posted 3 years ago

Does it fail at add_files or at upload ?

  
  
Posted 3 years ago