Answered
Hi Everyone, Is It Possible To Not Create A Copy Of A Dataset When Adding To Clearml? My Data Is Already In A Directory On The Clearml-Server Machine And I Do Not Want To Copy It, Just Add It To Clearml As Dataset.

Hi everyone,
is it possible to not create a copy of a dataset when adding it to clearml? My data is already in a directory on the clearml-server machine and I do not want to copy it, just add it to clearml as a dataset.

  
  
Posted 2 years ago

Answers 24


How about, instead of uploading the entire dataset to the clearml server, uploading a text file with the location of the dataset on the machine? I would think that should do the trick.
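Just to illustrate the idea, a minimal sketch with the Python SDK (the dataset name, project name, and path are made up here):

    from clearml import Dataset

    # write only the location of the real data into a small pointer file
    with open("dataset_location.txt", "w") as f:
        f.write("/mnt/network_storage/my_dataset\n")

    # register a dataset that contains just the pointer file, not the data itself
    ds = Dataset.create(dataset_name="dataset_pointer", dataset_project="examples")
    ds.add_files("dataset_location.txt")
    ds.upload()
    ds.finalize()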

  
  
Posted 2 years ago

Sounds like a good hack, but not like a good solution 😄 But thank you anyway! 🙂

  
  
Posted 2 years ago

I normally just upload the data to the ClearML server and then remove it locally from my machine, but I understand that isn't what you want. A quick hack was the only thing I could come up with at the moment xd. Anyway, you're welcome. Hope you find a solution.

  
  
Posted 2 years ago

Yea, the real problem is that I have very large datasets in network storage. I am looking for a way to add the datasets on the network storage as a clearml-dataset.

  
  
Posted 2 years ago

And clearml-agent should pull these datasets from network storage...

  
  
Posted 2 years ago

Maybe a related question: Has anyone ever worked with datasets larger than the clearml-agent cache? A colleague of mine has a dataset of ~1 terabyte...

  
  
Posted 2 years ago

I understand your problem. I think you can normally specify where you want the data to be stored in a conf file somewhere; people here can guide you better. However, in my experience, it kinda uploads the data and stores it in its own format.

  
  
Posted 2 years ago

AgitatedDove14 SuccessfulKoala55 Could you briefly explain whether clearml supports no-copy add for datasets?

  
  
Posted 2 years ago

My data is already in a directory on the clearml-server machine and I do not want to copy it, just add it to clearml as dataset.

So the short answer is: no, it needs to package it (read "zip it").
The reason is that clearml-data creates an immutable copy, and just "pointing" to files located somewhere will usually break very easily.
That said, it would actually be relatively easy to add, since the dataset itself stores links to the files, and these links could point to an S3 bucket (for example).
wdyt?
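For context, this is the standard packaging flow being described, as a rough sketch (the project name, path, and S3 destination are placeholders):

    from clearml import Dataset

    # the added files are compressed and uploaded as an immutable dataset version
    ds = Dataset.create(dataset_name="my_dataset", dataset_project="examples")
    ds.add_files(path="/mnt/network_storage/my_dataset")
    # the upload destination can be the clearml fileserver or e.g. an S3 bucket
    ds.upload(output_url="s3://my-bucket/datasets")
    ds.finalize()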

  
  
Posted 2 years ago

Thank you for answering. So your suggestion would be similar to VexedCat68's first idea, right?

  
  
Posted 2 years ago

Yes, consider VexedCat68's txt file the Dataset "content". This will enable you to safely get the list of files, and then you can use the StorageManager to download them. We could extend this concept and have it built into the Dataset itself, i.e. allow you to add files as links and make sure it will just download them. The caveat here is that the Dataset, at the end, returns a folder with the files; when you specify links, you also have to specify the target location locally (at the end you want a folder with everything there). Make sense?
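A rough sketch of the consumer side under that scheme, assuming the pointer-file dataset from earlier and that the listed links are URLs the StorageManager can fetch (all names and paths here are hypothetical):

    from clearml import Dataset, StorageManager

    # get the dataset that only contains the pointer/list file
    folder = Dataset.get(
        dataset_project="examples", dataset_name="dataset_pointer"
    ).get_local_copy()

    # read the file links listed in the pointer file
    with open(folder + "/dataset_location.txt") as f:
        links = [line.strip() for line in f if line.strip()]

    # download each linked file into the local cache
    local_copies = [StorageManager.get_local_copy(remote_url=link) for link in links]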

  
  
Posted 2 years ago

AgitatedDove14 Your second option is somewhat like how shortcuts work, right? Storing pointers to the actual data?

  
  
Posted 2 years ago

Yes, though the main caveat is the data is not really immutable 😞

  
  
Posted 2 years ago

I understand that storing data outside ClearML won't ensure its immutability. I guess this can be built in as a feature into ClearML at some future point.

  
  
Posted 2 years ago

Sounds good. I think it is obvious that immutability has to be managed by the user then, but this is not different from not using clearml-data, so not a disadvantage in my opinion.

  
  
Posted 2 years ago

I guess this can be built in as a feature into ClearML at some future point.

VexedCat68 you mean referencing an external link?

  
  
Posted 2 years ago

Yeah

  
  
Posted 2 years ago

but this is not different from not using clearml-data,

ReassuredTiger98 just making sure we are on the same page: clearml-data immutability is fixed, meaning the user cannot change the content of the dataset (it is actually compressed and uploaded). If you want to change it, you create a new child version.
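For illustration, a minimal sketch of that child-version flow (dataset/project names and the path are placeholders):

    from clearml import Dataset

    # the existing, immutable version
    parent = Dataset.get(dataset_project="examples", dataset_name="my_dataset")

    # changes go into a new child version instead of mutating the parent
    child = Dataset.create(
        dataset_name="my_dataset",
        dataset_project="examples",
        parent_datasets=[parent.id],
    )
    child.add_files(path="/mnt/network_storage/new_files")
    child.upload()
    child.finalize()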

  
  
Posted 2 years ago

VexedCat68 makes sense. We could also (if implementing this feature) add a special Tag to the dataset, so you know it contains "external" links. wdyt?

  
  
Posted 2 years ago

Yeah, that would make it clearer.

  
  
Posted 2 years ago

Good point

  
  
Posted 2 years ago

Anyone want to open a GitHub issue, so we actually end up implementing it 😉?

  
  
Posted 2 years ago

Yea, the clearml-data is immutable, but not the underlying data if I just store a pointer to some location.

  
  
Posted 2 years ago

I'll add creating an issue to my to-do list.

  
  
Posted 2 years ago