Answered
Hi, Love What You Guys Did With The New Datasets! I Need Some Help Though. I Assume There Will Be A No-Code Way To Do This, Maybe Not Now But In The Future. But Anyway, I Have Three Different Datasets, And I Want To Create A Merged Version Of All Three Of

Hi, love what you guys did with the new datasets!
I need some help though.
I assume there will be a no-code way to do this, maybe not now but in the future. But anyway, I have three different datasets, and I want to create a merged version of all three of them so that when the user requests a single ID she will get all datasets downloaded at the same time. I do not wish for data duplication. Any idea how to do this with clearml-data CLI/GUI/python?

  
  
Posted one year ago

Answers 10


but can it NOT use /tmp for this i’m merging about 100GB

You mean configuring the temp folder used when squashing?
You can hack it as follows:
` import tempfile
tempfile.tempdir = "/my/new/temp"

# run the Dataset squash here

tempfile.tempdir = None `
But regardless, I think this is worth a GitHub issue with a feature request to make the temp folder configurable.
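A slightly safer variant of the hack above, as a minimal sketch: it wraps the override in a context manager so the previous setting is restored even if the squash fails. It assumes, as the answer suggests, that the squash picks its scratch directory through Python's `tempfile` defaults. The helper name `temp_root` is made up for illustration.

```python
import contextlib
import tempfile


@contextlib.contextmanager
def temp_root(path):
    """Temporarily make `path` the default directory used by the tempfile module."""
    previous = tempfile.tempdir
    tempfile.tempdir = path
    try:
        yield path
    finally:
        # Restore the old setting even if the body raises.
        tempfile.tempdir = previous


# Usage sketch (the path is a placeholder and must already exist):
# with temp_root("/my/new/temp"):
#     ...  # run the Dataset squash here
```

This only affects the current Python process, so it does not help when the squash is launched from the command line.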

  
  
Posted one year ago

Yeah, the hack would work, but I’m trying to use it from the command line to put in Airflow. I’ll post on GH.

Oh, then set the TMP/TMPDIR environment variable; it should have the same effect.
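For the command-line case, a rough sketch of the idea in Python: launch the command in a child process whose `TMPDIR` points at the scratch drive. The helper names are hypothetical, and the probe below only asks a child Python interpreter which temp directory it would pick. In practice the command would be the `clearml-data squash` invocation from the CLI docs.

```python
import os
import subprocess
import sys


def run_with_tmpdir(cmd, tmpdir):
    """Run `cmd` in a child process whose TMPDIR points at `tmpdir`."""
    env = dict(os.environ, TMPDIR=tmpdir)
    return subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)


def child_tempdir(tmpdir):
    """Ask a child Python interpreter which temp directory it would use."""
    probe = [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"]
    return run_with_tmpdir(probe, tmpdir).stdout.strip()
```

Python's `tempfile` module checks `TMPDIR` first, then `TEMP`, then `TMP`, and skips any directory that does not exist or is not writable, so the target folder should be created beforehand.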

  
  
Posted one year ago

Yeah, the hack would work, but I’m trying to use it from the command line to put in Airflow. I’ll post on GH.

  
  
Posted one year ago

ok scratch that - you can override TMPDIR in the env. much better!

  
  
Posted one year ago

hi GrittyStarfish67
"Hi, love what you guys did with the new datasets!" Thanks 🙂 !

You can squash the datasets together: this creates a child dataset that contains its parents' data merged together. Note that the parents' data is not uploaded again: when a dataset inherits from parent datasets, it receives references to the data the parents already uploaded.
SDK: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
CLI: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_cli#squash

You can also create a new dataset and specify parent datasets using the --parents parameter. The behavior will be the same.
SDK: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
CLI: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_cli#create
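
The parents approach could look roughly like this with the SDK. This is a sketch, not a verified recipe: the function name, dataset name, project and IDs are placeholders, and it assumes a configured ClearML environment.

```python
def create_merged_dataset(name, project, parent_ids):
    """Create a dataset version whose content is inherited from `parent_ids`."""
    # Imported lazily so the sketch can be loaded without clearml installed.
    from clearml import Dataset

    merged = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=list(parent_ids),
    )
    # No new files are added here, so there should be nothing new to upload;
    # upload() is still called before closing the version.
    merged.upload()
    merged.finalize()
    return merged.id


# Usage (all values are placeholders):
# merged_id = create_merged_dataset("merged", "my_project", ["<id_a>", "<id_b>", "<id_c>"])
# from clearml import Dataset
# local_path = Dataset.get(dataset_id=merged_id).get_local_copy()
```

Requesting the single merged ID should then fetch the content of all three parents.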

  
  
Posted one year ago

GrittyStarfish67

I do not wish for data duplication. Any idea how to do this with clearml-data CLI/GUI/python?

At least in theory creating a new version with parents from multiple Datasets should just work out of the box.
wdyt?

  
  
Posted one year ago

SweetBadger76 , AgitatedDove14 , creating a dataset with parents worked very well and produced great visuals on the UI!

  
  
Posted one year ago

creating a dataset with parents worked very well and produced great visuals on the UI!

woot woot!

I tried the squash solution, however this somehow caused a download of all the datasets into my

So this actually works kind of like git squash: bottom line, it repackages the data from all the different versions into one new version. This means downloading the data from all squashed versions, then repackaging it into a single new version. Make sense?
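
For reference, a squash through the SDK might be sketched as below. The wrapper name and arguments are placeholders. Since everything is downloaded before being repackaged, the temp location needs room for the combined size of the squashed versions.

```python
def squash_datasets(name, dataset_ids, output_url=None):
    """Repackage several dataset versions into a single new version."""
    # Imported lazily so the sketch can be loaded without clearml installed.
    from clearml import Dataset

    squashed = Dataset.squash(
        dataset_name=name,
        dataset_ids=list(dataset_ids),
        output_url=output_url,  # None is assumed to mean the default storage
    )
    return squashed.id


# Usage (IDs are placeholders):
# squash_datasets("merged_squashed", ["<id_a>", "<id_b>", "<id_c>"])
```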

  
  
Posted one year ago

Super, makes sense, but can it NOT use /tmp for this? I’m merging about 100GB of files and it is quite heavy on the partition. Maybe I could set an env variable to divert it to scratch?

  
  
Posted one year ago

AgitatedDove14 I tried the squash solution, however this somehow caused a download of all the datasets into my /tmp folder, filling up the instance? I have a special drive for .clearml cache, how can I tell clearml-data to only use that?

  
  
Posted one year ago