Answered

Hi, Love What You Guys Did With The New Datasets! I Need Some Help Though. I Assume There Will Be A No-Code Way To Do This, Maybe Not Now But In The Future. But Anyway, I Have Three Different Datasets, And I Want To Create A Merged Version Of All Three Of

Hi, love what you guys did with the new datasets!
I need some help though.
I assume there will be a no-code way to do this, maybe not now but in the future. But anyway, I have three different datasets, and I want to create a merged version of all three of them, so that when the user requests a single ID she will get all three datasets downloaded at the same time. I do not want any data duplication. Any idea how to do this with the clearml-data CLI/GUI/Python?

  				
Posted 3 years ago by GrittyStarfish67


Answers 10

Hi GrittyStarfish67,
"Hi, love what you guys did with the new datasets!" Thanks 🙂!

You can squash the datasets together: this creates a child dataset that contains its parents' data merged together. Note that the parents' data will not be uploaded a second time: when a dataset inherits from parent datasets, it receives references to the data the parents already uploaded.
SDK: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
CLI: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_cli#squash

You can also create a new dataset and specify its parent datasets using the --parents parameter. The behavior will be the same.
SDK: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
CLI: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_cli#create
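
For reference, here is a minimal sketch of the parents approach through the Python SDK, assuming the `clearml` package is installed and configured. The function name `merge_datasets`, the dataset and project names, and the parent IDs are placeholders and not something from this thread.

```python
def merge_datasets(name, project, parent_ids):
    """Create a dataset version that references its parents' data."""
    # Imported inside the function so the sketch can be loaded
    # on a machine where clearml is not installed.
    from clearml import Dataset

    merged = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=list(parent_ids),  # IDs of the existing datasets
    )
    merged.upload()    # no files were added here, so nothing new should be sent
    merged.finalize()  # close the version so it can be fetched by its ID
    return merged.id


# Hypothetical usage, with placeholder IDs:
# merged_id = merge_datasets("merged", "my_project", ["<id_a>", "<id_b>", "<id_c>"])
```

Requesting the returned ID (for example with `Dataset.get(dataset_id=merged_id).get_local_copy()`) should then bring down the content of all the parents.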

  				
Posted 3 years ago by SweetBadger76

Yeah, the hack would work, but I'm trying to use it from the command line to put it in Airflow. I'll post on GH.

  				
Posted 3 years ago by GrittyStarfish67

"Yeah, the hack would work, but I'm trying to use it from the command line to put it in Airflow. I'll post on GH."

Oh, then set the TMP/TMPDIR environment variable; it should have the same effect.
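
To illustrate why the environment variable has the same effect as the in-process hack: Python's `tempfile` module reads `TMPDIR` when it first resolves its temp directory, so any child process started with that variable set will use it. The sketch below only demonstrates this with a plain Python child process; the helper name `temp_dir_seen_by_child` is made up for the example.

```python
import os
import subprocess
import sys


def temp_dir_seen_by_child(tmpdir):
    """Run a child Python process with TMPDIR set and return its temp dir."""
    # tempfile skips candidate directories that do not exist or are not writable.
    os.makedirs(tmpdir, exist_ok=True)
    env = dict(os.environ, TMPDIR=tmpdir)
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The same environment override can be applied to whatever process launches `clearml-data`, for example in the task definition of a scheduler.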

  				
Posted 3 years ago by AgitatedDove14

"creating a dataset with parents worked very well and produced great visuals on the UI!"

Woot woot!

"I tried the squash solution, however this somehow caused a download of all the datasets into my"

So this actually works kind of like git squash: bottom line, it repackages the data from all the different versions into one new version. This means downloading the data from all the squashed versions and then repackaging it into a single new version. Make sense?
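
For comparison with the parents approach, a squash call through the SDK might look like the sketch below, assuming the `clearml` package is installed and configured. The wrapper name `squash_datasets` and its arguments are placeholders. As described above, expect the data of every listed version to be downloaded locally before it is repackaged.

```python
def squash_datasets(name, dataset_ids):
    """Squash several dataset versions into one new, repackaged version."""
    # Imported inside the function so the sketch can be loaded
    # on a machine where clearml is not installed.
    from clearml import Dataset

    # Downloads the data of all listed versions, then repackages it.
    squashed = Dataset.squash(
        dataset_name=name,
        dataset_ids=list(dataset_ids),
    )
    return squashed.id
```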

  				
Posted 3 years ago by AgitatedDove14

OK, scratch that: you can override TMPDIR in the env. Much better!

  				
Posted 3 years ago by GrittyStarfish67

SweetBadger76, AgitatedDove14, creating a dataset with parents worked very well and produced great visuals on the UI!

  				
Posted 3 years ago by GrittyStarfish67

Super, makes sense, but can it NOT use /tmp for this? I'm merging about 100GB of files and it is quite heavy on the partition. Maybe I could set an env variable to divert it to scratch?

  				
Posted 3 years ago by GrittyStarfish67

"but can it NOT use /tmp for this? I'm merging about 100GB"

You mean configuring the temp folder that is used when squashing?
You can use the following hack:
```
import tempfile

tempfile.tempdir = "/my/new/temp"  # must be an existing, writable directory

# ... run the Dataset squash here ...

tempfile.tempdir = None  # let tempfile resolve its default location again
```
But regardless, I think this is worth a GitHub issue with a feature request to make the temp folder configurable.
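
A slightly safer variant of the same hack is to wrap the override in a context manager, so the previous value is restored even if the squash raises. This is only a sketch using the standard library; the name `temporary_tempdir` is made up for the example, and it assumes (as the hack above does) that the squash goes through Python's `tempfile` module.

```python
import contextlib
import tempfile


@contextlib.contextmanager
def temporary_tempdir(path):
    """Point tempfile at `path` for the duration of the with-block."""
    previous = tempfile.tempdir  # may be None if never resolved
    tempfile.tempdir = path      # path must already exist and be writable
    try:
        yield path
    finally:
        tempfile.tempdir = previous  # restore even if the block raised


# Hypothetical usage:
# with temporary_tempdir("/my/new/temp"):
#     ...  # run the Dataset squash here
```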

  				
Posted 3 years ago by AgitatedDove14

GrittyStarfish67

"I do not wish for data duplication. Any idea how to do this with clearml-data CLI/GUI/python?"

At least in theory, creating a new version with parents from multiple datasets should just work out of the box.
wdyt?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 I tried the squash solution, however this somehow caused a download of all the datasets into my /tmp folder, filling up the instance? I have a special drive for .clearml cache, how can I tell clearml-data to only use that?

  				
Posted 3 years ago by GrittyStarfish67
