GrittyStarfish67
I do not wish for data duplication. Any idea how to do this with the clearml-data CLI/GUI/Python?
At least in theory creating a new version with parents from multiple Datasets should just work out of the box.
wdyt?
but can it NOT use /tmp for this i’m merging about 100GB
You mean configuring your temp folder for squashing?
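you can hack it like this:
` from clearml import Dataset
import tempfile
tempfile.tempdir = "/my/new/temp"  # point tempfile at a bigger drive before squashing
Dataset.squash(...)  # placeholder: pass your usual squash arguments here
tempfile.tempdir = None  # back to the default temp dir lookup `
But regardless, I think this is worth a GitHub issue with a feature request to set the temp folder.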
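A slightly safer variant of the same hack, as a minimal sketch: wrap the override in a context manager so the previous value is restored even if the squash fails. `temp_dir_override` is a hypothetical helper name, not part of clearml, and it only helps if the squash really goes through Python's `tempfile` module, as assumed above.

```python
import contextlib
import tempfile


@contextlib.contextmanager
def temp_dir_override(path):
    """Temporarily point the tempfile module at `path` (hypothetical helper)."""
    previous = tempfile.tempdir
    tempfile.tempdir = path
    try:
        yield path
    finally:
        # Restore whatever was configured before, even if the body raised.
        tempfile.tempdir = previous


# Usage sketch (the path is the same placeholder as in the hack above):
# with temp_dir_override("/my/new/temp"):
#     Dataset.squash(...)
```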
Yeah, the hack would work, but I’m trying to use it from the command line to put it in Airflow. I’ll post on GH
Oh, then set the TMP/TMPDIR environment variable, it should have the same effect
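For the command-line / Airflow case, a rough sketch of what that could look like from Python: build an environment with `TMPDIR` overridden and hand it to the subprocess. The helper names are made up for illustration. The last function just checks which temp directory a child Python process resolves, it does not call clearml at all.

```python
import os
import subprocess
import sys


def env_with_tmpdir(tmp_dir):
    """Copy the current environment and point TMPDIR at `tmp_dir`."""
    env = dict(os.environ)
    env["TMPDIR"] = tmp_dir
    return env


def squash_command(name, dataset_ids):
    """Build the argument list for `clearml-data squash` (not executed here)."""
    return ["clearml-data", "squash", "--name", name, "--ids", *dataset_ids]


def tempdir_seen_by_child(tmp_dir):
    """Ask a child Python process which temp directory it resolves."""
    result = subprocess.run(
        [sys.executable, "-c", "import tempfile; print(tempfile.gettempdir())"],
        env=env_with_tmpdir(tmp_dir),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

To actually run the squash you would do something like `subprocess.run(squash_command("merged", ids), env=env_with_tmpdir("/scratch/tmp"), check=True)`, where `"merged"`, `ids` and `/scratch/tmp` are placeholders. The directory has to exist and be writable, otherwise Python falls back to its other temp dir candidates.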
SweetBadger76 , AgitatedDove14 , creating a dataset with parents worked very well and produced great visuals on the UI!
hi GrittyStarfish67
"Hi, love what you guys did with the new datasets!" Thanks 🙂 !
You can squash the datasets together: this creates a child dataset that contains its parents' data merged together. Note that there will be no duplicate upload of the parents' data: when a dataset inherits from parent datasets, it receives references to the data uploaded by the parents.
SDK: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetsquash
CLI: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_cli#squash
You can also create a new dataset and specify its parent datasets using the --parents parameter. The behavior will be the same.
SDK: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_sdk#datasetcreate
CLI: https://clear.ml/docs/latest/docs/clearml_data/clearml_data_cli#create
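Roughly, the two SDK options could look like this. It is only a sketch: the wrapper function names are invented for illustration, the `clearml` import is done lazily inside each function, and uploading/finalizing the new version is left to the caller.

```python
def squash_datasets(name, dataset_ids, output_url=None):
    """Squash several dataset versions into one new version (sketch)."""
    # Imported lazily so this file loads even where clearml is not installed.
    from clearml import Dataset

    return Dataset.squash(
        dataset_name=name,
        dataset_ids=list(dataset_ids),
        output_url=output_url,
    )


def create_child_dataset(name, project, parent_ids):
    """Create a new dataset version that inherits from several parents (sketch)."""
    from clearml import Dataset

    return Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=list(parent_ids),
    )
```

Both functions take dataset IDs as plain strings, e.g. `create_child_dataset("merged", "my_project", [id_a, id_b])`, where the names and IDs are placeholders.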
creating a dataset with parents worked very well and produced great visuals on the UI!
woot woot!
I tried the squash solution, however this somehow caused a download of all the datasets into my
So this actually works kind of like git squash: bottom line, it repackages the data from all the different versions into one new version. This means downloading the data from all squashed versions, then repackaging it into a single new version. Make sense?
AgitatedDove14 I tried the squash solution, however this somehow caused a download of all the datasets into my /tmp folder, filling up the instance? I have a special drive for .clearml cache, how can I tell clearml-data to only use that?
ok scratch that - you can override TMPDIR in the env. much better!
Super, makes sense, but can it NOT use /tmp for this? I’m merging about 100GB of files and it is quite heavy on the partition. Maybe I could set an env variable to divert it to scratch?