Hi @<1828965837906644992:profile|WackyDolphin95> , what about not connecting the new dataset to the parent? That way you would have a dataset with only the new files.
I wonder, is this style of dataset handling trying to square the circle with ClearML? Is it built for this type of stuff?
Hi @<1523701070390366208:profile|CostlyOstrich36> - Cheers for your time
I thought about that, but I think the lineage feature is really valuable.
I've opted for this as my go-to pattern now to achieve what I wanted. I literally just remove all the inherited files from the new dataset before finalizing it:
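from pathlib import Path
from tempfile import TemporaryDirectory

from clearml import Dataset

# `result` and `parent` come from earlier code not shown here
with TemporaryDirectory() as tmp:
    out = Path(tmp) / "df_clean.parquet"
    result.to_parquet(out, index=False)
    clean = Dataset.create(
        dataset_name="clean-data",
        dataset_project="example",
        parent_datasets=[parent],
        use_current_task=True,
    )
    # Drop everything inherited from the parent; the lineage link stays
    for file in clean.list_files():
        clean.remove_files(file)
    clean.add_files(out)
    # Finalize (and upload) while the temporary directory still exists
    clean.finalize(auto_upload=True)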
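
In case it helps anyone else, the remove-then-add step above could be wrapped in a small helper. This is only a sketch: `keep_only` and `DatasetLike` are names I made up, not part of ClearML, and the helper assumes nothing beyond the three methods already used in the snippet (`list_files`, `remove_files`, `add_files`).

```python
from typing import Iterable, Protocol


class DatasetLike(Protocol):
    """Hypothetical minimal interface: just the three methods used in the snippet above."""

    def list_files(self) -> Iterable[str]: ...
    def remove_files(self, dataset_path: str) -> int: ...
    def add_files(self, path: str) -> int: ...


def keep_only(dataset: DatasetLike, new_path) -> int:
    """Remove every file currently listed in the dataset, then add new_path.

    Returns the number of entries that were listed before removal.
    """
    # Snapshot the listing first so we never iterate over a changing collection
    inherited = list(dataset.list_files())
    for rel_path in inherited:
        dataset.remove_files(rel_path)
    dataset.add_files(new_path)
    return len(inherited)
```

With that in place, the loop in the snippet would become `keep_only(clean, out)`, called before `clean.finalize(auto_upload=True)` and still inside the `with` block so the parquet file exists at upload time.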