Hi all,
Just learning the ropes of ClearML atm, and am doing a really simple ETL pipeline: raw data -> clean data.
My current approach is, in one script, to add the raw data file to a dataset in the project:
# register_raw.ipynb
Hi @CostlyOstrich36 - cheers for your time.
I thought about that, but I think the lineage feature is really valuable.
I've opted for this as my go-to pattern now to achieve what I wanted: I just remove all files in the new dataset (the ones inherited from the parent) before adding the clean output and finalizing it.
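from pathlib import Path
from tempfile import TemporaryDirectory

from clearml import Dataset

# `result` (the cleaned DataFrame) and `parent` (the parent dataset) are defined earlier
with TemporaryDirectory() as tmp:
    out = Path(tmp) / "df_clean.parquet"
    result.to_parquet(out, index=False)
    clean = Dataset.create(
        dataset_name="clean-data",
        dataset_project="example",
        parent_datasets=[parent],  # keeps the lineage link to the raw data
        use_current_task=True,
    )
    # Drop every file inherited from the parent so only the clean output remains
    for file in clean.list_files():
        clean.remove_files(file)
    clean.add_files(out)
    # Finalize inside the block so the temp file still exists during upload
    clean.finalize(auto_upload=True)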
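If this pattern ends up in several scripts, the remove-then-add step could be pulled into a small helper. This is only a sketch: `replace_dataset_contents` is a made-up name, not part of ClearML, and it assumes the object passed in is a not-yet-finalized dataset exposing `list_files`, `remove_files` and `add_files` the way they are used above.

```python
from pathlib import Path
from typing import Iterable, List, Union


def replace_dataset_contents(dataset, new_files: Iterable[Union[str, Path]]) -> List[str]:
    """Remove every file currently listed in `dataset`, then add `new_files`.

    Hypothetical helper: `dataset` is assumed to be a non-finalized dataset
    object with list_files / remove_files / add_files methods.
    Returns the paths that were removed.
    """
    # Snapshot the listing first so we do not iterate while mutating
    inherited = list(dataset.list_files())
    for rel_path in inherited:
        dataset.remove_files(rel_path)
    # Register only the new outputs
    for path in new_files:
        dataset.add_files(str(path))
    return inherited
```

With the snippet above it would be called as `replace_dataset_contents(clean, [out])`, still inside the `with` block and before `clean.finalize(auto_upload=True)`. The parent link set through `parent_datasets` is untouched, so the lineage is kept.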