Answered

Hi all,

Just learning the ropes of ClearML atm, and I'm doing a really simple ETL pipeline: raw data -> clean data.

My current approach is that in one script, I add the raw data file to a dataset in the project:

register_raw.ipynb

from clearml import Dataset

ds = Dataset.create(
    dataset_name="raw",
    dataset_project="example",
)

ds.add_files(path=local_file_path)

ds.finalize(auto_upload=True)

Then, for the ETL section, I have this approach

clean_data.ipynb

from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd
from clearml import Dataset, Task

# Make task
task = Task.init(project_name="example", task_name="clean-raw")

# 1. Pull raw
raw = Dataset.get(dataset_name="raw")
raw_data = pd.read_parquet(next(Path(raw.get_local_copy()).glob("*.parquet")))

# 2. Clean data
clean_data = raw_data[...]

# 3. Save data
with TemporaryDirectory() as tmp:
    out = Path(tmp) / "cleaned_data.parquet"
    clean_data.to_parquet(out, index=False)

    clean_ds = Dataset.create(
        dataset_name="clean-data",
        dataset_project="example",
        parent_datasets=[raw],
    )
    clean_ds.add_files(out)
    clean_ds.finalize(auto_upload=True)

But it seems wrong to me to do it this way? It creates two dataset objects (I guess that makes sense), but the new dataset "clean-data" contains both the original file and the new one, which was not my intention.

Ideally, what I wanted was a pipeline that saves certain intermediate steps of the process. Is this the canonical way to achieve that?

  
  
Posted 4 months ago

Answers 3


Hi @<1828965837906644992:profile|WackyDolphin95> , what about not connecting the new dataset to the parent? That way you get a dataset with only the new files.
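
For what it's worth, that suggestion could look something like this: a minimal sketch, reusing the names from the question, where `publish_clean_dataset` is just a placeholder helper I made up to wrap the steps.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def publish_clean_dataset(clean_df, dataset_name="clean-data",
                          dataset_project="example"):
    """Register a cleaned DataFrame as a brand-new ClearML dataset.

    No parent_datasets is passed, so the dataset starts empty and ends
    up containing only the cleaned file (at the cost of the lineage link).
    """
    from clearml import Dataset  # lazy import; assumes a configured ClearML setup

    with TemporaryDirectory() as tmp:
        out = Path(tmp) / "cleaned_data.parquet"
        clean_df.to_parquet(out, index=False)

        ds = Dataset.create(
            dataset_name=dataset_name,
            dataset_project=dataset_project,
            # note: no parent_datasets here, so no files are inherited
        )
        ds.add_files(out)
        ds.finalize(auto_upload=True)
        return ds
```

The trade-off is exactly the one raised below: you lose the lineage link between "raw" and "clean-data" in the UI.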

  
  
Posted 4 months ago

I wonder, is this style of dataset handling trying to square the circle with ClearML? Is it built for this type of stuff?

  
  
Posted 4 months ago

Hi @<1523701070390366208:profile|CostlyOstrich36> - cheers for your time!

I thought about that, but I think the lineage feature is really valuable.

I've opted for this as a go-to pattern now to achieve what I wanted: I literally just remove all the inherited files from the new dataset before finalizing it.

from pathlib import Path
from tempfile import TemporaryDirectory

from clearml import Dataset

with TemporaryDirectory() as tmp:
    out = Path(tmp) / "df_clean.parquet"
    result.to_parquet(out, index=False)

    clean = Dataset.create(
        dataset_name="clean-data",
        dataset_project="example",
        parent_datasets=[parent],
        use_current_task=True,
    )

    # Drop every file inherited from the parent so only the new file remains
    for file in clean.list_files():
        clean.remove_files(file)

    clean.add_files(out)

    clean.finalize(auto_upload=True)
  
  
Posted 4 months ago