Answered

Hi all,

Just learning the ropes of ClearML atm, and I'm doing a really simple ETL pipeline: raw data -> clean data.

My current approach is that in one script, I add the raw data file to a dataset in the project:

register_raw.ipynb

from clearml import Dataset

ds = Dataset.create(
    dataset_name="raw",
    dataset_project="example",
)

ds.add_files(path=local_file_path)

ds.finalize(auto_upload=True)

Then, for the ETL section, I have this approach

clean_data.ipynb

from pathlib import Path
from tempfile import TemporaryDirectory

import pandas as pd
from clearml import Dataset, Task

# Make task
task = Task.init(project_name="example", task_name="clean-raw")

# 1. Pull raw
raw = Dataset.get(dataset_name="raw")
raw_data = pd.read_parquet(next(Path(raw.get_local_copy()).glob("*.parquet")))

# 2. Clean data
clean_data = raw_data[...]

# 3. Save data
with TemporaryDirectory() as tmp:
    out = Path(tmp) / "cleaned_data.parquet"
    clean_data.to_parquet(out, index=False)

    clean_ds = Dataset.create(
        dataset_name="clean-data",
        dataset_project="example",
        parent_datasets=[raw],
    )
    clean_ds.add_files(out)
    clean_ds.finalize(auto_upload=True)

But it seems wrong to me to do it this way? It creates two dataset objects (I guess that makes sense), but the new dataset "clean-data" contains both the original file and the new one, which was not my intention.

Ideally, what I wanted was a pipeline that saves certain intermediate steps of the process. Is this the canonical way to achieve that?

  
  
Posted 4 months ago

Answers 3


Hi @<1828965837906644992:profile|WackyDolphin95> , what about not connecting the new dataset to the parent? That way you get a dataset with only the new files.
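
For what it's worth, that suggestion could look something like this: a minimal sketch, reusing the names from the question, where `publish_clean_dataset` is just a placeholder helper I made up to wrap the steps.

```python
from pathlib import Path
from tempfile import TemporaryDirectory


def publish_clean_dataset(clean_df, dataset_name="clean-data",
                          dataset_project="example"):
    """Register a cleaned DataFrame as a brand-new ClearML dataset.

    No parent_datasets is passed, so the dataset starts empty and ends
    up containing only the cleaned file (at the cost of the lineage link).
    """
    from clearml import Dataset  # lazy import; assumes a configured ClearML setup

    with TemporaryDirectory() as tmp:
        out = Path(tmp) / "cleaned_data.parquet"
        clean_df.to_parquet(out, index=False)

        ds = Dataset.create(
            dataset_name=dataset_name,
            dataset_project=dataset_project,
            # note: no parent_datasets here, so no files are inherited
        )
        ds.add_files(out)
        ds.finalize(auto_upload=True)
        return ds
```

The trade-off is exactly the one raised below: you lose the lineage link between "raw" and "clean-data" in the UI.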

  
  
Posted 4 months ago

I wonder, is this style of dataset handling trying to square the circle with ClearML? Is it built for this type of stuff?

  
  
Posted 4 months ago

Hi @<1523701070390366208:profile|CostlyOstrich36> - cheers for your time!

I thought about that, but I think the lineage feature is really valuable.

I've opted for this as a go-to pattern now to achieve what I wanted: I literally just remove all the inherited files from the new dataset before finalizing it.

from pathlib import Path
from tempfile import TemporaryDirectory

from clearml import Dataset

with TemporaryDirectory() as tmp:
    out = Path(tmp) / "df_clean.parquet"
    result.to_parquet(out, index=False)

    clean = Dataset.create(
        dataset_name="clean-data",
        dataset_project="example",
        parent_datasets=[parent],
        use_current_task=True,
    )

    # Drop every file inherited from the parent so only the new file remains
    for file in clean.list_files():
        clean.remove_files(file)

    clean.add_files(out)

    clean.finalize(auto_upload=True)
  
  
Posted 4 months ago