Would it be possible to change the dataset.add_files call to some function that moves your files to a common folder (local or cloud), and then use the last step in the DAG to create the dataset from that folder?
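Something along these lines (a rough sketch only; the shared folder path and function names are placeholders I made up):
```python
import shutil
from pathlib import Path

from clearml import Dataset

SHARED_DIR = Path("/mnt/shared/dataset_staging")  # hypothetical common folder

def transform_step(local_files):
    # each parallel DAG task only copies its output files into the shared folder
    SHARED_DIR.mkdir(parents=True, exist_ok=True)
    for f in local_files:
        shutil.copy2(f, SHARED_DIR / Path(f).name)

def create_dataset_step():
    # last task in the DAG: build a single dataset version from that folder
    ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
    ds.add_files(path=str(SHARED_DIR))
    ds.upload()
    ds.finalize()
```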
Hi MagnificentWorm7 ,
I'm not sure I understand. You're trying to upload files to a dataset from different concurrent processes?
MagnificentWorm7 , I'm taking a look to see if it's possible 🙂
As a workaround - I think you could split the dataset into different versions and then use Dataset.squash
to merge them into a single dataset
https://clear.ml/docs/latest/docs/references/sdk/dataset#datasetsquash
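Roughly like this (a sketch only; the dataset names and IDs below are placeholders):
```python
from clearml import Dataset

# each concurrent process creates, uploads and finalizes its own dataset version
part = Dataset.create(dataset_name="my_dataset_part_001", dataset_project="my_project")
part.add_files(path="/data/part_001")
part.upload()
part.finalize()

# a final step then squashes all the per-process datasets into one
merged = Dataset.squash(
    dataset_name="my_dataset_merged",
    dataset_ids=["<part_001_id>", "<part_002_id>", "<part_003_id>"],
)
```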
AbruptCow41 , can you please elaborate? You want to move files around to some common folder and then, at the end, just create the dataset from that folder?
CostlyOstrich36
I'm taking a look to see if it's possible
Thank you for the response. Dataset.squash
works fine. But the squash function downloads all the datasets before squashing, so it's not suitable for me because the dataset size is huge. I'll try uploading everything at once. BTW, is this a bug, or did I do something wrong?
AbruptCow41 Yes, it's possible to do that, but I wanted to upload in parallel if I can, and I'm wondering whether this is a bug.
I'm suggesting MagnificentWorm7 do that, yes, instead of adding the files to a ClearML dataset in each step
Even though I uploaded files named 001 to 010, only 004, 005, and 010 exist on the fileserver.
AbruptCow41 , you can already do this, just add the entire folder 🙂
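For reference, a minimal example (project and folder names made up), since add_files() accepts a folder and adds it recursively:
```python
from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files(path="/mnt/shared/dataset_staging")  # a folder is added recursively
ds.upload()
ds.finalize()
```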
Is it possible that it's creating separate datasets? Can you post logs of both processes?
That's why I'm suggesting he do that 🙂
I'm not sure how Airflow workers run. What I'm trying to do is upload "different files" to "one ClearML dataset" in parallel. My DAG looks like the one below; each task in "transform_group" executes ClearML dataset operations. Sorry for my bad explanation.
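Roughly this pattern (a simplified sketch, not the actual DAG code; the dataset ID and file paths are placeholders):
```python
from clearml import Dataset

DATASET_ID = "<shared_dataset_id>"  # one not-yet-finalized dataset all tasks write to

def transform_task(part_id: int):
    # every parallel Airflow task attaches to the same dataset version
    ds = Dataset.get(dataset_id=DATASET_ID)
    ds.add_files(path=f"/data/file_{part_id:03d}")
    ds.upload()  # concurrent uploads to one version -> some files end up missing

def finalize_task():
    # last task in the DAG closes the dataset version
    Dataset.get(dataset_id=DATASET_ID).finalize()
```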