Hi MagnificentWorm7,
I'm not sure I understand. You're trying to upload files to a dataset from different concurrent processes?
Would it be possible to change the dataset.add_files call to some function that moves your files to a common folder (local or cloud), and then use the last step in the DAG to create the dataset from that folder?
AbruptCow41 , can you please elaborate? You want to move around files to some common folder and then at the end just create the dataset using that folder?
I'm suggesting MagnificentWorm7 do that, yes, instead of adding the files to a ClearML dataset in each step
That's why I'm suggesting he do that 🙂
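For reference, a minimal sketch of that suggestion (the staging folder path, project, and dataset names here are made up, not from the thread): each parallel task just drops its output into a shared folder, and only the last DAG step touches ClearML:

```python
from clearml import Dataset

# Hypothetical shared staging folder that every parallel task writes to
STAGING_DIR = "/mnt/shared/staging"  # could also be a cloud bucket path

def create_dataset_final_step():
    # Last DAG step: build a single dataset from everything in the folder
    ds = Dataset.create(
        dataset_name="my_dataset",      # assumed name
        dataset_project="my_project",   # assumed project
    )
    ds.add_files(path=STAGING_DIR)  # add_files accepts an entire folder
    ds.upload()
    ds.finalize()
```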
CostlyOstrich36
I'm taking a look to see if it's possible
Thank you for the response. Dataset.squash works fine, but the squash function downloads all the datasets before squashing, so it doesn't suit my case because the dataset size is huge. I'll try uploading everything at once. BTW, is this a bug, or did I do something wrong?
AbruptCow41 Yes, it's possible to do that, but I wanted to upload in parallel if I can, and I'm wondering whether this is a bug.
I'm not sure how Airflow workers run. What I'm trying to do is upload "different files" to "one ClearML dataset" in parallel. My DAG looks like the one below; each task in "transform_group" executes the ClearML dataset operations. Sorry for my bad explanation
Even though I uploaded files named 001 to 010, only 004, 005, and 010 exist on the fileserver.
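Presumably each task in the group is doing something along these lines (names and structure are a guess, not MagnificentWorm7's actual code); with several tasks appending to the same un-finalized dataset version concurrently, their writes can clobber each other, which would match the missing-files symptom:

```python
from clearml import Dataset

def upload_part(file_path: str):
    # Every parallel Airflow task fetches the same (not yet finalized) dataset...
    ds = Dataset.get(
        dataset_project="my_project",   # assumed project
        dataset_name="my_dataset",      # assumed name
    )
    # ...and appends its own file to it at the same time as the other tasks
    ds.add_files(path=file_path)
    ds.upload()
```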
AbruptCow41, you can already do this, just add the entire folder 🙂
MagnificentWorm7, I'm taking a look to see if it's possible 🙂
As a workaround, I think you could split the dataset into different versions and then use Dataset.squash to merge them into a single dataset:
https://clear.ml/docs/latest/docs/references/sdk/dataset#datasetsquash
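A rough sketch of that workaround (dataset and project names are placeholders): each parallel task creates and finalizes its own small dataset, and the final DAG step squashes them into one:

```python
from clearml import Dataset

# In each parallel task: create an independent dataset version
def upload_part(file_path: str, part_name: str) -> str:
    part = Dataset.create(
        dataset_name=part_name,         # e.g. "my_dataset_part_001"
        dataset_project="my_project",   # assumed project
    )
    part.add_files(path=file_path)
    part.upload()
    part.finalize()
    return part.id

# In the final DAG step: merge all the parts into one dataset.
# Note: squash downloads the parts locally first, which is the
# size concern MagnificentWorm7 raised above.
def merge_parts(part_ids: list) -> str:
    merged = Dataset.squash(
        dataset_name="my_dataset_merged",
        dataset_ids=part_ids,
    )
    return merged.id
```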
Is it possible that it's creating separate datasets? Can you post logs of both processes?