Answered
Hi, I'm Trying to Upload Data to ClearML in Parallel. Is It Impossible to Use

Hi, I'm trying to upload data to ClearML in parallel. Is it impossible to call dataset.upload() on a single target dataset from several Python clients at the same time (multiprocessing or threads)? When I try, all upload tasks finish normally with no errors, but some files are missing on the server in the end.
```python
# my example code

# Create the dataset
dataset = Dataset.create(
    dataset_name="test", dataset_project="test_project"
)

# Client A: upload files A, B, C to the dataset named "test"
dataset = Dataset.get(dataset_name="test", dataset_project="test_project")
dataset.add_files(
    "/path/A", local_base_folder="/path"
)
dataset.upload()

# Client B: upload files D, E, F to the dataset named "test"
# ... same steps as client A ...

dataset.finalize()
```

Result on the web UI (example): the dataset contains only A, C, and E (B, D, and F are missing). It's not just a webserver display issue; the files are also missing after I download the dataset with Dataset.get(...).get_mutable_local_copy().
Is there any problem with my usage? Thanks.
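For context, here is one plausible lost-update failure mode, sketched with a local JSON file standing in for the dataset's server-side file list. This is only an illustration, not ClearML's actual implementation: if both clients fetch the dataset state before either has uploaded, the last upload can overwrite the first one's entries.

```python
import json
import os
import tempfile

# A local JSON file stands in for the dataset's server-side state.
state_file = os.path.join(tempfile.mkdtemp(), "dataset_state.json")
with open(state_file, "w") as f:
    json.dump({"files": []}, f)

def get_state():
    """Stand-in for Dataset.get(): fetch the current file list."""
    with open(state_file) as f:
        return json.load(f)

def upload(state):
    """Stand-in for dataset.upload(): write the whole file list back."""
    with open(state_file, "w") as f:
        json.dump(state, f)

# Client A and client B each fetch the dataset before either uploads.
state_a = get_state()
state_b = get_state()

state_a["files"].append("A")
upload(state_a)          # server now lists ["A"]

state_b["files"].append("D")
upload(state_b)          # last write wins: server now lists ["D"], "A" is lost

print(get_state()["files"])   # prints ['D']
```

If the dataset state is read-modify-written like this, concurrent clients can silently drop each other's files even though every upload "succeeds".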

  
  
Posted 2 years ago

Answers 12


Would it be possible to change the dataset.add_files step to some function that moves your files to a common folder (local or cloud), and then have the last step in the DAG create the dataset from that folder?
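For what it's worth, a minimal sketch of that staging approach (the folder layout and the stage_file helper are made up for illustration; the ClearML calls at the end are commented out and assume the usual Dataset API):

```python
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def stage_file(src_path, staging_dir):
    """Copy one file into the shared staging folder; safe to call concurrently,
    since each worker only writes its own file."""
    dst = os.path.join(staging_dir, os.path.basename(src_path))
    shutil.copy(src_path, dst)
    return dst

staging_dir = tempfile.mkdtemp(prefix="dataset_staging_")

# Simulate the files produced by three separate DAG tasks.
src_dir = tempfile.mkdtemp(prefix="task_outputs_")
sources = []
for name in ("A.txt", "B.txt", "C.txt"):
    path = os.path.join(src_dir, name)
    with open(path, "w") as f:
        f.write(name)
    sources.append(path)

# Concurrent staging: no shared state to race on.
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(lambda p: stage_file(p, staging_dir), sources))

staged = sorted(os.listdir(staging_dir))
print(staged)

# Final DAG step (hypothetical, runs exactly once, no concurrency):
# from clearml import Dataset
# dataset = Dataset.create(dataset_name="test", dataset_project="test_project")
# dataset.add_files(staging_dir)
# dataset.upload()
# dataset.finalize()
```

The point is that only the filesystem copy runs in parallel; the dataset itself is created and uploaded exactly once.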

  
  
Posted 2 years ago

Hi MagnificentWorm7 ,

I'm not sure I understand. You're trying to upload files to a dataset from different concurrent processes?

  
  
Posted 2 years ago

MagnificentWorm7 , I'm taking a look to see if it's possible 🙂
As a workaround, I think you could split the dataset into different versions and then use Dataset.squash to merge them into a single dataset:
https://clear.ml/docs/latest/docs/references/sdk/dataset#datasetsquash
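A rough sketch of that split-and-squash workaround (the part dataset names are made up, and the Dataset.squash parameters are from my reading of the SDK reference, so double-check them against your SDK version):

```python
from clearml import Dataset

# --- each worker creates and finalizes its own dataset version ---
ds_a = Dataset.create(dataset_name="test_part_a", dataset_project="test_project")
ds_a.add_files("/path/A", local_base_folder="/path")
ds_a.upload()
ds_a.finalize()

# (worker B does the same with its own files, e.g. as "test_part_b")

# --- one final step merges the parts into a single dataset ---
merged = Dataset.squash(
    dataset_name="test",
    dataset_project_name_pairs=[
        ("test_project", "test_part_a"),
        ("test_project", "test_part_b"),
    ],
)
```

Note that squash merges by pulling local copies of the parts, which can be costly for large datasets.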

  
  
Posted 2 years ago

CostlyOstrich36

I'm taking a look if it's possible

Thank you for the response. Dataset.squash works fine, but the squash function downloads all the datasets before merging, so I don't think it's suitable for me because the dataset size is huge. I'll try uploading everything at once. BTW, is this a bug, or did I do something wrong?
AbruptCow41 Yes, it's possible to do so, but I wanted to upload in parallel if I can, and I wonder whether this is a bug.

  
  
Posted 2 years ago

Yes, I'm suggesting MagnificentWorm7 do that, instead of adding the files to a ClearML dataset in each step.

  
  
Posted 2 years ago

Even when I uploaded files named 001 to 010, only 004, 005, and 010 exist on the fileserver.

  
  
Posted 2 years ago

AbruptCow41 , you can already do this; just add the entire folder 🙂

  
  
Posted 2 years ago

Is it possible that it's creating separate datasets? Can you post logs of both processes?

  
  
Posted 2 years ago

I'm not sure how Airflow workers run. What I'm trying to do is upload "different files" to "one ClearML dataset" in parallel. My DAG looks like the one below; each task from "transform_group" executes ClearML-related dataset tasks. Sorry for my bad explanation.

  
  
Posted 2 years ago

That's why I'm suggesting he do that 🙂

  
  
Posted 2 years ago

AbruptCow41 , can you please elaborate? You want to move the files to some common folder and then, at the end, just create the dataset from that folder?

  
  
Posted 2 years ago