Answered

Hi, I'm Trying To Upload Data To ClearML Parallelly. Is It Impossible To Use

Hi, I'm trying to upload data to ClearML in parallel. Is it impossible to call dataset.upload() on one target dataset at the same time from several Python clients (multiple processes or threads)? When I try, every upload task finishes normally with no errors, but some files are missing on the server in the end.
` # my example code

from clearml import Dataset

# Create dataset
dataset = Dataset.create(
    dataset_name="test", dataset_project="test_project"
)

# client A: upload file A, B, C to dataset name "test"
dataset = Dataset.get(dataset_name="test", dataset_project="test_project")
dataset.add_files(
    "/path/A", local_base_folder="/path"
)
dataset.upload()

# client B: upload file D, E, F to dataset name "test"
...
# Same with client A
...

dataset.finalize()

# Result on web (example)
# Dataset content : only A, C, E exist (missing B, D, F)
`
This is not just a web server display error: the files are also missing after I download the dataset with Dataset.get followed by get_mutable_local_copy().
Is there any problem with my usage? Thanks.

  				
Posted 3 years ago by MagnificentHamster7

Answers 12

Hi MagnificentWorm7,

I'm not sure I understand. You're trying to upload files to a dataset from different concurrent processes?

  				
Posted 3 years ago by CostlyOstrich36

Even when I uploaded files named 001 to 010, only 004, 005, and 010 exist on the fileserver.

  				
Posted 3 years ago by MagnificentHamster7

  				
AbruptCow41, you can already do this, just add the entire folder 🙂

  				
Posted 3 years ago by CostlyOstrich36

CostlyOstrich36

I'm taking a look if it's possible

Thank you for the response. Dataset.squash works fine, but squash downloads all the datasets before merging them, so I don't think it suits my case because the dataset is huge. I'll try uploading everything at once. BTW, is this a bug, or did I do something wrong?
AbruptCow41 Yes, that's possible, but I wanted to upload in parallel if I can, and I'm wondering whether this is a kind of bug.

  				
Posted 3 years ago by MagnificentHamster7

Is it possible that it's creating separate datasets? Can you post logs of both processes?
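
One way to gather that information is sketched below, under assumptions: each process prints the ID of the dataset it resolved and the expected files that are absent from the dataset listing. The helper names `find_missing` and `report` are hypothetical, and the sketch relies on `Dataset.get`, the `id` property, and `Dataset.list_files()` from the ClearML SDK. It only diagnoses the symptom; it does not establish the cause.

```python
from typing import Iterable, List


def find_missing(expected: Iterable[str], listed: Iterable[str]) -> List[str]:
    """Return the expected relative paths that are absent from the listing, sorted."""
    listed_set = set(listed)
    return sorted(p for p in set(expected) if p not in listed_set)


def report(dataset_name: str, dataset_project: str, expected: Iterable[str]) -> List[str]:
    """Print the resolved dataset ID and the expected files it does not contain."""
    # Imported lazily so find_missing() can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_name=dataset_name, dataset_project=dataset_project)
    # If two processes print different IDs, they are writing to separate datasets.
    print("dataset id:", dataset.id)
    missing = find_missing(expected, dataset.list_files())
    print("missing files:", missing)
    return missing
```

Calling `report("test", "test_project", expected)` in each client after its upload would show whether both clients resolve to the same dataset ID and which files each one fails to see.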

  				
Posted 3 years ago by CostlyOstrich36

Would it be possible to replace the dataset.add_files call with a function that moves your files to a common folder (local or cloud), and then use the last step in the DAG to create the dataset from that folder?
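
A minimal sketch of that idea, assuming a local staging folder and unique file names across tasks: the parallel steps only copy files, and a single final step talks to ClearML. The function names `stage_files` and `create_dataset_from_folder` and the `workers` parameter are hypothetical; the ClearML calls are the same ones used in the question.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List


def stage_files(sources: Iterable[Path], staging_dir: Path, workers: int = 4) -> List[Path]:
    """Copy files into one staging folder in parallel and return the staged paths."""
    staging_dir.mkdir(parents=True, exist_ok=True)

    def copy_one(src: Path) -> Path:
        dst = staging_dir / src.name  # assumes file names are unique across tasks
        shutil.copy2(src, dst)
        return dst

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(copy_one, sources))


def create_dataset_from_folder(staging_dir: Path, name: str, project: str) -> str:
    """Run once, in the final DAG step, so only one client writes to the dataset."""
    from clearml import Dataset  # lazy import: staging works without clearml

    dataset = Dataset.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=str(staging_dir))
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

The trade-off is extra disk space and copy time for the staging folder in exchange for having exactly one uploader.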

  				
Posted 3 years ago by AbruptCow41

Yes, I'm suggesting that MagnificentWorm7 do that, instead of adding the files to a ClearML dataset in each step.

  				
Posted 3 years ago by AbruptCow41

AbruptCow41, can you please elaborate? You want to move files around to some common folder and then at the end just create the dataset using that folder?

  				
Posted 3 years ago by CostlyOstrich36

That's why I'm suggesting that he do that 🙂

  				
Posted 3 years ago by AbruptCow41

I'm not sure how Airflow workers run. What I'm trying to do is upload "different files" to "one clearml-dataset" in parallel. My DAG looks like the one below; each task in "transform_group" executes ClearML-related dataset tasks. Sorry for my bad explanation.

  				
Posted 3 years ago by MagnificentHamster7

MagnificentWorm7, I'm taking a look if it's possible 🙂
As a workaround, I think you could split the dataset into different versions and then use Dataset.squash to merge them into a single dataset
https://clear.ml/docs/latest/docs/references/sdk/dataset#datasetsquash
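
The workaround could be sketched roughly as follows, assuming each parallel task uploads to its own dataset and one final step squashes them. The helpers `shard`, `upload_part`, and `merge_parts` and the part naming are hypothetical; `Dataset.squash` is called with the `dataset_name` and `dataset_ids` parameters described in the linked reference.

```python
from typing import List, Sequence


def shard(files: Sequence[str], n_workers: int) -> List[List[str]]:
    """Split files into n_workers disjoint round-robin groups, one per parallel task."""
    return [list(files[i::n_workers]) for i in range(n_workers)]


def upload_part(files: Sequence[str], part_name: str, project: str) -> str:
    """Upload one group of files to its own dataset and return that dataset's ID."""
    # Each parallel task owns its dataset, so no two clients write to the same one.
    from clearml import Dataset  # lazy import: shard() works without clearml

    part = Dataset.create(dataset_name=part_name, dataset_project=project)
    for f in files:
        part.add_files(path=f)
    part.upload()
    part.finalize()
    return part.id


def merge_parts(part_ids: Sequence[str], final_name: str) -> str:
    """Squash the per-task datasets into a single dataset and return its ID."""
    from clearml import Dataset

    merged = Dataset.squash(dataset_name=final_name, dataset_ids=list(part_ids))
    return merged.id
```

For example, `shard(["A", "B", "C", "D", "E", "F"], 2)` gives one group per client, each client calls `upload_part` on its group, and the last DAG step passes the returned IDs to `merge_parts`.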

  				
Posted 3 years ago by CostlyOstrich36
