Two Questions About Datasets: Question 1: Are Parallel Writes To A Dataset With The Same Version Possible? Is The Way To Go, To Have A Task, Which Creates A Dataset Object, Which In Turn Is Passed As Artifact To The Subsequent Ingestion Tasks? After The P

Answered

Two questions about datasets:
Question 1: Are parallel writes to a dataset with the same version possible? Is the way to go to have a task that creates a dataset object, which in turn is passed as an artifact to the subsequent ingestion tasks? After the parallel ingestion, is it possible to finalize the dataset creation in a follow-up task? Is that the way to go?
Question 2: Suppose a dataset has been created, files have been added, and the dataset has been finalized. What's the recommended way to append to the dataset in a future version? Should I call Dataset.get(...).get_local_copy(), then create a new dataset, add the files of the local copy plus the new files to it, and finalize it? Or should I add the new files to the directory where the files of the dataset have been copied to, and call sync? I guess in that case I would have to call get_mutable_local_copy(). In the second case, I guess, only references are passed for the old files, whereas in the first scenario all files would be added as files, which might blow up the storage. Or should I add child datasets as proposed in (urbansound_sample) and (MNIST sample)?

  				
Posted one year ago by SaltySpider22


Answers 6

Hi @<1661542579272945664:profile|SaltySpider22>

question 1: are parallel writes to a dataset with the same version possible?

When you say parallel, what do you mean? From multiple machines?

What's the recommended way to append the dataset in a future version?

Once a dataset has been finalized, the only way to add files is to create another version that inherits from the previous one (i.e. the finalized version becomes the parent of the new version).
If you are worried about accumulating multiple versions, just like in git you have squash 🙂

passing Dataset artifacts between tasks seems to be not possible,

The correct way would be to pass the Dataset ID; the other task would then simply get it with Dataset.get.
No need to worry about re-downloading, everything is automatically cached.
Make sense?
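As a rough illustration of the two points above (a child version whose parent is the finalized dataset, and passing only the ID between tasks), here is a minimal sketch. The function names, the `dataset_cls` parameter, and all argument values are hypothetical; `dataset_cls` exists only so the logic can be tried without a ClearML server, and by default the real `clearml.Dataset` class is used.

```python
def append_version(parent_id, new_files_dir, name, project, dataset_cls=None):
    """Create a child version of a finalized dataset and add new files to it."""
    if dataset_cls is None:
        # Imported lazily so the sketch can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls

    child = dataset_cls.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],  # the finalized version becomes the parent
    )
    child.add_files(path=new_files_dir)  # only the new files are added here
    child.upload()
    child.finalize()
    return child.id  # pass this ID (a plain string) to other tasks


def fetch_in_other_task(dataset_id, dataset_cls=None):
    """Resolve a dataset from its ID inside another task and get a local copy."""
    if dataset_cls is None:
        from clearml import Dataset as dataset_cls
    return dataset_cls.get(dataset_id=dataset_id).get_local_copy()
```

The consuming task only needs the string returned by `append_version`; it never receives the Dataset object itself.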

  				
Posted one year ago by AgitatedDove14

Hi @<1661542579272945664:profile|SaltySpider22> I'm not sure I understand the answer to my parallel question

  				
Posted one year ago by AgitatedDove14

yes, or (because I deployed clearml using helm in kubernetes) from the same machine, but multiple pods (tasks).

Oh, now I see. Long story short, no 😞 The correct way of doing that is for every node/pod to create its own dataset.
Then, when you are done, you create a new version with the X datasets that you created as parents. The newly created version is just "meta": it basically tells the system how to combine the previously generated datasets (i.e. no data is actually re-uploaded).
The version tree should look something like this:

 [x]
  |
+-+--+---+
|    |   |
[a] [b] [c]
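
The tree above could be produced along these lines. This is only a sketch under assumptions: `create_shard`, `merge_shards`, and the `dataset_cls` parameter are hypothetical names (the latter just allows a dry run without a server), and whether `upload()` is strictly required on the merged version when no files are added has not been verified here.

```python
def create_shard(shard_name, project, files_dir, dataset_cls=None):
    """Run once per node/pod: each worker builds and finalizes its own dataset."""
    if dataset_cls is None:
        # Imported lazily so the sketch can be exercised without ClearML installed.
        from clearml import Dataset as dataset_cls
    shard = dataset_cls.create(dataset_name=shard_name, dataset_project=project)
    shard.add_files(path=files_dir)
    shard.upload()
    shard.finalize()
    return shard.id  # report this ID back to the follow-up task


def merge_shards(name, project, shard_ids, dataset_cls=None):
    """Run once at the end: create the "meta" version [x] with the shards as parents."""
    if dataset_cls is None:
        from clearml import Dataset as dataset_cls
    merged = dataset_cls.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=list(shard_ids),  # [a], [b], [c] in the diagram
    )
    # No add_files call: the merged version only references its parents.
    merged.upload()  # called defensively (assumption, see the note above)
    merged.finalize()
    return merged.id
```

Each pod would call `create_shard` independently, and a single follow-up task would collect the returned IDs and call `merge_shards`.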

  				
Posted one year ago by AgitatedDove14

Regarding question 1:
Passing Dataset artifacts between tasks does not seem to be possible; I get the following error message:

TypeError: cannot pickle '_thread.lock' object.

So I guess it's not possible to upload files to the dataset from different tasks in parallel before finalizing it.
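
For what it's worth, this kind of error can be reproduced with the standard library alone. The sketch below assumes (without having checked the ClearML source) that the Dataset object holds a thread lock somewhere internally; `HandleWithLock` and the ID value are hypothetical stand-ins. It shows that such an object cannot be pickled, while its plain string ID can.

```python
import pickle
import threading


class HandleWithLock:
    """Hypothetical stand-in for an object that keeps a thread lock internally."""

    def __init__(self, dataset_id):
        self.id = dataset_id
        self._lock = threading.Lock()


handle = HandleWithLock("abc123")  # placeholder ID, not a real dataset

try:
    pickle.dumps(handle)
    pickled_ok = True
    error_message = ""
except TypeError as err:
    # Locks are not picklable, so the whole object fails to serialize.
    pickled_ok = False
    error_message = str(err)

# The plain string ID serializes and restores without trouble.
restored_id = pickle.loads(pickle.dumps(handle.id))
print(pickled_ok, error_message, restored_id)
```

This is why handing over only the ID, rather than the object, avoids the serialization problem.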

  				
Posted one year ago by SaltySpider22

Hey @<1523701205467926528:profile|AgitatedDove14> ,
Sorry, I am quite new to Slack... I forgot to submit my edits to the answer...

When you say parallel, what do you mean? From multiple machines?

yes, or (because I deployed clearml using helm in kubernetes) from the same machine, but multiple pods (tasks).

Once a dataset has been finalized, the only way to add files is to create another version that inherits from the previous one (i.e. the finalized version becomes the parent of the new version).
If you are worried about accumulating multiple versions, just like in git you have squash

Okay, great. Thank you so much!

The correct way would be to pass the Dataset ID; the other task would then simply get it with Dataset.get.
No need to worry about re-downloading, everything is automatically cached.

Sounds good, thanks for the clarification.

  				
Posted one year ago by SaltySpider22

@<1523701205467926528:profile|AgitatedDove14>

When you say parallel, what do you mean? From multiple machines?

  				
Posted one year ago by SaltySpider22
