Very high latency when finalizing a dataset in a big dataset version hierarchy

Answered

Hey everyone,

As a Pro-tier SaaS user, I'm experiencing very high latency when finalizing a dataset. It is part of a big dataset version hierarchy, and recently the finalize() call has started taking ~10 minutes to complete. Could some big recursive diff operation be taking all that time?

Here's a quick overview of the code:

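import clearml

# `constants` is the poster's own module (not shown here).

# Fetch the latest version, creating the dataset if it does not exist yet.
last_dataset = clearml.Dataset.get(
    dataset_project='MyProject',
    dataset_name='DatasetPreTraining',
    auto_create=True
)

# Reuse a version that is still open; otherwise create a child version.
if not last_dataset.is_final():
    dataset = last_dataset
else:
    dataset = clearml.Dataset.create(
        dataset_project='MyProject',
        dataset_name='DatasetPreTraining',
        parent_datasets=[last_dataset.id],
    )

dataset.add_files(constants.TRANSFORMED_DATA_FILE)

dataset.upload()
dataset.finalize()  # this is the call that takes ~10 minutes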

  				
Posted one year ago by FierceHamster54


Answers 7

Hi FierceHamster54, how big is the version hierarchy? Can you provide some details on its structure? Also, how many files are in the dataset and what are their sizes?

  				
Posted one year ago by SuccessfulKoala55

Hey SuccessfulKoala55, this is a fairly small dataset with a linear hierarchy of ~300 versions and a size of ~2 GB.

  				
Posted one year ago by FierceHamster54

Hi FierceHamster54! Looks like we pull all the ancestors of a dataset when we finalize. I think this can be optimized. We will keep you posted when we make some improvements.
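
To give a rough feel for why a long chain hurts here, the snippet below is a toy model only, not ClearML's actual implementation. It assumes a hypothetical `parents` mapping from each version id to its parent ids and simply counts how many ancestors have to be visited for the newest version of a linear chain of 300 versions.

```python
from collections import deque


def collect_ancestors(dataset_id, parents):
    """Return every ancestor of dataset_id, nearest first.

    `parents` is a hypothetical dict: version id -> list of parent ids.
    """
    found = []
    visited = set()
    queue = deque(parents.get(dataset_id, []))
    while queue:
        current = queue.popleft()
        if current in visited:
            continue  # guard against revisiting shared ancestors
        visited.add(current)
        found.append(current)
        queue.extend(parents.get(current, []))
    return found


# Linear hierarchy of 300 versions: v0 <- v1 <- ... <- v299
chain = {"v0": []}
for i in range(1, 300):
    chain[f"v{i}"] = [f"v{i - 1}"]

ancestors = collect_ancestors("v299", chain)
print(len(ancestors))
```

If every visited ancestor costs one request to the server, the work per finalize grows linearly with the chain length, so each new version is a bit slower to finalize than the previous one.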

  				
Posted one year ago by SmugDolphin23

Thanks a lot SmugDolphin23 ❤

  				
Posted one year ago by FierceHamster54

In the meantime, is there some way to set a retention policy for the dataset versions?

  				
Posted one year ago by FierceHamster54

Or do I have to add a pipeline step to prune ancestors that are too old?

  				
Posted one year ago by FierceHamster54

Pruning old ancestors sounds like the right move for now.
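
As a minimal sketch of what such a pruning step could decide, the helper below picks which versions fall outside a "keep the newest N" retention window. The function name `select_versions_to_prune`, the `keep_last` parameter, and the record layout (dicts with `id` and `created`) are assumptions made for illustration; the snippet does not call ClearML at all.

```python
from datetime import datetime, timedelta


def select_versions_to_prune(versions, keep_last):
    """Return ids of all versions except the newest `keep_last`, oldest first.

    `versions` is a list of dicts with an "id" and a "created" datetime
    (an assumed layout, adapt it to whatever your listing call returns).
    """
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = sorted(versions, key=lambda v: v["created"])
    cutoff = max(len(ordered) - keep_last, 0)
    return [v["id"] for v in ordered[:cutoff]]


# Example: five daily versions, keep only the two most recent.
start = datetime(2023, 1, 1)
versions = [
    {"id": f"v{i}", "created": start + timedelta(days=i)} for i in range(5)
]
to_prune = select_versions_to_prune(versions, keep_last=2)
print(to_prune)
```

In a real pipeline step the records would presumably come from something like `clearml.Dataset.list_datasets(...)`, and the selected ids would then be removed with `clearml.Dataset.delete(...)`. Keep in mind that child versions depend on their parents' files, so it may be necessary to merge the versions you keep into a fresh base first (for example with `clearml.Dataset.squash(...)`) before deleting anything. Check the SDK reference for the exact arguments before relying on any of this.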

  				
Posted one year ago by SmugDolphin23
