Hi, we are seeing a growing number of cases where a job spends a long time in setup before actual training (GPU utilisation) begins. After observing several runs, these are the steps and the bottlenecks we found.

1. The job is submitted to ClearML.
2. ClearML spawns a K8S pod via k8sGlue (within 30 secs).
3. The pod is set up and runs the script (takes up to 5 mins).
4. The script uses ClearML-data to pull a versioned dataset (took 30 mins due to the size of the dataset). The backend is S3, but is it suitable/compatible with ClearML-data's data pulling strategies?
5. Batch/Preprocess/Train.
The questions I am asking now are:
Are there other best practices for the data pulling part, especially for experiments that use the exact same dataset (e.g. a cache)?
What kind of storage should we use with ClearML? (Today we use S3.)
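
To make the caching question concrete, here is a rough sketch of the kind of thing we have in mind. It is not something we have running. It assumes a shared volume (for example a PVC) is mounted into every pod at the same path. The names `get_cached_dataset`, `cache_root` and `fetch` are made up for illustration. `fetch` stands in for whatever actually pulls the data, for instance a wrapper around ClearML's `Dataset.get(...).get_local_copy()`.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def get_cached_dataset(
    dataset_id: str,
    cache_root: Path,
    fetch: Callable[[Path], None],
) -> Path:
    """Return a directory holding the dataset, downloading only on a cache miss.

    `fetch` is a hypothetical callable that writes the dataset files into the
    directory it is given (e.g. by copying from a ClearML-data local copy).
    """
    target = cache_root / dataset_id
    if target.is_dir():
        # Cache hit: an earlier experiment already pulled this dataset version.
        return target

    cache_root.mkdir(parents=True, exist_ok=True)
    # Download into a private staging directory so that other pods never see
    # a half-written dataset under the final name.
    staging = Path(tempfile.mkdtemp(prefix=f"{dataset_id}.", dir=cache_root))
    try:
        fetch(staging)
        try:
            # Rename within the same filesystem publishes the data in one step.
            os.rename(staging, target)
        except OSError:
            # Another pod may have published the same dataset first.
            if not target.is_dir():
                raise
    finally:
        # No-op after a successful rename; cleans up on failure or lost race.
        shutil.rmtree(staging, ignore_errors=True)
    return target
```

With something like this, only the first experiment on a given dataset version would pay the 30 min pull and later pods would start from the shared copy. Whether ClearML already does the equivalent for us is part of what we are asking.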

  				
Posted 2 years ago by SubstantialElk6
				