Hi, we are encountering an increasing number of cases where it takes quite a while before actual training (GPU utilisation) begins. After observing a few runs, this is what we found. The steps and their bottlenecks are:
- Job submitted to ClearML
- ClearML spawns a K8s pod via k8sGlue (within 30 secs)
- Pod is set up and runs the script (takes up to 5 mins)
- Script uses clearml-data to pull the versioned dataset (took 30 mins due to the size of the dataset). The backend is S3, but is it suitable/compatible with clearml-data's data-pulling strategies? Our pull step looks roughly like the sketch after this list.
- Batch / Preprocess / Train
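For reference, this is roughly what our data-pull step does (project and dataset names below are placeholders, not our real ones):

```python
from clearml import Dataset

# Resolve the versioned dataset that was registered with clearml-data.
dataset = Dataset.get(
    dataset_project="our_project",   # placeholder
    dataset_name="training_data",    # placeholder
)

# Downloads the full dataset from the S3 backend into the local ClearML cache.
# This is the step that currently takes ~30 minutes inside the pod.
local_path = dataset.get_local_copy()
print(f"Dataset available at: {local_path}")
```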
The questions I am asking now are:
- Are there other best practices for the data-pulling part, especially for experiments that use the exact same dataset (e.g. a cache, as in the sketch below)?
- What kind of storage should we use with ClearML? (Today it is S3.)
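To make the cache question concrete, what we had in mind is something like pointing the ClearML cache at a persistent volume mounted into the pod, so that experiments reusing the same dataset version do not re-download it from S3. A minimal sketch, assuming a volume mounted at /mnt/clearml-cache (the path, and setting CLEARML_CACHE_DIR from Python rather than in the pod template, are our assumptions):

```python
import os

# Assumption: a persistent volume (e.g. a PVC) is mounted into the pod here.
# In practice this env var would more likely be set in the pod template.
os.environ["CLEARML_CACHE_DIR"] = "/mnt/clearml-cache"

from clearml import Dataset

dataset = Dataset.get(
    dataset_project="our_project",   # placeholder
    dataset_name="training_data",    # placeholder
)

# If this dataset version is already in the shared cache, this should be a
# cheap lookup instead of a full ~30-minute pull from S3.
local_path = dataset.get_local_copy()
```

Is this a sensible direction, or is there a recommended ClearML-native way to share the cache across experiments?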