Hi all, I'm training a model using AWS SageMaker and monitoring with a ClearML server on-prem. Works well enough when the training is split (Horovod, with a task on each rank). But when I try and spawn eval jobs to run on different AWS machines, it seems
IrateDolphin19, can you explain a bit more about what you're doing and how, and what seems to fail on the ClearML side? How do you create the tasks and manage them?