Hi WackyRabbit7
Currently we don’t have a GCP autoscaler. We’re more than happy to get contributions for GCP and other platforms.
The AWS autoscaler code is pretty generic, and in order to use it for GCP you need to implement a GCPAutoScaler similar to trains.automation.aws_auto_scaler.AWSAutoScaler, which basically has spin_up_worker() and spin_down_worker() methods...
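For anyone picking this up, here is a rough sketch of the shape such a class could take. It is not the real trains code: AutoScalerSketch is a hypothetical stand-in for the generic base class, the method parameters (resource, worker_id_prefix, queue_name, instance_id) are assumptions, and the GCP calls are replaced by an in-memory dict so the example runs without any cloud credentials.

```python
from abc import ABC, abstractmethod
from typing import Dict


class AutoScalerSketch(ABC):
    """Hypothetical stand-in for the generic autoscaler base (NOT the real trains class)."""

    @abstractmethod
    def spin_up_worker(self, resource: str, worker_id_prefix: str, queue_name: str) -> str:
        """Start one worker for `queue_name` and return its instance id."""

    @abstractmethod
    def spin_down_worker(self, instance_id: str) -> None:
        """Stop the worker identified by `instance_id`."""


class GCPAutoScaler(AutoScalerSketch):
    """Illustrative only: tracks fake instances in memory instead of calling GCP."""

    def __init__(self) -> None:
        self.instances: Dict[str, str] = {}  # instance id -> queue it serves
        self._counter = 0

    def spin_up_worker(self, resource: str, worker_id_prefix: str, queue_name: str) -> str:
        # A real implementation would create a Compute Engine VM here and
        # start an agent on it that listens to `queue_name`.
        self._counter += 1
        instance_id = f"{worker_id_prefix}-{resource}-{self._counter}"
        self.instances[instance_id] = queue_name
        return instance_id

    def spin_down_worker(self, instance_id: str) -> None:
        # A real implementation would delete the VM here.
        self.instances.pop(instance_id, None)
```

The two methods are the only platform-specific part in this sketch, so the actual GCP calls would go where the comments are; check the AWSAutoScaler source for the exact signatures before copying these.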
I'd be lying if I said I had time for that 🙂
This is a nice issue to open in https://github.com/allegroai/trains :)
Hi TimelyPenguin76! Quick question: has ClearML ever considered using AWS Batch for the autoscaling part? Submitting "training jobs" to AWS Batch would also create and terminate EC2 instances automatically, instead of ClearML having its own logic to spin instances up and down.
Hi BrightElephant64, can you add an example? Also, the ClearML AWS autoscaler knows how to work with ClearML-agent queues
Ah ok! I think the link between ClearML agent queues and the autoscaler is also important for resource monitoring etc. Sorry, I'm quite new to ClearML so I'm still trying to understand the architecture 🙈. AWS Batch does have its own queue system, which it uses to allocate jobs to training instances, but it's not customizable