AgitatedDove14 I think the major issue is working out how to get the setup of the node dynamically passed to the VMSS, so that when it creates a node it does the following:
1. Provisions the correct environment for the clearml-agent.
2. Installs the clearml-agent and sets up the clearml.conf file with the access credentials for the server and file storage.
3. Executes the clearml-agent on the correct queue, ready for accepting jobs.
In Azure VMSS, there is a feature called "Custom Data", which is basically a way of passing things to be executed while a node is provisioned. It does this using a system called cloud-init, which can be viewed as a really clever way of modifying and manipulating VM OSes programmatically. However, this sits at the VMSS level, not the node level, so if you set it, every node will be provisioned with the same execution script. This obviously makes it challenging to dynamically update the clearml.conf file that we would want to create for each node.
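To make that concrete, here is a minimal sketch of the kind of custom data payload I have in mind. The cloud-init contents, the clearml.conf fields and the `render_custom_data` helper are all illustrative placeholders, not a tested config:
```python
import base64

# Hypothetical cloud-init payload a scaler could render per node.
# All values (server URL, credentials, queue name) are placeholders.
CLOUD_INIT_TEMPLATE = """#cloud-config
runcmd:
  - pip install clearml-agent
  - |
    cat > /root/clearml.conf <<EOF
    api {{
        api_server: {api_server}
        credentials {{
            access_key: "{access_key}"
            secret_key: "{secret_key}"
        }}
    }}
    EOF
  - clearml-agent daemon --queue {queue} --detached
"""

def render_custom_data(api_server: str, access_key: str,
                       secret_key: str, queue: str) -> str:
    """Fill in the per-node values and base64-encode the result,
    since Azure expects custom data as base64."""
    payload = CLOUD_INIT_TEMPLATE.format(
        api_server=api_server,
        access_key=access_key,
        secret_key=secret_key,
        queue=queue,
    )
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")
```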
One potential way I can see around this problem is this:
Following the AWS example, it may be possible to update the VMSS object so that, before a new compute resource is appended, the VMSS custom data is updated with the new node's information: node name, ClearML credentials, Azure storage credentials, etc. This would limit scaling to one node at a time, but that might not be a problem.
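A sketch of that update-then-scale step, assuming the azure-mgmt-compute and azure-identity SDKs (the custom data would come from something like the `render_custom_data` helper above; all names are placeholders):
```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

def add_node(subscription_id: str, resource_group: str,
             vmss_name: str, custom_data_b64: str) -> None:
    """Update the VMSS custom data, then grow capacity by one."""
    client = ComputeManagementClient(DefaultAzureCredential(),
                                     subscription_id)
    vmss = client.virtual_machine_scale_sets.get(resource_group, vmss_name)

    # If I'm reading the Azure docs right, updated custom data only
    # applies to instances created *after* the update, so setting it
    # just before scaling should target only the next node.
    vmss.virtual_machine_profile.os_profile.custom_data = custom_data_b64
    vmss.sku.capacity += 1  # one node at a time

    client.virtual_machine_scale_sets.begin_create_or_update(
        resource_group, vmss_name, vmss
    ).result()
```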
I did have an idea that environment variables could perhaps be used as an alternative, such that everything you would want to dynamically update in the clearml.conf file could be plucked from the OS environment. That would still require a way of dynamically passing these values in when a node is provisioned, which I would need to investigate.
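For what it's worth, I believe clearml-agent already picks up some settings from the environment (things like CLEARML_API_HOST, CLEARML_API_ACCESS_KEY and CLEARML_API_SECRET_KEY), so a full clearml.conf might not even be needed; I'd have to verify that. Either way, the template idea would look roughly like this sketch, assuming the provisioning step has exported the variables:
```python
import os
from string import Template

# Hypothetical template: anything node-specific becomes an env lookup.
CONF_TEMPLATE = Template("""\
api {
    api_server: ${CLEARML_API_HOST}
    credentials {
        access_key: "${CLEARML_API_ACCESS_KEY}"
        secret_key: "${CLEARML_API_SECRET_KEY}"
    }
}
""")

def write_conf(path: str = "/root/clearml.conf") -> None:
    """Pluck the per-node values from the OS environment and write the
    config; raises KeyError if a variable was never passed through."""
    with open(path, "w") as f:
        f.write(CONF_TEMPLATE.substitute(os.environ))
```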