So, what I am referring to is the ability of a system to allow some rigor and robustness in tracking experiments, and also to enforce some thought on how things might be deployed, early on in the development process, whilst not being overly prescriptive and cumbersome
I cannot agree more!!
VivaciousPenguin66 We are working on trying to better understand how to solve this very delicate balancing act and offer some sort of "JIRA" for ML.
If this is okay with you, once product people have concrete ideas to share, I'll connect them with you. This is a very exciting topic to try and nail down, and we would love more feedback from the community and real-world use cases.
AgitatedDove14 that started out a lot shorter, and I read it twice, but I think it answers your question..... 😉
AgitatedDove14 ,
Often questions are asked at the beginning of a data science project, such as "how long will that take?" or "what are the chances it will work to this accuracy?".
To the uninitiated, these would seem like relatively innocent and easy-to-answer questions. If a person has a project management background in areas with more clearly defined technical tasks, like software development or mechanical engineering, then the work packages and the uncertainties relating to outcomes are often much smaller and more clearly defined.
However, anyone who has attempted to get information out of complex and imperfect data knows that the power of any model, and the success of any given project, is largely dependent on the data. A lot of the aspects of the data are generally unknown prior to undertaking a project, so the risk at the beginning of any data science project is large. It is large from both a time-vs-reward point of view and a final-result point of view, both of which are highly uncertain. The key to successful projects at this point is to rapidly understand the data to a point where you can start to reduce these uncertainties.
At the beginning of the project you are focused solely on this, and less on code quality, how easy it is to deploy, etc. Because of this you cannot be too rigid in how you define the process of doing the work (that is, writing code) and providing results, as the possible range of outcomes from these processes can be large. It's no surprise that applications like Jupyter Notebooks are so popular, because they provide the ability to code fast and visualise results quickly, inline with the code, as an aid to reduce the lead time to data understanding.
As data scientists we spend a lot of time at that end of the spectrum, looking at data and visualising it in ad hoc ways to determine the value and the power of the data. The main focus here is understanding, not production-ready code. And because fewer projects make it to deployable models, we as a group are not as experienced at deployment as we are at the beginning bit I describe above. This is likely a key factor in why it takes organisations a lot of work to take development models into production: the person developing those models isn't really thinking about deployment, or doesn't even have much experience to put things into context during the development phase.
So, what I am referring to is the ability of a system to allow some rigor and robustness in tracking experiments, and also to enforce some thought on how things might be deployed, early on in the development process, whilst not being so prescriptive and cumbersome that it takes away from the effort to understand the data. That is a very valuable thing indeed to have. It balances the need for quick answers at the beginning with, hopefully, a considerably easier journey to deployment, should a project come to fruition and add value to the particular problem being solved.
I find it quite difficult to explain these ideas succinctly, did I make any sense to you?
Yep, I think we are totally on the same wavelength 🙂
However, it also seems to be not too prescriptive,
One last question, what do you mean by that?
So, AgitatedDove14 what I really like about the approach with ClearML is that you can genuinely bring the architecture into the development process early. That has a lot of desirable outcomes, including versioning and recording of experiments, dataset versioning, etc. It would also enforce a bit more structure in project development, if things are required to fit into a bit more of a defined box (or boxes). However, it also seems to be not too prescriptive, such that I wouldn't worry that a lot of effort has to go into getting things running beyond what would be needed during a development cycle.
What we want to achieve is robustness and speed of deployment, but not at the expense of being overly rigorous and prescriptive in an early development cycle, when most of the effort should be going into the development and understanding of the problem, rather than the mechanics of getting something running that is more deployment friendly. The risk here is that you spend a lot of effort on things that really don't matter if the project doesn't go anywhere. This has to be weighed up against making the path to deployment easier and more efficient. It's a balancing act, but I am starting to see how something like ClearML might tread that fine line and be useful across the range of data science projects, from the very research-and-development end to the model deployment end.
I find it quite difficult to explain these ideas succinctly, did I make any sense to you?
Sounds good.
BTW, when the clearml-agent is set to use "conda" as the package manager, it will automatically install the correct cudatoolkit in any new venv it creates. The cudatoolkit version is picked up directly when "developing" the code, assuming you have conda installed as your development environment (basically you can transparently do end-to-end conda, and not worry about CUDA at all)
I think I failed in explaining myself. I meant: instead of multiple CUDA versions installed on the same host/docker, wouldn't it make sense to just select a different out-of-the-box docker with the right CUDA, directly from the public nvidia dockerhub offering? (This is just another argument on the Task that you can adjust.) Wouldn't that be easier for users?
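For example, something along these lines, assuming the agent runs in docker mode (the image tag is just an illustration from the public nvidia listing, and the project/task names are placeholders):

```python
# Hedged sketch: select the CUDA flavour per task by pointing it at an
# out-of-the-box nvidia image; the agent (in docker mode) then runs the
# task inside that image.
from clearml import Task

task = Task.init(project_name="examples", task_name="cuda-specific-run")

# Request an image that already ships the CUDA version this task needs
task.set_base_docker("nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
```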
Absolutely aligned with you there AgitatedDove14. I understood you correctly.
My default is to work with native VM images, and conda environments, and thus, when I wanted a VM with multiple CUDA versions, I created an image which had multiple CUDA versions installed, as well as Conda for environment and package management, and JupyterHub for serving Notebook and Lab.
However, I now realise that serving containers with the specific version of CUDA is the way to go.
What we would like ideally, is a system where development, training, and deployment are almost one and the same thing, to reduce the lead time from development code to production models.
This is very aligned with the goals of ClearML 🙂
I would like to understand more about what is currently missing in ClearML so we can better support this approach
my inexperience in using them a lot until recently. I can see how that is a better solution
I think I failed in explaining myself. I meant: instead of multiple CUDA versions installed on the same host/docker, wouldn't it make sense to just select a different out-of-the-box docker with the right CUDA, directly from the public nvidia dockerhub offering? (This is just another argument on the Task that you can adjust.) Wouldn't that be easier for users?
This is very cool, any reason for not using dockers with the multiple CUDA versions?
AgitatedDove14 my inexperience in using them a lot until recently. I can see how that is a better solution, and it's something I am actively trying to improve my understanding of, and use of.
I am now relatively comfortable with producing a Dockerfile, for example, although I've not got as far as making any docker-compose related things yet.
What I really like about ClearML is the potential for capturing development at an early stage, as it requires only minimal adjustment of code for it to be, at the very least, captured as an experiment, even if it is run locally on one's machine.
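For example, as far as I understand it, capturing a local run takes only something like this (project/task names here are placeholders):

```python
# Hedged sketch: two extra lines and the local run is recorded as an experiment
# on the clearml-server; the rest of the script stays unchanged.
from clearml import Task

task = Task.init(project_name="data-exploration", task_name="first-look-at-dataset")

# ... existing analysis / training code continues as before ...
```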
What we would like ideally, is a system where development, training, and deployment are almost one and the same thing, to reduce the lead time from development code to production models. Removing as many translation layers as you can between the development and the serving process means things are easier to maintain, and when things go wrong, you have fewer degrees of freedom to consider. Machine learning models are complex beasts, as you have to consider the model parameters, the data and all the pre-processing, and then if you have additional translation layers for deployment, all of those have to be considered before diagnosing any problems.
My view is that if you can make those layers as few and as transparent as possible, as well as allowing very easy comparison between experiments (and that's everything, model, data, code, environment etc.), then hopefully you can very quickly identify things that have changed, and where to investigate if a model is not performing as expected.
I should say, the company I am working for, Malvern Panalytical, is developing an internal MLOps capability, and we are starting to develop a containerized deployment system for developing, training and deploying machine learning models. Right now we are at the early stages of development, and our current solution is based on using Azure MLOps, which I personally find very clunky.
So I have been tasked with investigating alternatives to replace the training and model deployment side of things.
The likely solution will involve the use of Prefect for containerized pipelines, and then interfacing with various systems, for example, ClearML, for better handling of model development, training and deployment.
Hopefully once things calm down at work I will find more time.
Sounds good 🙂
I made a custom image for the VMSS nodes, which is based on Ubuntu and has multiple CUDA versions installed, as well as conda and docker pre-installed.
This is very cool, any reason for not using dockers with the multiple CUDA versions?
So I've been testing bits and pieces individually.
For example, I made a custom image for the VMSS nodes, which is based on Ubuntu and has multiple CUDA versions installed, as well as conda and docker pre-installed.
I've managed to test the setup script, so that it executes on a pristine node and results in a compute node being added to the relevant queue, but that's been executed manually by me, as I have the credentials to log on via SSH.
And I had to get the clearml-server set up the manual way.
Hopefully once things calm down at work I will find more time.
I think so.
I am doing this with one hand tied behind my back at the moment because I am waiting to get an Azure AD App and Services policy set up, to enable the autoscaler to authenticate with the Azure VMSS via the Python SDK.
So when the agent fires up it gets the hostname, which you can then get from the API,
I think it does something like "getlocalhost", a Python function that is OS agnostic
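If it helps, I assume the standard-library call would be something like this (my guess, not a quote from the agent code):

```python
# Hedged sketch: an OS-agnostic way to read the machine hostname in Python,
# roughly what the default worker id would be derived from.
import socket

print(socket.gethostname())
```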
Looking at the supervisor method of the base AutoScaler class, where are the worker IDs kept? Is it in the class attribute queues?
Actually the supervisor is passing a fixed prefix, then it asks the clearml-server for workers starting with this name.
This way we can have a fixed init script for all agents, while we can still differentiate them from the other agent instances in the system. Make sense?
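As a rough sketch of that idea (the prefix value is just an example, and the exact API calls are an assumption on my part):

```python
# Hedged sketch: list the workers registered with the clearml-server and keep
# only those whose id starts with the fixed prefix set in the agents' init script.
from clearml.backend_api.session.client import APIClient

WORKER_PREFIX = "azure-vmss-agent"  # example prefix

client = APIClient()
workers = client.workers.get_all()
ours = [w for w in workers if w.id.startswith(WORKER_PREFIX)]
print([w.id for w in ours])
```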
Oh cool!
So when the agent fires up it gets the hostname, which you can then get from the API, and pass it back to take down a specific resource if it is deemed idle?
However, that would mean passing back the hostname to the Autoscaler class.
Sorry, my bad, the agent does that automatically in real-time when it starts; no need to pass the hostname, it takes it from the VM (usually they have some random number/id)
Yes, there's an internal provisioning for it from the Azure VMSS.
However, that would mean passing back the hostname to the Autoscaler class.
Right now as it's written, the spin_up_worker method doesn't update the class attributes. Following the AWS example that is also the case, where I can see it merely takes the arguments given, such as worker id, and constructs a node with those parameters, e.g. hostname etc.
Looking at the supervisor method of the base AutoScaler class, where are the worker IDs kept? Is it in the class attribute queues?
So if you set it, then all nodes will be provisioned with the same execution script.
This is okay in a way, since the actual "agent ID" is by default set based on the machine hostname, which I assume is unique?
In Azure VMSS, there is a method called "Custom Data", which is basically a way of passing things to be executed
I know that it is on the to-do list to add an "azure_autoscaler" which is basically a sibling to the aws_autoscaler.
With the same idea of the "custom data" as initial bash script:
You can check here:
https://github.com/allegroai/clearml/blob/4a2099b53c09d1feaf0e079092c9e075b43df7d2/clearml/automation/aws_auto_scaler.py#L54
AgitatedDove14 I think the major issue is working out how to get the setup of the node dynamically passed to the VMSS so when it creates a node it does the following:
1. Provisions the correct environment for the clearml-agent.
2. Installs the clearml-agent and sets up the clearml.conf file with the access credentials for the server and file storage.
3. Executes the clearml-agent on the correct queue, ready for accepting jobs.
In Azure VMSS, there is a method called "Custom Data", which is basically a way of passing things to be executed during the provisioning of the node. It does this using a system called cloud-init, which can be viewed as a really clever way of modifying and manipulating VM OS's in a programmatic way. However, this sits at a VMSS level, and not a node level. So if you set it, then all nodes will be provisioned with the same execution script. This obviously makes it challenging to dynamically update the clearml.conf file that we would want to create for each node.
One potential way I can see around this problem is this:
Following the AWS example, it may be possible to update the VMSS object such that, before appending a new compute resource, the VMSS custom data is updated with the new node information, that is, node name, credentials, Azure storage credentials etc. This would limit the scaling to 1 node at a time, but that might not be a problem. Something along these lines, perhaps:
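(A rough sketch only: the resource names are placeholders and the exact method names may vary between azure-mgmt-compute versions.)

```python
# Hedged sketch: patch the VMSS custom data with a node-specific init script,
# then grow the scale set by one so the new node is provisioned with it.
import base64

from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

subscription_id = "<subscription-id>"  # placeholder
resource_group = "<resource-group>"    # placeholder
vmss_name = "<vmss-name>"              # placeholder

client = ComputeManagementClient(DefaultAzureCredential(), subscription_id)
vmss = client.virtual_machine_scale_sets.get(resource_group, vmss_name)

init_script = "#!/bin/bash\n# write clearml.conf and start clearml-agent here\n"
vmss.virtual_machine_profile.os_profile.custom_data = base64.b64encode(
    init_script.encode()
).decode()

# Scale out by a single node so only the new node picks up this custom data
vmss.sku.capacity += 1
client.virtual_machine_scale_sets.begin_create_or_update(
    resource_group, vmss_name, vmss
).result()
```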
I did also have an idea that maybe environment variables could be used as an alternative, such that all the things you would want to dynamically update in the clearml.conf could perhaps be plucked from the OS environment. That would require a way of passing these values through dynamically so they are set at the time of provisioning a node, which I would need to investigate.
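Something like this, perhaps (the variable names are the standard ClearML ones as far as I know; everything else is a placeholder):

```python
# Hedged sketch: inject the node-specific ClearML settings as environment
# variables and start the agent, instead of writing them into a shared clearml.conf.
import os
import subprocess

env = dict(os.environ)
env.update({
    "CLEARML_API_HOST": "https://api.clearml.example.com",      # placeholder
    "CLEARML_WEB_HOST": "https://app.clearml.example.com",      # placeholder
    "CLEARML_FILES_HOST": "https://files.clearml.example.com",  # placeholder
    "CLEARML_API_ACCESS_KEY": "<access-key>",                   # placeholder
    "CLEARML_API_SECRET_KEY": "<secret-key>",                   # placeholder
})

# Start the agent on the target queue with the per-node credentials injected
subprocess.run(["clearml-agent", "daemon", "--queue", "default", "--detached"], env=env)
```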
think perhaps it came across as way more passive aggressive than I was intending.
Dude, you are awesome for saying that! no worries 🙂 we try to assume people have the best intention at heart (the other option is quite depressing 😉 )
I've been working on an Azure load balancer example, ...
This sounds exciting, let me know if we can help in any way
AgitatedDove14 apologies, I read my previous message and I think perhaps it came across as way more passive aggressive than I was intending. Amazing how missing a few words from a sentence can change the entire meaning! 😀
What I meant to say was, it's going to be a busy few months for us whilst we move house, so I didn't want to say I'd contribute and then disappear for two months!
I've been working on an Azure load balancer example, heavily based on the AWS example. The load balancing part is relatively straightforward using the Azure Python SDK, which can connect to an existing Virtual Machine Scale Set and request new nodes to be scaled in or out, but unfortunately I cannot get the user permissions yet on our Azure subscription for adding an AD app and service principal. This is the way that Azure allows authentication of a service: it does this by giving an AD app and service principal certain user roles, which allow it to control resources. In the case of the load balancer, it is needed to allow the load balancer to connect to a Virtual Machine Scale Set and scale the cluster up or down. I've got the bash script for the node setup more or less right I think, but without those access permissions I am not able to test it all together. Of course, it needs to be executed on the server head so that the ClearML credentials can be passed, but that server also needs the credentials for connecting with the Azure system to control the cluster.
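Once the AD app / service principal is granted, my understanding is the authentication side would look roughly like this (all ids and names below are placeholders):

```python
# Hedged sketch: authenticate as the AD app / service principal, then use the
# compute client to inspect (or later resize) the scale set.
from azure.identity import ClientSecretCredential
from azure.mgmt.compute import ComputeManagementClient

credential = ClientSecretCredential(
    tenant_id="<tenant-id>",
    client_id="<ad-app-client-id>",
    client_secret="<ad-app-client-secret>",
)
compute_client = ComputeManagementClient(credential, "<subscription-id>")

vmss = compute_client.virtual_machine_scale_sets.get("<resource-group>", "<vmss-name>")
print("current capacity:", vmss.sku.capacity)
```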
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything.
Sure man 🙂 no rush, I appreciate the gesture regardless of the outcome
Many thanks!
AgitatedDove14 I would love to help the project.
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything. However, once settled in the new place, and I have a bit more time, I would very much welcome contributing.