Sure, with clearml and clearml-agent you get autoscaling for your machines (with monitoring) and automation for your tasks that handles everything for you (Docker images, credential management, …).
There are many more parts to the system, so maybe you can share a use case so I can help you with it?
Hi FloppyDeer99 ,
after pulling it from a queue in agent
What do you mean exactly?
Hi SuccessfulKoala55, maybe this was not a good question. In other words, the docs say that ClearML Open Source supports orchestration; where can I find the related code? And what is the role of clearml-agent in orchestration: a combination of kube-scheduler and kubelet?
Hi FloppyDeer99 ,
In other words, the docs say that ClearML Open Source supports orchestration; where can I find the related code?
You can find many examples at https://clear.ml/docs/latest/docs/getting_started/mlops/mlops_first_steps/ . If you have a specific use case you want to check, please share it and I can send an example.
And what is the role of clearml-agent in orchestration: a combination of kube-scheduler and kubelet?
ClearML Agent is an MLOps tool that lets users run jobs on any machine, with or without any changes (full architecture: https://clear.ml/docs/latest/docs/getting_started/architecture/ )
Hi TimelyPenguin76 , thanks for your reply. I am still confused about ClearML's orchestration. Could you describe it in detail?
For example, I have a lot of tasks in a queue, but there is a default agent (services, mode = daemon) after using Helm to deploy ClearML in Kubernetes. As far as I know, the agent pulls and executes the tasks in the specified queue one by one, so the other tasks are blocked even though the cluster has enough physical resources. So how can ClearML improve execution efficiency? I could solve it by implementing a custom scheduler that watches the queue, pulls each task, and sends it to a remote environment, which prepares the prerequisites and runs the agent in execute mode. But the docs say that ClearML Open Source has an orchestration feature. Does the orchestration support the above situation?
Hi FloppyDeer99 ,
It depends on your setup:
On-prem machines: you can start more than one clearml-agent on the machine that has the resources and assign, for example, each GPU on the machine to its own agent (see https://clear.ml/docs/latest/docs/clearml_agent#docker-mode ).
Cloud machines: you can do the same, and if you are using AWS you can run the AWS autoscaler ( https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler/ ) as a service.
K8s: there is a great example of the k8s glue at https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
I can solve it by implementing a custom scheduler which is used to watch the queue, pull the task and send it to a remote environment. This environment will prepare the prerequisites and run agent in execute mode.
Each of the above should do it for you 🙂
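A minimal sketch of the on-prem option above, assuming a machine with two GPUs and a queue named default (both are illustrative). It only builds and prints one agent command line per GPU; it does not start anything. The helper name agent_commands is hypothetical, and the flags used (--detached, --queue, --gpus, --docker) should be checked against `clearml-agent daemon --help` for your installed version.

```python
import shlex
from typing import List, Optional


def agent_commands(
    queue: str,
    gpu_ids: List[int],
    docker_image: Optional[str] = None,
) -> List[List[str]]:
    """Build one `clearml-agent daemon` command per GPU (hypothetical helper).

    Every agent listens on the same queue, so several queued tasks can run
    at once, each pinned to its own GPU.
    """
    commands = []
    for gpu in gpu_ids:
        cmd = [
            "clearml-agent", "daemon",
            "--detached",          # run the agent in the background
            "--queue", queue,      # queue this agent pulls tasks from
            "--gpus", str(gpu),    # GPU made visible to this agent
            "--docker",            # docker mode; the image argument is optional
        ]
        if docker_image:
            cmd.append(docker_image)
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of launching them.
    for cmd in agent_commands("default", [0, 1]):
        print(shlex.join(cmd))
```

With more than one agent on the queue, tasks no longer wait behind a single worker, which is the blocking described earlier in the thread.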
Thanks TimelyPenguin76 , I think I got it, but I have new questions after reading the README of ClearML Agent. I just posted them in the community channel; could you have a look? https://clearml.slack.com/archives/CTK20V944/p1626251661217200