Hi AgitatedTurtle16,
In the clearml-server-k8s repository ( https://github.com/allegroai/clearml-server-k8s ), you can find examples of ClearML Agent deployment, either as a simple, single service (part of the clearml-server-chart, https://github.com/allegroai/clearml-server-k8s/tree/master/clearml-server-chart ; see https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-chart/templates/clearml-agent-deployment.yaml ) or using a more scalable Agent Group approach (in clearml-server-cloud-ready, https://github.com/allegroai/clearml-server-k8s/tree/master/clearml-server-cloud-ready ; see https://github.com/allegroai/clearml-server-k8s/blob/master/clearml-server-cloud-ready/templates/deployment-agent.yaml ). Is that what you were looking for?
Sorry, but no. I already have the ClearML Agent running as a pod. My question is how to use it to manage my experiments (docker containers). Simply put, let's say:
I have an experiment (some TensorFlow code). I containerized my code in a Docker container, and inside the container I have already set the credentials for my ClearML server (I can see logs, plots, artifacts, etc.).
Now I am using TFJobs to run my experiment in the cluster ( https://www.kubeflow.org/docs/components/training/tftraining/ ). My question is: how can I use the ClearML Agent in this situation to schedule these experiments using queues etc.? We have hundreds of experiments from different teams and multiple resources (CPUs, DGX A100, MIGs, etc.). I want to use the ClearML Agent to manage all of that if possible, but I couldn't really understand how to do it.
Hi AgitatedTurtle16
"My question is how to use it to manage my experiments (docker containers). Simply put, let's say:"
So basically, once you see an experiment in the UI, you can launch it on an agent.
There is no need to containerize your experiment (actually, that's kind of the idea: removing the need to always containerize everything).
The agent will clone the code, apply uncommitted changes, and install the packages in the base container image at runtime.
This allows you to use off-the-shelf containers and not worry about anything.
Make sense?
AgitatedDove14 Hello, actually no. A concrete example of how to do it would be helpful.
For instance:
"So basically once you see an experiment in the UI, it means you can launch it on an agent."
But once I see it in the UI, it means it has already been launched somewhere, so I didn't quite get you.
Also, I want to launch my experiments on a Kubernetes cluster and I don't have any docs on how to do that, so an example would be helpful here. My use case is that anyone on my team, sitting at their laptop, can submit jobs to a remote Kubernetes cluster, and I want to be able to use an agent to take all these jobs and launch them on the cluster. I can use GitLab CI for that, for example.
"But once I see it in the UI, it means it has already been launched somewhere, so I didn't quite get you."
The idea is that you run it locally once (think debugging or testing your code).
While the code runs, the Task is created automatically; once it is in the system, you can clone and launch it.
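The clone-and-launch step described above can be sketched in Python. This is a minimal sketch, not an official recipe: it assumes the `clearml` package is installed and configured, the helper name `clone_and_enqueue`, the queue name, and the task ID are hypothetical, and the Task class is passed in as an argument so the flow can be read (and tried) without a server.

```python
from typing import Any


def clone_and_enqueue(task_cls: Any, source_task_id: str, queue_name: str, new_name: str) -> Any:
    """Clone an existing Task and put the clone on a queue that an agent is listening to.

    task_cls is expected to be clearml.Task (or an object with the same
    get_task / clone / enqueue class-level methods).
    """
    # Look up the Task that was created when the code ran locally.
    source = task_cls.get_task(task_id=source_task_id)
    # Create an editable copy of it (a draft Task in the ClearML system).
    cloned = task_cls.clone(source_task=source, name=new_name)
    # Push the copy onto the queue; an agent serving that queue picks it up.
    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned


# Usage against a real server (placeholder values, not taken from this thread):
#   from clearml import Task
#   clone_and_enqueue(Task, "<task-id-from-the-UI>", "<your-queue-name>", "rerun on the cluster")
```

The same clone and enqueue actions are also available from the UI, so the script is only needed when you want to automate the step.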
"Also, I want to launch my experiments on a Kubernetes cluster and I don't have any docs on how to do that, so an example would be helpful here."
We are working on documenting the full process; I'm hoping to have something in the next week or so.
Are you running Kubernetes as a service, or an on-prem Kubernetes?
"So my use case is that anyone on my team, sitting at their laptop, can submit jobs to a remote Kubernetes cluster, ..."
Yes this is exactly the scenario ClearML supports 🙂
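As a rough illustration of that scenario, the sketch below shows one way a script started on a laptop could hand itself over to a queue. It assumes the `clearml` package and its `Task.init` / `Task.execute_remotely` calls; the resource labels, queue names, and helper names are hypothetical and would need to match the queues your agents actually serve.

```python
from typing import Any

# Hypothetical mapping from a resource label to a ClearML queue name.
QUEUE_BY_RESOURCE = {
    "cpu": "cpu_queue",
    "a100": "dgx_a100_queue",
    "mig": "mig_queue",
}


def pick_queue(resource: str) -> str:
    """Return the queue name for a resource label, or raise ValueError if unknown."""
    if resource not in QUEUE_BY_RESOURCE:
        raise ValueError(f"unknown resource {resource!r}; expected one of {sorted(QUEUE_BY_RESOURCE)}")
    return QUEUE_BY_RESOURCE[resource]


def submit_from_laptop(task_cls: Any, project: str, name: str, resource: str) -> Any:
    """Register the running script as a Task and send it to the matching queue.

    task_cls is expected to be clearml.Task. With the real class,
    execute_remotely() enqueues the Task and, by default, stops the local
    process, so the training code after this call runs on the agent instead.
    """
    task = task_cls.init(project_name=project, task_name=name)
    task.execute_remotely(queue_name=pick_queue(resource))
    return task


# Usage at the top of a training script (placeholder values):
#   from clearml import Task
#   submit_from_laptop(Task, "my-project", "tf-experiment", "a100")
```

Keeping the resource-to-queue mapping in one place means each team only chooses a resource label, while the queue layout can change without touching their scripts.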
We use both: we have our on-prem cluster, and we have old clusters on GKE. Having it documented would be a big help for me.
For the on-prem cluster, you can check the k8s Helm charts; they can spin up agents for you (static agents).
For GKE, the best solution is the k8s glue:
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py