What I still don't get, is how you would create different queues, targeting different nodes with different GPUs, and having them using the appropriate Cuda image.
Looking at the template, I don't understand how that's possible.
Thanks, I'll try to understand how the default agent coming with the helm chart is configured and try to copy how to setup a different one from there then
I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agent on their respective nodes
Correct, (if this is running on k8s it is most likely be passed via env variables , CLEARML_WEB_HOST etc,)
SarcasticSquirrel56
if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist?
Basically the agent register themselves on your cleaml-server, and they register on which Queue(s) they listen to. In other words the interface to choose the different types of machines/gpus is by enqueue the Task to different queues.
For example: Queue(1): "CUDA11_GPUx1" , Queue(2): "CUDA10_GPUx1"
Make sense ?
EDIT:
I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agent on their respective nodes
It might be the case, I have to admit I can't remember how flexible the helm chart is in this manner ...
Hi AgitatedDove14 I have spent some time going through the helm charts but I admit I still haven't clear how things should work.
I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested to use).
Thanks Martin, so if I understand correctly, when I do the clearml-agent init command (I have to check the syntax), by providing the apiserver webeserver and fileserver url they'll be registered to the clearml cluster?
Hi Martin, thanks. My doubt is:
if I configure manually the pods for the different nodes, how do I make clearml server aware that those agents exist? This step is really not clear to me from the documentation (it talks about user, and it uses interactive commands which would mean entering in the agents manually) I will try also the k8s glue, but I would like first to understand how to configure a fixed number of agents manually
Right now I see the default agent that comes with the helm chart...
is how you would create different queues,
SarcasticSquirrel56 you can create them from the UI, when the server is already running
(if you are saying, how do I create them in the first installaiton, then yes you are correct, this is possible in the helm chart, I think 😞 )
So that agent on different nodes will probably require different cuda-version images.
That makes sense SarcasticSquirrel56
I would edit the helm chart (or deploy manually) based on a selector that will select the different nodes/gpus and assign the correct containers (i.e. matching CUDA versions to the diff GPUs / drivers)
BTW: you can also playaround with k8s glue, which would dynamically spin pods based on clearml Tasks.
wdyt?