This is the list of all the environment variables (starting with CLEARML) available in the Pod spawned by the K8s Glue Agent:
` CLEARML_MONGODB_PORT_27017_TCP_PORT
CLEARML_FILESERVER_PORT_8081_TCP_ADDR
CLEARML_ELASTIC_MASTER_PORT_9200_TCP
CLEARML_APISERVER_PORT_8008_TCP_PROTO
CLEARML_FILESERVER_PORT_8081_TCP_PORT
CLEARML_ELASTIC_MASTER_SERVICE_PORT_TRANSPORT
CLEARML_WEBSERVER_PORT_80_TCP
CLEARML_ELASTIC_MASTER_SERVICE_PORT
CLEARML_MONGODB_PORT_27017_TCP_ADDR
CLEARML_FILESERVER_PORT_8081_TCP_P...
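In case it's useful, here is a minimal sketch (plain Python, nothing ClearML-specific) of how I dumped them from inside the Pod:
```python
import os

# Print every environment variable injected into the Pod that starts with CLEARML
for key, value in sorted(os.environ.items()):
    if key.startswith("CLEARML"):
        print(f"{key}={value}")
```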
What I still don't get is how you would create different queues targeting different nodes with different GPUs, and have them use the appropriate CUDA image.
Looking at the template, I don't understand how that's possible.
Thanks, I'll try to understand how the default agent that comes with the Helm chart is configured, and then try to replicate that setup for a different one.
the experiment is supposed to run in this queue, but then it hangs in a pending state in the scheduler
At this point, I see a new queue in the UI:
especially for datasets (for the models and other files we were thinking of using the fileserver anyway)
I see... because the problem would be with permissions when creating artifacts to store in the "/shared" folder
well, there are already processes in place... we aim to migrate everything to ClearML, but we hoped we could do it gradually
I am not aware of how clearml-dataset works, but I'll have a look 🙂
If I now abort the experiment (which is in a pending state and not running) and re-enqueue it again -- no parameter modifications this time...
and I re-enqueue it to the CPU queue, I see that it is sent to the right queue, and after a few seconds the job enters a running state and it completes correctly
Thanks, in DM I sent you the conf we use to deploy the agents.
Hi Jack, yes we had to customize the default one for some tools we use internally
If you need to know the value of some of them, let me know. CostlyOstrich36 I wanted to avoid leaking access keys etc., so I removed the values
And yes, I am using the agents that come with the Helm chart from the ClearML repository
I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agents on their respective nodes
Hi Martin, I admit I don't know about MIG; I'll have to ask some of our engineers.
As for the memory, yes, the reasoning is clear. The main thing we'll have to figure out is how to define the limits, because we have nodes with quite different resources available and this might get tricky, but I'll try and let's see what happens 🙂
We actually plan to create different queues for different types of workloads; for now we are watching the actual usage to figure out what types of workloads make sense for us.
My understanding is that in Task.init you have a reuse_last_task_id argument (or similar name) that defaults to True. In that case, if your experiment wasn't "published" it will be overwritten (based on project and experiment name). However, if you do publish it, a new experiment will be created.
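A minimal sketch of what I mean (project and task names are just placeholders):
```python
from clearml import Task

# reuse_last_task_id defaults to True: re-running an unpublished experiment with the
# same project/task name overwrites the previous run instead of creating a new one.
# A published experiment is never overwritten; a new task is created instead.
task = Task.init(
    project_name="examples",
    task_name="my-experiment",
    reuse_last_task_id=True,  # set to False to always create a new task
)
```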
I have tried this several times and the behaviour is always the same. It looks like when I modify some hyperparameter and enqueue the experiment to a queue, things don't work unless I had previously set the value of k8s-queue to the name of the queue I want to use. If I don't modify the configuration (e.g. I abort, or reset the job and enqueue it again, or clone and enqueue it without modifying the hyperparameters), then everything works as expected.
Hi Martin, thanks for the explanation! I work with Maggie and help with the ClearML setup.
Just to be sure, currently, the PodTemplate contains:
resources:
  limits:
    nvidia.com/gpu: 1
you are suggesting to also add something like:
requests:
  memory: "100Mi"
limits:
  memory: "200Mi"
is that correct?
On a related note, I am a bit puzzled by the fact that all 4 GPUs are visible.
In the https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ , i...
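For what it's worth, this is roughly how I checked it from inside the task container (just a sketch; it assumes the NVIDIA container runtime exposes NVIDIA_VISIBLE_DEVICES, and either variable may simply be unset):
```python
import os

# With nvidia.com/gpu: 1 in the Pod template I would expect only one GPU
# to be exposed here, not all four.
print("NVIDIA_VISIBLE_DEVICES:", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
```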
I see in bitnami's gh-pages branch a file, https://github.com/bitnami-labs/sealed-secrets/blob/gh-pages/index.html , that does the redirect and contains:
` <html>
  <head>
    <meta http-equiv="refresh" content="0; url=...">
  </head>
  <p><a href="...">Redirect to repo index.yaml</a></p>
</html> `
A similar file is missing in the ` clearml-helm-chart ` ` gh-pages ` branch.
Hi AgitatedDove14 I have spent some time going through the helm charts, but I admit it's still not clear to me how things should work.
I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested to use).
I actually found out it was an indentation error 😅 and the credentials weren't being picked up
But shouldn't the path of the artifacts be a setting of the file server, and not of the agent?
Thanks Martin, so if I understand correctly, when I run the clearml-agent init command (I have to check the syntax), by providing the apiserver, webserver and fileserver URLs they'll be registered to the ClearML cluster?
also, if I clone an experiment on which I had to set the k8s-queue user property manually to run experiments on a queue, say cpu, and enqueue it to a different queue, say gpu, the property is not updated, and the experiment is enqueued in a queue with a random hash-like name. I either have to delete the attribute, or set it to the right queue name, before enqueuing it, to have it run in the right queue
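The workaround I ended up scripting looks roughly like this (a sketch; the task id is a placeholder, and I'd double-check the exact set_user_properties signature):
```python
from clearml import Task

# Placeholder id of the cloned experiment; "cpu" is the queue it should actually run in
task = Task.get_task(task_id="<cloned_task_id>")

# Point the k8s-queue user property at the right queue before enqueuing,
# so the K8s Glue agent does not fall back to a hash-named queue
task.set_user_properties({"name": "k8s-queue", "value": "cpu"})

Task.enqueue(task, queue_name="cpu")
```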
OK, it wasn't the clearml.conf settings...
In the deployment I was referring to the fileserver, apiserver, etc. with the internal Kubernetes DNS names.
I changed them to the one exposed to the users (the same I have in my local clearml.conf) and things work.
But I can't really figure out why that would be the case...