
Hi Martin, thanks. My doubt is:
if I manually configure the pods for the different nodes, how do I make the ClearML server aware that those agents exist? This step is really not clear to me from the documentation (it talks about a user, and it uses interactive commands, which would mean entering the agents manually). I will also try the k8s glue, but first I would like to understand how to configure a fixed number of agents manually
Thanks Martin, so if I understand correctly, when I run the clearml-agent init command (I have to check the syntax) and provide the apiserver, webserver and fileserver URLs, the agents will be registered with the ClearML cluster?
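For context, my understanding is that the agent doesn't need a separate registration step: it authenticates with the credentials written into its clearml.conf and shows up as a worker once it starts polling a queue. A sketch of the api section that clearml-agent init writes (hostnames and keys here are placeholders):

```
api {
    web_server: http://clearml.example.com:8080
    api_server: http://clearml.example.com:8008
    files_server: http://clearml.example.com:8081
    credentials {
        "access_key" = "YOUR_ACCESS_KEY"
        "secret_key" = "YOUR_SECRET_KEY"
    }
}
```

After that, running e.g. `clearml-agent daemon --queue default` should make the agent appear under Workers in the UI.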
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
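In case it helps, this is roughly how the token gets wired into the agent's environment (both values below are placeholders to substitute with your own):

```shell
# Placeholder values - substitute your GitHub username and the
# Personal Access Token generated from the GitHub settings page above.
# The agent uses these to clone private repositories over HTTPS.
export CLEARML_AGENT_GIT_USER="your-github-username"
export CLEARML_AGENT_GIT_PASS="ghp_your_personal_access_token"
```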
thanks for the help!
also, if I clone an experiment on which I had to manually set the k8s-queue user property to run it on a queue, say cpu, and then enqueue it to a different queue, say gpu, the property is not updated, and the experiment is enqueued in a queue with a random hash-like name. I either have to delete the property, or set it to the right queue name, before enqueuing it to have it run in the right queue
I have tried this several times and the behaviour is always the same. When I modify some hyperparameter and enqueue the experiment to a queue, things don't work unless I have previously set the value of k8s-queue to the name of the queue I want to use. If I don't modify the configuration (e.g. I abort or reset the job and enqueue it again, or clone and enqueue it without modifying the hyperparameters), then everything works as expected.
Hi SuccessfulKoala55 I can confirm that the "id-like" queue created by ClearML
actually corresponds to the id of the "k8s_scheduler" queue. So it looks like, instead of the experiment being submitted to the scheduler to be enqueued to the right queue, a new queue whose name corresponds to the id of the k8s_scheduler is created instead.
Hope this helps 🙂
Right now I see the default agent that comes with the helm chart...
yes, the curl returned a 503 error
the queues already exist, I created them through the UI.
Thanks, in DM I sent you the conf we use to deploy the agents.
Yes, I still see those errors, but queues are working :)
At this point, I see a new queue in the UI:
And yes these appear in the dropdown menu when I want to enqueue an experiment
OK, I'll report that, and see if we can get it fixed
but for SIGTERM you should be able to set cleanup steps, no?
CostlyOstrich36 so I don't have to write the clearml.conf?
I would like to set things up so that a data scientist working on a project doesn't have to know about buckets and that sort of thing... Ideally the server and the agents are configured with a default bucket...
So I set sdk.development.default_output_uri: <url to fileserver>
in the K8s Agent Glue pod, but for models to be uploaded,
data scientists still have to set output_uri=True
in the Task.init()
call...
otherwise artifacts are only stored on my laptop, and under "models" I see a URI like file:///Users/luca/etc.etc./
So I am not really sure what this setting does
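For anyone following along, this is roughly the clearml.conf fragment I mean (the fileserver URL is a placeholder); as far as I can tell it only takes effect when the task itself doesn't pass output_uri to Task.init():

```
sdk {
    development {
        # Default upload destination for models/artifacts when the task
        # doesn't specify output_uri itself (placeholder URL)
        default_output_uri: "http://clearml-fileserver.example.com:8081"
    }
}
```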
What I still don't get is how you would create different queues, targeting different nodes with different GPUs, each using the appropriate CUDA image.
Looking at the template, I don't understand how that's possible.
I can ping it without issues, but I am not sure the communication is set up correctly
OK I could connect with the SDK, so everything is working, I'd just like to get the right hosts shown in the UI when a new token is created
Thanks SuccessfulKoala55 . Any idea why going to the address https://allegroai.github.io/clearml-helm-charts
returns a 404 error?
For other repositories used in Argo CD examples (e.g. https://bitnami-labs.github.io/sealed-secrets , which is also hosted on GitHub), instead of returning a 404, the index.yaml page is loaded.
I suspect this might be the reason why I can't make it work with ClearML.
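One way to check, assuming Helm repos serve their index at the repo root (the curl line is meant to be run manually against a live network):

```shell
# Helm chart repos are indexed via <repo-url>/index.yaml, so a 404 on the
# bare repo URL in a browser doesn't by itself mean the repo is broken.
CHART_REPO="https://allegroai.github.io/clearml-helm-charts"
echo "check ${CHART_REPO}/index.yaml"
# curl -fsS "${CHART_REPO}/index.yaml" | head -n 3   # run manually
```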
thanks a lot 🙂 that was quick 🙂
the experiment is supposed to run in this queue, but then it hangs in a pending state in the scheduler