Now I go to the experiments, clone an experiment that I previously executed on my laptop. In the newly created experiment, I modify some parameters and enqueue the experiment in the CPU queue.
I see in bitnami's gh-pages branch a file https://github.com/bitnami-labs/sealed-secrets/blob/gh-pages/index.html to do the redirect that contains:
` <html>
<head> <meta http-equiv="refresh" content="0; url= ` ` "> </head> <p><a href=" ` ` ">Redirect to repo index.yaml</a></p> </html> `
A similar file is missing in the `gh-pages` branch of `clearml-helm-charts`.
The workaround that works for me is:
- clone the experiment that I ran on my laptop
- in the newly cloned experiment, modify the hyperparameters and configurations to my needs
- in user properties, set "k8s-queue" to "cpu" (or the name of the queue I want to use)
- enqueue the experiment to the same queue I just set...
When I do it that way, I can see in the K8sGlue pod for the cpu queue that the task has been correctly picked up:
` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping fo...
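For reference, the workaround steps (clone, edit, set the "k8s-queue" user property, enqueue) could probably be scripted with the ClearML SDK instead of the UI. This is only a minimal sketch: the helper name `clone_and_enqueue`, the `task_cls` argument (there so the function can be tried with a stub) and the example parameter name are my own assumptions, and I have not verified that the K8s Glue agent treats a property set this way the same as one set in the UI.

```python
from typing import Any, Dict, Optional


def clone_and_enqueue(
    source_task_id: str,
    queue_name: str,
    overrides: Optional[Dict[str, Any]] = None,
    task_cls: Any = None,
):
    """Clone a task, override parameters, set "k8s-queue", then enqueue it."""
    if task_cls is None:
        # Imported lazily so the helper can be exercised with a stub class.
        from clearml import Task as task_cls

    source = task_cls.get_task(task_id=source_task_id)
    cloned = task_cls.clone(source_task=source, name=f"{source.name} (clone)")

    # Parameter names use the "Section/name" form, e.g. "General/lr" (example only).
    for name, value in (overrides or {}).items():
        cloned.set_parameter(name, value)

    # "k8s-queue" is not a valid Python identifier, hence the ** unpacking.
    cloned.set_user_properties(**{"k8s-queue": queue_name})

    # Enqueue to the same queue that was just written to the user property.
    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned
```

Usage would be something like `clone_and_enqueue("<task id>", "cpu", {"General/lr": 0.01})`, where the task id and the parameter name are placeholders.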
Thanks Martin! If I end up having some time I'll dig into the code and check if I can bake something!
What I still don't get is how you would create different queues targeting different nodes with different GPUs, and have them use the appropriate CUDA image.
Looking at the template, I don't understand how that's possible.
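To make the question concrete, this is roughly what I would expect to need: one pod template per queue, each with its own node selector, image and GPU limit. It is only a sketch under assumptions. The queue names, node labels and image names below are hypothetical, and whether the chart lets each queue's agent load its own template is exactly the part I could not work out. The `nodeSelector`, `containers` and `nvidia.com/gpu` fields are standard Kubernetes ones.

```python
from typing import Any, Dict

# Hypothetical queue definitions: every name, label and image is an example only.
QUEUES: Dict[str, Dict[str, Any]] = {
    "gpu-large": {"node_label": "large-gpu", "image": "registry.example/cuda:11", "gpus": 1},
    "gpu-small": {"node_label": "small-gpu", "image": "registry.example/cuda:10", "gpus": 1},
    "cpu": {"node_label": "cpu-only", "image": "registry.example/python:3", "gpus": 0},
}


def pod_template(queue: Dict[str, Any]) -> Dict[str, Any]:
    """Build a pod template fragment pinning a queue to a node type and image."""
    container: Dict[str, Any] = {"name": "task", "image": queue["image"]}
    if queue["gpus"]:
        # Standard extended-resource name exposed by the NVIDIA device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": queue["gpus"]}}
    return {
        "spec": {
            # "node-type" is a made-up label key; use whatever labels the nodes carry.
            "nodeSelector": {"node-type": queue["node_label"]},
            "containers": [container],
        }
    }


templates = {name: pod_template(cfg) for name, cfg in QUEUES.items()}
```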
Ah sorry, I thought you were asking what the names of the queues I created look like (in case I used some weird character or something like that)
Thanks Martin. I'll add this and check whether it fixes the issue, but I don't quite get this, though. The local code doesn't need to import pandas, because the get method returns a DataFrame object that has a .loc accessor.
I was expecting the remote experiment to behave similarly, so why do I need to import pandas there?
If I now reset the experiment and enqueue it to the gpu queue (while in the experiment the user-properties configuration for k8s-glue is still set to cpu), the experiment is left in a Pending state, and in the K8sGlue agent for the gpu queue I can see an error similar to the one in the cpu agent...
` No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call...
Thanks Martin, so if I understand correctly, when I run the clearml-agent init command (I have to check the syntax) and provide the apiserver, webserver and fileserver URLs, they'll be registered to the clearml cluster?
Hi AgitatedDove14, I have spent some time going through the helm charts, but I admit it is still not clear to me how things should work.
I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested to use).
Hi Jake, sorry I left the office yesterday. On my laptop I have clearml==1.6.4
Yes, I still see those errors, but queues are working :)
Hi Martin, I admit I don't know about MIG, I'll have to ask some of our engineers.
As for the memory, yes, the reasoning is clear. The main thing we'll have to see is how to define the limits, because we have nodes with quite different resources available and this might get tricky, but I'll try and let's see what happens 🙂
We actually plan to create different queues for different types of workloads; we are still looking at what the actual usage is, to define what types of workloads make sense for us.
Super!!! many thanks CostlyFox64 !
but I don't understand the comment on GPUs, as the documentation makes a lot of references to GPU configurations for agents
but I can confirm that adding the requirement with Task.add_requirements() does the trick
Next week I can take some screenshots if you need them, I just closed the laptop and will be off for a couple of days :))
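For anyone hitting the same thing, this is roughly the shape of the fix. My guess (an assumption, not something confirmed in this thread) is that the remote environment only installs packages found through the script's own imports, so a package that is used only indirectly gets missed. As far as I know, `Task.add_requirements()` has to be called before `Task.init()`. The helper name and the `task_cls` argument are hypothetical and only there so the sketch can be tried with a stub.

```python
from typing import Any, Iterable


def init_with_requirements(
    project_name: str,
    task_name: str,
    packages: Iterable[str],
    task_cls: Any = None,
):
    """Register extra packages, then initialise the task."""
    if task_cls is None:
        # Imported lazily so the helper can be exercised with a stub class.
        from clearml import Task as task_cls

    # Requirements are registered first, before the task is initialised.
    for package in packages:
        task_cls.add_requirements(package)

    return task_cls.init(project_name=project_name, task_name=task_name)
```

In my case the call would be `init_with_requirements("<project>", "<task>", ["pandas"])`, with the project and task names as placeholders.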
Hi SuccessfulKoala55, I can confirm that the "id-like" queue created by ClearML actually corresponds to the id of the queue "k8s_scheduler". So it looks like, instead of the experiment being submitted to the scheduler to be enqueued to the right queue, a new queue whose name is the id of k8s_scheduler is created.
Hope this helps 🙂
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
using the --set you advised above, right?
is there a way I can check whether the apiserver is reachable?
(like: https://clearml-apiserver.ds.bumble.dev/health http://ds.bumble.dev/health )
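Something like the following is what I have in mind for the check. It is a minimal sketch: the default `/debug.ping` path is an assumption on my side (the `/health` path above is also just a guess), so substitute whatever endpoint the apiserver actually exposes. The `opener` argument is hypothetical and only exists so the function can be tried without a network.

```python
import urllib.request
from typing import Callable


def is_reachable(
    base_url: str,
    path: str = "/debug.ping",  # assumed endpoint, not confirmed in this thread
    timeout: float = 5.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if base_url + path answers with a 2xx status."""
    url = base_url.rstrip("/") + path
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so DNS failures,
        # refused connections and 4xx/5xx answers all end up here.
        return False
```

For example, `is_reachable("https://clearml-apiserver.ds.bumble.dev")` would tell me whether the apiserver answers from wherever the agent runs.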
Thanks SuccessfulKoala55. Any idea why going to the address https://allegroai.github.io/clearml-helm-charts returns a 404 error?
For other repositories used in Argo CD examples (e.g. https://bitnami-labs.github.io/sealed-secrets , which is also hosted on GitHub), the index.yaml page is loaded instead of a 404.
I suspect this might be the reason why I can't make it work with ClearML.
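One way to narrow this down: a Helm client reads `index.yaml` under the repository URL, so checking that file directly says more than opening the root page in a browser. Below is a minimal sketch of that check. The function names and the `opener` argument are hypothetical, and I am assuming Argo CD fetches the index the same way as the helm CLI.

```python
import urllib.request
from typing import Callable


def helm_index_url(repo_url: str) -> str:
    """Return the index.yaml location for a Helm repository URL."""
    return repo_url.rstrip("/") + "/index.yaml"


def repo_serves_index(
    repo_url: str,
    timeout: float = 10.0,
    opener: Callable = urllib.request.urlopen,
) -> bool:
    """Return True if the repository answers 200 for its index.yaml."""
    try:
        with opener(helm_index_url(repo_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors such as 404 as well as connection failures.
        return False
```

If `repo_serves_index("https://allegroai.github.io/clearml-helm-charts")` returns True, the 404 on the root page is probably only the missing redirect file and not the actual cause.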