
Hi Jake, unfortunately I realized we put a load balancer in front, so any address like address.domain would answer a ping
Yes, I still see those errors, but queues are working :)
Martin I told you I can't access the resources in the cluster unfortunately
and one more question, in the values, I also see the values for the default tokens:
```yaml
credentials:
  apiserver:
    # -- Set for apiserver_key field
    accessKey: "5442F3443MJMORWZA3ZH"
    # -- Set for apiserver_secret field
    secretKey: "BxapIRo9ZINi8x25CRxz8Wdmr2pQjzuWVB4PNASZqCtTyWgWVQ"
  tests:
    # -- Set for tests_user_key field
    accessKey: "ENP39EQM4SLACGD5FXB7"
    # -- Set for tests_user_secret field
    secretKey: "lPcm0imbcBZ8mwgO7tpadutiS3gnJD05x9j7a...
```
And I see that it is moved to the k8s_scheduler one instead (though I see that in the "default" queue I do have jobs)
thanks for the help!
Hi Jack, yes we had to customize the default one for some tools we use internally
thanks, yes it makes sense!
ah I see, I'll give it a try then
but for sigterm you should be able to set cleanup steps, no?
Hi SuccessfulKoala55, I can confirm that the "id-like" queue created by ClearML actually corresponds to the id of the "k8s_scheduler" queue. So it looks like, instead of the experiment being submitted to the scheduler and then enqueued to the right queue, a new queue whose name corresponds to the id of k8s_scheduler is created instead.
Hope this helps 🙂
What I still don't get is how you would create different queues targeting different nodes with different GPUs, and have each of them use the appropriate CUDA image.
Looking at the template, I don't understand how that's possible.
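Just to make it concrete, this is the kind of thing I'm trying to express (purely a sketch of the idea; the field names below are my guesses, not the actual chart schema):
```yaml
# hypothetical sketch only: one agent "flavor" per GPU pool, each with its own
# queue, node selection and CUDA base image; the real chart keys may differ
agents:
  - queue: gpu-a100
    image: nvidia/cuda:11.6.2-runtime-ubuntu20.04   # example CUDA image
    nodeSelector:
      gpu-type: a100
  - queue: gpu-t4
    image: nvidia/cuda:11.2.2-runtime-ubuntu20.04   # example CUDA image
    nodeSelector:
      gpu-type: t4
```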
thanks a lot! So as long as we have the storageclass in our kubernetes cluster configured correctly, the new helm chart should work out of the box?
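(For our part, what I mean by configured correctly is simply having a default StorageClass the chart's PVCs can bind to, e.g. something along these lines, with the provisioner swapped for whatever your cluster uses:)
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"   # make it the default
provisioner: kubernetes.io/aws-ebs   # example only; use your cluster's provisioner
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
```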
I can see the outputs from argo, so I know if some resource has been created but I can't inspect the full logs,
the ones I have available are all records similar to `No tasks in queue 80247f703053470fa60718b4dff7a576`
Thanks CostlyOstrich36. I was thinking more of an environment setting; for example, the documentation mentions the "--cpu-only" flag (which I am not sure I can use, since I am using the Helm charts from AllegroAI and I don't think I can override the command), or setting the env var NVIDIA_VISIBLE_DEVICES to an empty string (which I did, but I can still see the message)
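For reference, this is roughly the shape of what I tried for the env var (the exact values path for extra environment variables in the agent chart is my assumption, I'd have to check it):
```yaml
# hypothetical sketch: add NVIDIA_VISIBLE_DEVICES="" to the agent pod(s)
extraEnvs:
  - name: NVIDIA_VISIBLE_DEVICES
    value: ""
```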
And yes, I am using the agents that come with the Helm chart from Clearml repository
As much as possible, I'd like to take the burden off the shoulders of the people writing their models
Hi Josh, the agents are running on top of K8s (I used the helm chart to deploy them, it uses K8s glue).
I'll add a sleep so that I have time to enter the pod, and get the clearml.conf and will send you the diff in a few minutes
because while I can run kubectl commands from within the agent pod, clearml doesn't seem to pick the right value:
```
2022-08-05 12:09:47
task 29f1645fbe1a4bb29898b1e71a8b1489 pulled from 51f5309bfb1940acb514d64931ffddb9 by worker k8s-agent-cpu
2022-08-05 12:12:59
Running kubectl encountered an error: Unable to connect to the server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2022-08-05 15:15:07
task 29f1645fbe1a4bb29898b1e71a8b1489...
```
Effectively kubectl commands don't work from within the agent pod, I'll try to figure out why
yes, the curl returned a 503 error
Thanks for pitching in JuicyFox94. For the connectivity, I used the "public" names for the various servers
(e.g. we set clearml.internal.domain.name, clearml-apiserver.internal.domain.name and clearml-fileserver.internal.domain.name)
So in the agent values.yaml I set the parameters:
```yaml
# -- Reference to Api server url
apiServerUrlReference: " "
# -- Reference to File server url
fileServerUrlReference: " "
# -- Reference to Web server url
webServerUrlReference: " "
```
to ...
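With the internal hostnames from above filled in, it looks roughly like this (the scheme and the exact fileserver hostname are illustrative, use whatever you actually exposed):
```yaml
# -- Reference to Api server url
apiServerUrlReference: "http://clearml-apiserver.internal.domain.name"
# -- Reference to File server url
fileServerUrlReference: "http://clearml-fileserver.internal.domain.name"
# -- Reference to Web server url
webServerUrlReference: "http://clearml.internal.domain.name"
```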
The workaround that works for me is:
- clone the experiment that I run on my laptop
- in the newly cloned experiment, modify the hyperparameters and configurations to my need
- in user properties, set "k8s-queue" to "cpu" (or the name of the queue I want to use)
- enqueue the experiment to the same queue I just set...
When I do like that in the K8sGlue pod for the cpu queue I can see that it has been correctly picked up:
```
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping fo...
```
Hi Martin, thanks. My doubt is:
- if I configure the pods for the different nodes manually, how do I make the ClearML server aware that those agents exist? This step is really not clear to me from the documentation (it talks about a user, and it uses interactive commands, which would mean entering the agents manually)
- I will also try the k8s glue, but I would first like to understand how to configure a fixed number of agents manually (rough sketch of what I mean below)
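Something like this is what I have in my head (completely hypothetical values, just to illustrate what I mean by "a fixed number of agents"):
```yaml
# hypothetical, not the actual chart schema: N long-lived agent pods
# polling a single queue is what I mean by "a fixed number of agents"
agent:
  replicaCount: 3     # three always-running agent pods
  queues:
    - default         # the queue they poll; polling is how the server "sees" them
```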
JuicyFox94 apparently to make it work I'll have to add a "kubeconfig" file, but I can't see any obvious way to mount it in the agent pod, am I wrong?
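What I had in mind was something like mounting it from a Secret, e.g. (the extraVolumes/extraVolumeMounts keys are my assumption about the chart, and agent-kubeconfig is a Secret I'd create myself):
```yaml
# hypothetical sketch: mount a kubeconfig stored in a Secret into the agent pod
extraVolumes:
  - name: kubeconfig
    secret:
      secretName: agent-kubeconfig   # Secret with a "config" key holding the kubeconfig
extraVolumeMounts:
  - name: kubeconfig
    mountPath: /root/.kube
    readOnly: true
```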
I think it's because the proxy env vars are not passed to the container (I thought they were the same as the extraArgs from the agentservice, but it doesn't look like that's the case)
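In other words, I'd expect to need something along these lines on the agent container itself (key path and proxy host are placeholders; the important part is that NO_PROXY covers the in-cluster API server):
```yaml
# hypothetical sketch: pass the proxy settings to the agent container explicitly
extraEnvs:
  - name: HTTP_PROXY
    value: "http://proxy.internal.domain.name:3128"   # placeholder proxy
  - name: HTTPS_PROXY
    value: "http://proxy.internal.domain.name:3128"
  - name: NO_PROXY
    value: "kubernetes.default.svc,10.0.0.0/8,.internal.domain.name"
```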
Thanks Martin, so if I understand correctly, when I run the clearml-agent init command (I have to check the syntax) and provide the apiserver, webserver and fileserver URLs, the agents will be registered with the ClearML cluster?