
I can ping it without issues, but I am not sure the communication is configured correctly
I actually found out it was an indentation error 😅 and the credentials weren't picked up
Hi Martin, I admit I don't know about MIG; I'll have to ask some of our engineers.
As for the memory, yes, the reasoning is clear; the main thing we'll have to figure out is how to define the limits, because we have nodes with quite different resources available and this might get tricky, but I'll try and see what happens 🙂
We actually plan to create different queues for different types of workloads; for now we are observing the actual usage to define which types of workloads make sense for us.
especially for datasets (for the models and other files we were thinking of using the fileserver anyway)
but I don't understand the comment on GPUs, as the documentation makes a lot of references to GPU configurations for agents
If I now reset the experiment and enqueue it to the gpu queue (but in the experiment, the user-properties configuration for the k8s-glue is still set to cpu), the experiment is left in a Pending state... and in the K8sGlue agent for the gpu queue I can see a similar error to the one in the cpu agent....
` No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call...
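For reference, the reset-and-enqueue flow described above can also be done from the SDK; this is only a minimal sketch, assuming Task.get_task / Task.reset / Task.enqueue from the clearml package (the experiment ID is a placeholder):
```python
from clearml import Task

# Fetch the experiment (the ID is a placeholder).
task = Task.get_task(task_id="<experiment-id>")

# Reset it so it can run again, then enqueue it to the gpu queue.
task.reset()
Task.enqueue(task, queue_name="gpu")
```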
Thanks Martin! If I end up having some time I'll dig into the code and check if I can bake something!
OK, so... when executed locally "train" prints:
` train:
     SepalLength  SepalWidth  PetalLength  PetalWidth  Species
122          7.7         2.8          6.7         2.0      2.0
86           6.7         3.1          4.7         1.5      1.0
59           5.2         2.7          3.9         1.4      1.0
4            5.0         3.6          1.4         0.2      0.0
77           6.7         3.0          5.0         1.7      1.0
..           ...         ...          ...         ...      ...
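For context, this is roughly how such a "train" artifact and printout could be produced locally; only a sketch, with placeholder project/task names and CSV path, using clearml's upload_artifact:
```python
import pandas as pd
from clearml import Task

task = Task.init(project_name="demo", task_name="iris-train")  # placeholder names

# Read the iris data and register it as the "train" artifact.
train = pd.read_csv("iris_train.csv")  # placeholder path
task.upload_artifact(name="train", artifact_object=train)

print("train:")
print(train)
```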
Hi Martin, thanks for the explanation! I work with Maggie and help with the ClearML setup.
Just to be sure, currently the PodTemplate contains:
resources:
  limits:
    nvidia.com/gpu: 1
and you are suggesting to also add something like:
  requests:
    memory: "100Mi"
  limits:
    memory: "200Mi"
is that correct?
On a related note, I am a bit puzzled by the fact that all 4 GPUs are visible.
In the https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/ , i...
Oh I see... for some reason I thought that all the dependencies of the environment would be tracked by ClearML, but it's only the ones that actually get imported...
If, locally, it is detected that pandas is installed and can be used to read the CSV, wouldn't it be possible to store this information in the ClearML server so that it can be implicitly added to the requirements?
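In the meantime, one way to make sure pandas ends up in the remote environment is to declare it explicitly; a minimal sketch assuming clearml's Task.add_requirements (called before Task.init), with placeholder names:
```python
from clearml import Task

# Explicitly add pandas to the task requirements, so the remote run
# installs it even though the script never imports it directly.
Task.add_requirements("pandas")

task = Task.init(project_name="demo", task_name="iris-train")  # placeholder names
```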
1. no, there's no task with the name cpu or gpu... Where can I find the id of the queue to check?
2. what do you mean by initial log dumps, the very early rows when it's being deployed?
Anyway, sure, I can send it to you, but I just turned off my laptop :) and won't be able to for a few days.
Yes, I still see those errors, but queues are working :)
My understanding is that in Task.init you have a reuse_last_task_id argument (or a similar name) that defaults to True. In that case, if your experiment wasn't "published" it will be overwritten (based on project and experiment name). However, if you do publish it, a new experiment will be created.
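A minimal sketch of that behaviour, assuming the reuse_last_task_id argument of Task.init (project and task names are placeholders):
```python
from clearml import Task

# With the defaults, an unpublished task with the same project/task name
# may be reused (overwritten) on the next run.
# Passing reuse_last_task_id=False forces a brand-new experiment instead.
task = Task.init(
    project_name="demo",      # placeholder
    task_name="iris-train",   # placeholder
    reuse_last_task_id=False,
)
```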
Thanks Martin. I'll add this and check whether it fixes the issue, but I still don't quite get this: the local code doesn't need to import pandas, because the get method returns a DataFrame object that has a .loc method.
I was expecting the remote experiment to behave similarly, so why do I need to import pandas there?
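For context, this is roughly the pattern under discussion; a sketch assuming the artifact is called "train" and was uploaded as a pandas DataFrame (the task ID is a placeholder). The .get() call has to rebuild the DataFrame, so pandas must be installed in the executing environment; since the script never imports pandas, ClearML's import-based requirements detection doesn't add it for the remote run, which is why it has to be added explicitly.
```python
from clearml import Task

# Fetch the producing task and read back the "train" artifact.
source_task = Task.get_task(task_id="<producing-task-id>")  # placeholder ID
train = source_task.artifacts["train"].get()  # returns a pandas DataFrame

# .loc works locally because pandas is already installed there;
# the remote environment needs pandas too for .get() to succeed.
print(train.loc[:, "Species"].head())
```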
Indeed, kubectl commands don't work from within the agent pod; I'll try to figure out why
do I need something else in the clearml.conf?
is there a way I can check whether the apiserver is reachable?
(like: https://clearml-apiserver.ds.bumble.dev/health )
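One quick check from inside the pod (or from a laptop) is simply to hit the endpoint and look at the status code; a sketch using the requests library and the URL mentioned above (whether /health is exposed depends on your ingress setup):
```python
import requests

# URL taken from the message above; adjust to your deployment.
url = "https://clearml-apiserver.ds.bumble.dev/health"

try:
    response = requests.get(url, timeout=5)
    # 200 means the apiserver answered; 503 usually points at the service/ingress.
    print(url, "->", response.status_code)
except requests.exceptions.RequestException as exc:
    print("apiserver not reachable:", exc)
```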
Exactly that :) if I go to the Queues tab, I see a new queue (that I didn't create) with a name like "4gh637aqetc"
ah I see, I'll give it a try then
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
You mean as output target for artifacts?
For example, for some of our models we create pdf reports, that we save in a folder in the NFS disk.
Next week I can take some screenshots if you need them; I just closed the laptop and will be off for a couple of days :))
The workaround that works for me is:
1. clone the experiment that I ran on my laptop
2. in the newly cloned experiment, modify the hyperparameters and configurations to my needs
3. in user properties, set "k8s-queue" to "cpu" (or the name of the queue I want to use)
4. enqueue the experiment to the same queue I just set... (a code sketch of these steps follows the log below)
When I do it like that, in the K8sGlue pod for the cpu queue I can see that it has been correctly picked up:
` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping fo...
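The same workaround can also be scripted; a sketch assuming clearml's Task.clone, set_user_properties and Task.enqueue (the experiment ID and names are placeholders, and the "k8s-queue" property simply mirrors what is set in the UI):
```python
from clearml import Task

# Clone the experiment that ran locally (the ID is a placeholder).
template = Task.get_task(task_id="<local-experiment-id>")
cloned = Task.clone(source_task=template, name="iris-train (cpu)")

# Hyperparameters/configurations could be edited on `cloned` here.

# Mirror the "k8s-queue" user property set in the UI
# (dict unpacking because the property name contains a dash).
cloned.set_user_properties(**{"k8s-queue": "cpu"})

# Enqueue the clone to the matching queue.
Task.enqueue(cloned, queue_name="cpu")
```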
yes, the curl returned a 503 error