Reputation
Badges 1
137 × Eureka!Hi Alon, thanks, I actually watched those videos. But they don't help with settings things up 🙂
From your explanation, I understand that Agents are indeed needed for ClearML to work.
Effectively kubectl commands don't work from within the agent pod, I'll try to figure out why
I have tried this several time and the behaviour is always the same. It looks like when I modify some hyperparameter, when I enqueue the experiment to one queue, things don't work if I didn't make sure to have previously set the value of k8s-queue to the name of the queue that I want to use. If I don't modify the configuration (e.g. I abort, or reset the job and enqueue it again, or clone and enqueue it without modifying the hyperparameters) then everything works as expected.
Thanks SuccessfulKoala55 . Any idea why going to the address https://allegroai.github.io/clearml-helm-charts
returns a 404 error?
Other repositories that are used in Argo CD examples (e.g. https://bitnami-labs.github.io/sealed-secrets , which is also hosted on Github) instead of returning a 404, the index.yaml page is loaded instead.
I suspect this might be the reason why I can't make it work with ClearML.
perfect, let me try 🙂 (thanks a lot for all the help!)
Hi AgitatedDove14 I have spent some time going through the helm charts but I admit I still haven't clear how things should work.
I see that with the default values (mostly what I am using), the K8s Glue agent is deployed (which is what you suggested to use).
many thanks 🙂 I am going to play with ClearML a little bit and re-read carefully the thread to learn something from what you made me do today!
Hi SuccessfulKoala55 I can confirm that the "id-like" queue created by ClearML
actually correspond to the id of queue "k8s_scheduler" (so it looks like that instead of submitting the experiment to the scheduler to be enqueued to the right queue), a new queue whose name corresponds to the id of the k8s_scheduler is created instead.
Hope this helps 🙂
now, I go to experiment, clone an experiment that I previously executed on my laptop. In the newly created experiment, I modify some parameter, and enqueue the experiment in the CPU queue.
Before any experiment enqueueing, theare are the queue I have available
the experiment is supposed tu run in this queue, but then it hangs in a pending scheduler
Next week I can take some screenshots if you need them, ai just closed the laptop and will be off for a couple of days :))
Exactly that :) if I go in the queue tab, I see a new queue name (that I didn't create),
with a name like "4gh637aqetc"
also, if I clone an experiment on wich I had to set the k8s-queue user property manually to run experiments on a queue, say cpu, and enqueue it to a different queue, say gpu, the property is not updated, and the experiment is enqueued in a queue with a random hash like name. I either have to delete the attribute, or set it to the right queue name, before enqueuing it, to have it run in the right queue
Yes, the queue is created when I enqueue the experiment. I took some screenshots, and got the logs (there is an error effectively).
Let me share them with you...
and in the logs of the K8s Glue I see an exception occurred:
` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", l...
I guess to achieve what I want, I could disable the agent using the helm chart values.yaml
and then define pods for each of the agent on their respective nodes
Ah sorry, I thought what where the names of the queue I created like (in case I used some weird character or stuff like that)
no, there's no task with a name of cpu or gpu... Where can I find the id of the queue to check?2. what do you mean by initial log dumps, the very early row when it's being deployed?
Anyway, sure I can send it to you, but I just turned off my laptop :) and won't be able for a few days.
I am not aware of how clearml-dataset works, but I'll have a look 🙂
ah I see, I'll give it a try then
I see... because the problem it would be with permissions when creating artifacts to store in the "/shared" folder
great! thanks a lot!
As much as possible, I'd like removing the burden off the shoulders of people writing their models