The queues already exist, I created them through the UI.
Ah sorry, I thought you were asking what the names of the queues I created were (in case I used some weird character or something like that)
Exactly that :) if I go to the queue tab, I see a new queue (that I didn't create),
with a name like "4gh637aqetc"
Yes, the queue is created when I enqueue the experiment. I took some screenshots and got the logs (there is indeed an error).
Let me share them with you...
If I now reset the experiment and enqueue it to the gpu queue (but in the experiment, the user-properties configuration for k8s-glue is still set to cpu), the experiment is left in a Pending state... and in the K8sGlue Agent for the gpu queue, I can see an error similar to the one in the cpu agent....
` No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 75174e0e7ac047f195ab4dce6e9f03f7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call...
and in the logs of the K8s Glue I see an exception occurred:
` No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
No tasks in queue 54d3edb05a89462faaf51e1c878cf2c7
No tasks in Queues, sleeping for 5.0 seconds
FATAL ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", l...
I have tried this several times and the behaviour is always the same. It looks like, when I modify some hyperparameter and then enqueue the experiment to a queue, things don't work unless I have previously set the value of k8s-queue to the name of the queue that I want to use. If I don't modify the configuration (e.g. I abort or reset the job and enqueue it again, or clone and enqueue it without modifying the hyperparameters), then everything works as expected.
If I now abort the experiment (which is in a pending state and not running), and re-enqueue it again, with no parameter modifications this time...
and I re-enqueue it to the CPU queue, I see that it is sent to the right queue, and after a few seconds the job enters a running state and it completes correctly
At this point, I see a new queue in the UI:
Hi SuccessfulKoala55, I can confirm that the "id-like" queue created by ClearML
actually corresponds to the id of the "k8s_scheduler" queue. So it looks like, instead of the experiment being submitted to the scheduler to be enqueued to the right queue, a new queue whose name is the id of k8s_scheduler is created.
Hope this helps 🙂
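To make the check above easier to repeat, here is a minimal sketch that scans a list of queues and reports any queue whose name is the id of another queue. It works on plain dictionaries; the `id`/`name` keys, the sample ids and the function name `find_id_named_queues` are assumptions made for illustration, not ClearML API objects.

```python
def find_id_named_queues(queues):
    """Return (stray, original) pairs where stray["name"] == original["id"]."""
    by_id = {q["id"]: q for q in queues}
    pairs = []
    for q in queues:
        original = by_id.get(q["name"])
        # A queue named after its own id would not indicate a stray copy.
        if original is not None and original["id"] != q["id"]:
            pairs.append((q, original))
    return pairs


# Hypothetical data: short made-up ids stand in for the real 32-character ones.
queues = [
    {"id": "1111", "name": "k8s_scheduler"},
    {"id": "2222", "name": "cpu"},
    {"id": "3333", "name": "gpu"},
    {"id": "4444", "name": "1111"},  # stray queue named after the scheduler's id
]

for stray, original in find_id_named_queues(queues):
    print(f'queue {stray["id"]} is named after the id of "{original["name"]}"')
```

The queue list itself would have to come from the UI or the server; that part is left out here on purpose.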
1. No, there's no task with a name of cpu or gpu... Where can I find the id of the queue to check?
2. What do you mean by initial log dumps, the very early rows when it's being deployed?
Anyway, sure I can send it to you, but I just turned off my laptop :) and won't be able to for a few days.
Thanks, in DM I sent you the conf we use to deploy the agents.
My understanding is that in Task.init you have a reuse_last_task_id argument (or similar name) that defaults to True. In that case, if your experiment wasn't "published", it will be overwritten (based on project and experiment name). However, if you do publish it, a new experiment will be created
Thanks CostlyOstrich36, I was thinking more of an environment setting. For example, the documentation mentions the "--cpu-only" flag (which I am not sure I can use, as I am using the helm charts from AllegroAI and I don't think I can override the command), or setting the env var NVIDIA_VISIBLE_DEVICES to an empty string (which I did, but I can still see the message)
As much as possible, I'd like to take the burden off the shoulders of the people writing their models
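As a rough sketch of the point above: if you want a new experiment on every run instead of overwriting the last unpublished one, you can pass reuse_last_task_id=False to Task.init. The wrapper name `init_new_task` is made up for illustration, and the import is deferred only so the snippet can be loaded without a configured ClearML server.

```python
def init_new_task(project_name, task_name):
    """Start a ClearML task that never reuses the previous run's task."""
    # Deferred import: needs the clearml package and valid credentials at call time.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        reuse_last_task_id=False,  # always create a new task instead of overwriting
    )


# Example call (requires a reachable ClearML server):
# task = init_new_task("project-no-git", "experiment-1")
```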
PunyWoodpecker71 just create a Personal Access Token and use it as the value for CLEARML_AGENT_GIT_PASS, https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
Next week I can take some screenshots if you need them, I just closed the laptop and will be off for a couple of days :))
thanks, yes it makes sense!
well there are already processes in place.. we aim at migrating everything to ClearML, but we hoped we could do it gradually
Hi Jack, yes we had to customize the default one for some tools we use internally
Yes, I still see those errors, but queues are working :)
but I was a bit thrown off track by seeing errors in the logs
I actually found out it was an indentation error 😅 and the credentials weren't being picked up
Hi Jake thanks for your answer!
So I just have a very simple file "project.py" with this content:
` from clearml import Task
task = Task.init(project_name='project-no-git', task_name='experiment-1')
import pandas as pd
print("OK") `
If I run ` python project.py ` from a folder that is not in a git repository, I can clone the task and enqueue it from the UI, and it runs in the agent with no problems.
If I copy the same file, in a folder that is in a git repository, when I enqueue the ex...
And yes, I am using the agents that come with the Helm chart from the ClearML repository