Hi everyone, I'm experiencing an issue with ClearML running on K8s. After upgrading the ClearML server Helm chart from version 7.11.5, I'm seeing the following errors

Answered

Hi everyone,
I'm experiencing an issue with ClearML running on K8s. After upgrading the ClearML server Helm chart from version 7.11.5, I'm seeing the following errors:

In the agent:

[2025-02-16 17:26:45,889] [9] [WARNING] [clearml.service_repo] Returned 400 for tasks.enqueue in 4ms, msg=Validation error (Cannot skip setting execution queue for a task that is not enqueued or does not have execution queue set)

In the clearml-server-api-pod:

[2025-02-16 19:13:45,658] [9] [WARNING] [clearml.service_repo] Returned 400 for queues.remove_task in 3ms, msg=Invalid queue id or task not in queue: task=37c6ce31c53d449994f7c9096c26d6f7, id=99d6dd77e67a4f12bb5de901596e0e1e, company=d1bd92a3b039400cbafc60a7a5b1e52b

[2025-02-16 19:13:45,669] [9] [WARNING] [clearml.service_repo] Returned 400 for tasks.enqueue in 3ms, msg=Validation error (Cannot skip setting execution queue for a task that is not enqueued or does not have execution queue set)

I've tried several versions, including the latest 7.14.2, but the error persists. For testing, I'm using a simple pipeline:

import clearml
from clearml import PipelineController

pipe = PipelineController(
    name='simple-pipeline',
    project='hello-world-project',
    version='1.0.0',
)

pipe.set_default_execution_queue('default')

def say_hello():
    print("Hello World!")
    return {"message": "Hello World!"}

pipe.add_function_step(
    name='hello-step',
    function=say_hello,
    function_return=['hello_result']
)

pipe.start(queue='default')

I believe the issue lies with the clearml-apiserver. When I downgrade the clearml-apiserver image in the helm chart back to version 1.16.2-502, the agent successfully picks up the job.

Additional information:

My Kubernetes version is 29.2.10
I've reproduced this issue on other K8s versions
The problem persists even when using the default values.yaml

  				
Posted one month ago by WorriedSwan6


Answers 8

Might be this None

  				
Posted one month ago by WobblyFrog79

Will do

  				
Posted one month ago by WobblyFrog79

Hi WorriedSwan6

On a different issue, have you any solution on how to make the agent listen to multiple queues?

Each agent is associated with one type of queue, which represents the kind of job that agent will create. You can connect multiple queues to it, and it will pull from all of them, creating the same "type" of job regardless of which queue the task came from. If you want a different kind of job to be created, just spin up another agent. There is no limit to the number of agents you can spin up in the cluster (they do not actually require a lot of resources, they sleep most of the time 🙂 )
Is this what you had in mind?

  				
Posted one month ago by AgitatedDove14

Hey WobblyFrog79, yes, testing this locally, it does seem to solve the issue, thank you.
I will test it in our env.

On a different issue, do you have any solution for making the agent listen to multiple queues?
The Helm chart says:

  # -- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3

But this does not work, as the agent reads them all as one queue.
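To make the reported behaviour concrete, here is a minimal sketch of the difference between treating the documented value as one queue name and splitting it into separate names. This is not the chart's or the agent's actual code; `parse_queue_setting` is a hypothetical helper used only for illustration.

```python
from typing import List


def parse_queue_setting(raw: str) -> List[str]:
    """Split a comma-separated queue setting into individual queue names.

    Hypothetical helper: it only illustrates what the documented
    "queue1,queue2,queue3" format suggests should happen.
    """
    return [name.strip() for name in raw.split(",") if name.strip()]


setting = "queue1,queue2,queue3"

# Reported behaviour: the whole string is used as a single queue name.
as_single_name = [setting]

# Behaviour the chart comment suggests: three separate queues.
as_separate_names = parse_queue_setting(setting)

print(as_single_name)
print(as_separate_names)
```

If the agent logs show it polling a queue literally named "queue1,queue2,queue3", that would match the first case.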

  				
Posted one month ago by WorriedSwan6

This hasn’t worked for me either; I use multiple queues instead. Another reason I use multiple queues is that I need to specify different resource requirements for the pods launched by each queue (CPU-only vs GPU).

  				
Posted one month ago by WobblyFrog79

Hey Martin, do you know how to connect the agent to multiple queues?

  				
Posted one month ago by WorriedSwan6

AgitatedDove14, for me it hasn’t worked when I specified agentk8sglue.queue: "queue1,queue2" in the Helm chart options, which should be possible according to the documentation. The flag for creating a queue if it doesn’t exist ( agentk8sglue.createQueueIfNotExists ) hasn’t worked either. Both failed parsing at runtime, so those are 2 bugs, I’d say.

  				
Posted one month ago by WobblyFrog79

Hmm, yes, it should create the queue if it's missing (btw, you could work around that by creating it in the UI). Any chance you could open a GitHub issue in the clearml helm chart repo so we do not forget?
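As an alternative to creating the queue in the UI, the same workaround could probably be scripted. The sketch below is an assumption, not something confirmed in this thread: `ensure_queue` is a hypothetical helper, and it assumes the client object exposes `queues.get_all(name=...)` and `queues.create(name=...)` returning objects with `name` and `id` attributes, as ClearML's `APIClient` is understood to do.

```python
def ensure_queue(client, name):
    """Return the id of the queue called `name`, creating it if it is missing.

    `client` is assumed to look like clearml's APIClient, i.e. to offer
    client.queues.get_all(name=...) and client.queues.create(name=...).
    """
    # get_all(name=...) is assumed to do pattern matching rather than an
    # exact comparison, so filter the results for an exact name match.
    existing = [q for q in client.queues.get_all(name=name) if q.name == name]
    if existing:
        return existing[0].id
    # No exact match was found: create the queue and return the new id.
    return client.queues.create(name=name).id
```

With a configured ClearML environment, this would presumably be called as `ensure_queue(APIClient(), "queue1")`, with `APIClient` imported from `clearml.backend_api.session.client`. Running it once per queue before deploying the agent would sidestep the createQueueIfNotExists flag until the chart issue is resolved.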

  				
Posted one month ago by AgitatedDove14
