Hi Folks, I Did A Deployment Of Clearml Using The K8S Helm Chart, And I Set The Agent Using K8S Glue. I Run A Task Locally, And I Went To The Ui Cloned The Experiment And Scheduled It In The Default Queue. After Doing This, I See That The Experiment Is Q

Answered

Hi folks, I deployed ClearML using the K8s helm chart, and I set up the agent using the K8s glue.

I ran a task locally, then went to the UI, cloned the experiment, and scheduled it in the default queue.
After doing this, I see that the experiment is queued in the "k8s_scheduler" queue and stays in a Pending state.

Any idea what might be causing the issue?

  				
Posted 3 years ago by SarcasticSquirrel56


Answers 31

SarcasticSquirrel56 I think that can be quite easily added - can you please open a GitHub issue in https://github.com/allegroai/clearml-helm-charts ?

  				
Posted 3 years ago by SuccessfulKoala55

so I assume clearml moves them from one queue to the other?

Correct. When the k8s glue creates the k8s job and launches it on the cluster, it moves the Task into the "k8s_scheduler" queue.
Can you see it on your k8s cluster (meaning the job/pod)?

  				
Posted 3 years ago by AgitatedDove14

Hi Martin, I'll try to get the logs on Monday. The K8s configuration doesn't "scare" me, though; I can solve that with my colleagues.
But I'll share the logs if that helps debug the issue.

  				
Posted 3 years ago by SarcasticSquirrel56

But as Gaspard was saying, with the default settings there is no agent listening to the "k8s_scheduler" queue.

  				
Posted 3 years ago by SarcasticSquirrel56

Okay, this points to an issue with the k8s glue; I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue?

  				
Posted 3 years ago by AgitatedDove14

Yes, I added it to the "default" queue (which is the one used in the config file for the k8s glue agent).

  				
Posted 3 years ago by SarcasticSquirrel56

Pretty confusing that neither services nor scheduled work out of the box

StickyLizard47 basically this is how a services queue agent should be spun up:
https://github.com/allegroai/clearml-server/blob/9b108740da21f25407bd2c59583ca1c86f8e1faa/docker/docker-compose.yml#L123
When spinning it up on a k8s cluster, this is a bit more complicated, as it needs to work with the clearml-k8s-glue.
See here for how to spin it up on k8s:
https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue

  				
Posted 3 years ago by AgitatedDove14

so I assume clearml moves them from one queue to the other?

  				
Posted 3 years ago by SarcasticSquirrel56

Martin I told you I can't access the resources in the cluster unfortunately

  				
Posted 3 years ago by SarcasticSquirrel56

I can see the outputs from argo, so I know if some resource has been created but I can't inspect the full logs,
the ones I have available are all records similar to
No tasks in queue 80247f703053470fa60718b4dff7a576

  				
Posted 3 years ago by SarcasticSquirrel56

sounds good 🙂

  				
Posted 3 years ago by AgitatedDove14

And I see that it is moved to the k8s_scheduler one instead (though I see that in the "default" queue I do have jobs)

  				
Posted 3 years ago by SarcasticSquirrel56

After trying Gaspard's changes to the helm chart values, I now see that a pod for the agentservice is deployed as well.
Some of the logs point to a misconfiguration on my side (the fact that it can't access resources externally),
and some others I don't understand:
Err:1 bionic InRelease
  Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out
Err:2 bionic-updates InRelease
  Unable to connect to archive.ubuntu.com:http:
Err:3 bionic-backports InRelease
  Unable to connect to archive.ubuntu.com:http:
Err:4 bionic-security InRelease
  Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out
  Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out
Reading package lists...
W: Failed to fetch Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out
W: Failed to fetch Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package curl
E: Unable to locate package python3-pip
E: Unable to locate package git
/bin/sh: 1: python3: not found
/bin/sh: 1: python3: not found
/bin/sh: 1: clearml-agent: not found
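
A log like the one above can be hard to scan by eye. The following is a minimal sketch, not part of ClearML or apt, that pulls the distinct unreachable hostnames out of such output. The function name `unreachable_hosts` and the regular expression are illustrative assumptions based only on the two message shapes visible in this log.

```python
import re
from typing import List

# Matches the two failure phrasings seen in the log above, e.g.
#   "Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out"
#   "Unable to connect to archive.ubuntu.com:http:"
_HOST_RE = re.compile(r"(?:Could not|Unable to) connect to ([A-Za-z0-9.-]+):")


def unreachable_hosts(log_text: str) -> List[str]:
    """Return the distinct hostnames that could not be reached, in first-seen order."""
    seen = {}
    for host in _HOST_RE.findall(log_text):
        seen.setdefault(host, None)  # dict keeps insertion order and drops repeats
    return list(seen)


# A shortened excerpt of the log from this thread.
sample = (
    "Err:1 bionic InRelease Could not connect to archive.ubuntu.com:80 "
    "(185.125.190.36), connection timed out "
    "Err:2 bionic-updates InRelease Unable to connect to archive.ubuntu.com:http: "
    "Err:4 bionic-security InRelease Could not connect to security.ubuntu.com:80 "
    "(185.125.190.36), connection timed out"
)
print(unreachable_hosts(sample))
```

If every host in the result is an external package mirror, the later "Unable to locate package" and "not found" lines are most likely a consequence of the failed index download rather than a separate problem.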

  				
Posted 3 years ago by SarcasticSquirrel56

Click on the "k8s_scheduler" queue; on the right-hand side you should see your Task. Click on it to open the Task page, then open the "Info" tab and look for "STATUS MESSAGE" and "STATUS REASON". What do you have there?

  				
Posted 3 years ago by AgitatedDove14

think it's because the proxy env var are not passed to the container ...

Yes, this seems correct. The errors point to a network issue, i.e. the container does not seem to be able to connect to the clearml-server.

  				
Posted 3 years ago by AgitatedDove14

In the ClearML UI it stays in a Pending state.

  				
Posted 3 years ago by SarcasticSquirrel56

Yes, on the GitHub repo readme there is a brief description of the k8s-glue, but that's it.
Pretty confusing that neither services nor scheduled work out of the box to be honest

  				
Posted 3 years ago by StickyLizard47

StickyLizard47 apologies for the https://github.com/allegroai/clearml-server/issues/140 not being followed (probably slipped through the cracks of backend guys, I can see the 1.5 release happened in parallel). Let me make sure it is followed.
SarcasticSquirrel56 specifically, did you also spin a clearml-k8s glue? or are the agents statically allocated on the helm chart?

  				
Posted 3 years ago by AgitatedDove14

AgitatedDove14 I used the default configuration from the helm chart for the k8s glue.
The way I understand it, the K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent).

  				
Posted 3 years ago by SarcasticSquirrel56

By the way, after fixing the agentservice issue and getting that pod configured correctly, I now see an error in the agentgroup-cpu pod saying that the token is not the correct one:

http://:8081 http://:8080
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b00438d0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043a20>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043c18>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043d68>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043eb8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:
Using environment access key CLEARML_API_ACCESS_KEY=ENP39EQM4SLACGD5FXB7
Using environment secret key CLEARML_API_SECRET_KEY=********
clearml_agent: ERROR: Failed getting token (error 401 from ): Unauthorized (invalid credentials) (failed to locate provided credentials)
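
The URLs printed at the top of this log have no hostname. That may simply be redaction in the pasted output, but if the pod really receives endpoints like these, it is worth catching before debugging credentials. The sketch below is an illustration using only the Python standard library; `missing_host` is a hypothetical helper and `clearml-apiserver` is a made-up service name, neither comes from ClearML or the helm chart.

```python
from urllib.parse import urlparse


def missing_host(urls):
    """Return the URLs that have no hostname component."""
    return [url for url in urls if not urlparse(url).hostname]


endpoints = [
    "http://:8081",                   # as printed in the log above
    "http://:8080",                   # as printed in the log above
    "http://clearml-apiserver:8008",  # hypothetical, well-formed endpoint
]
print(missing_host(endpoints))
```

Any URL this returns would need to be fixed in the values passed to the agent before the 401 itself can be interpreted.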

  				
Posted 3 years ago by SarcasticSquirrel56

I will!

  				
Posted 3 years ago by SarcasticSquirrel56

I think it's because the proxy env vars are not passed to the container (I thought they were the same as the extraArgs from the agentservice, but it doesn't look like that's the case).

  				
Posted 3 years ago by SarcasticSquirrel56

If you find out more or get an official explanation, please keep me posted 🙂

  				
Posted 3 years ago by StickyLizard47

I see it in the UI...

  				
Posted 3 years ago by SarcasticSquirrel56

Martin I told you I can't access the resources in the cluster unfortunately

😞
so it seems there is some misconfiguration of the k8s glue, because we can see it can "talk" to the clearml-server, but it seems it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod, and better understand the issue.
wdyt?

  				
Posted 3 years ago by AgitatedDove14

This is good news: it means the k8s glue created a k8s job and pushed the Task into the "k8s_scheduler" queue for visibility (i.e. it is now k8s's job to launch the pod).
Can you check the status/message on the Task Info tab? (It should reflect the k8s pod status.)
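
The queue hand-off described here can be pictured with a toy model. This is only a sketch of the behaviour as explained in this thread, not ClearML code: `ToyServer`, `glue_poll_once`, and `launch_job` are hypothetical names, and the real glue talks to the ClearML server and the Kubernetes API instead of in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass
class ToyServer:
    """In-memory stand-in for the server's queues (illustrative only)."""
    queues: dict = field(
        default_factory=lambda: {"default": [], "k8s_scheduler": []}
    )

    def enqueue(self, queue, task_id):
        self.queues[queue].append(task_id)


def glue_poll_once(server, watched="default", launch_job=lambda task_id: None):
    """One polling step: pop a task, create its k8s job, park it in k8s_scheduler."""
    if not server.queues[watched]:
        return None  # comparable to the "No tasks in queue ..." log records
    task_id = server.queues[watched].pop(0)
    launch_job(task_id)  # stand-in for creating the k8s job/pod
    # No agent listens on this queue; it only shows that k8s now owns the launch.
    server.queues["k8s_scheduler"].append(task_id)
    return task_id


server = ToyServer()
server.enqueue("default", "task-1")
glue_poll_once(server)
print(server.queues)
```

In this picture, a Task that sits in "k8s_scheduler" has already been handed to the cluster, so a lasting Pending state points at the pod failing to start rather than at a missing agent.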

  				
Posted 3 years ago by AgitatedDove14

The way I understand it is that K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent

SarcasticSquirrel56
Good start. When you say you see the Task in the "k8s_scheduler" queue, did you originally enqueue it to "default"?

  				
Posted 3 years ago by AgitatedDove14

unfortunately I can't get info from the cluster

  				
Posted 3 years ago by SarcasticSquirrel56

is no agent listening to the "k8s_scheduler"

There should not be one. This queue is purely "virtual", so users understand the k8s cluster is spinning up their pod (sometimes it takes time, imagine EKS etc.; it is just for visibility).

unfortunately I can't get info from the cluster

You should be able to see the pod in the cluster, no?!
What does the Task Info panel say? Can you share a screenshot?

  				
Posted 3 years ago by AgitatedDove14

Thanks, adding environment variables to the agentservice solved it, but for the agentgroup agent, I can't see any obvious way to inject environment variables. In the helm chart template I don't see any way to pass custom environment variables to the pod
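
As an illustration of what injecting those variables amounts to, here is a minimal sketch that adds proxy settings to every container of a pod spec held as a Python dict. It assumes the deployment offers some way to supply or patch a pod template, which this thread does not confirm for the agentgroup pods. `with_proxy_env`, the proxy address, and the image name are hypothetical; only the `spec.containers[].env` layout is standard Kubernetes.

```python
import copy

PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def with_proxy_env(pod_template, proxy_values):
    """Return a copy of pod_template whose containers carry the given proxy variables."""
    result = copy.deepcopy(pod_template)  # leave the caller's template untouched
    for container in result.get("spec", {}).get("containers", []):
        env = container.setdefault("env", [])
        present = {entry["name"] for entry in env}
        for name in PROXY_VARS:
            # Only add variables that were provided and are not already set.
            if name in proxy_values and name not in present:
                env.append({"name": name, "value": proxy_values[name]})
    return result


# Hypothetical template and proxy address, for illustration only.
template = {"spec": {"containers": [{"name": "task", "image": "ubuntu:18.04"}]}}
proxy = {
    "HTTP_PROXY": "http://proxy.example:3128",
    "HTTPS_PROXY": "http://proxy.example:3128",
}
patched = with_proxy_env(template, proxy)
print(patched["spec"]["containers"][0]["env"])
```

Existing entries are left alone on purpose, so a value already set on a container takes precedence over the injected default.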

  				
Posted 3 years ago by SarcasticSquirrel56
