Answered
Hi Folks, I Just Deployed A ClearML Agent Using The Helm Chart. I Have A Few Doubts:

Hi folks, I just deployed a ClearML agent using the Helm chart. I have a few doubts:
After the deployment, I see a new queue called k8s_scheduler, which I didn't create. Is this normal?

In the Helm chart, I specified that I want the AgentK8sGlue to process the queue cpu. When I enqueue an experiment to this queue, I see a worker called "k8s-agent" with an IP associated, but the "currently executing" column stays empty.

How can I look at the logs of this agent? I can't see any deployment or pod in the same namespace as the agent, so I don't know where it is running (see the kubectl sketch just below).

Also, I see an experiment enqueued in the "default" queue. If I delete it, it "kills" the one that was enqueued in the "cpu" queue. Is this a bug?
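A minimal sketch for hunting down the agent and its logs; the namespace and resource names here are assumptions, adjust them to your install:

# the glue agent usually runs as a Deployment in the namespace the chart was installed to
kubectl -n clearml get pods | grep agent
kubectl -n clearml logs -f deployment/clearml-agent

# task pods spawned by the glue may land in a different namespace, so search them all
kubectl get pods -A | grep clearml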

  
  
Posted 2 years ago

Answers 31


I wouldn't say it's related to RBAC, because the issue looks networking-related (connection timed out)

  
  
Posted 2 years ago

Thanks Valeriano. So by copying the .kube/config file from a node where I can run kubectl, I could get kubectl commands working correctly
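For a quick test only (not a real fix), the working kubeconfig can be copied straight into the running agent pod; the pod name and paths below are placeholders:

# copy a kubeconfig that works on the node into the agent pod
kubectl cp ~/.kube/config clearml/<clearml-agent-pod>:/root/.kube/config

# then check from inside the pod that kubectl now reaches the apiserver
kubectl -n clearml exec -it <clearml-agent-pod> -- kubectl get pods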

  
  
Posted 2 years ago

try this inside the pod

  
  
Posted 2 years ago

this means network issues at some level

  
  
Posted 2 years ago

JuicyFox94 apparently, to make it work, I'll have to add a kubeconfig file, but I can't see any obvious way to mount it in the agent pod. Am I wrong?
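Not something the chart necessarily exposes, but as a generic Kubernetes sketch: a kubeconfig can be mounted from a Secret if you patch the agent Deployment by hand. Every name below is made up, and this is only the workaround path, not how the in-pod method is supposed to work:

# hypothetical fragment for the agent Deployment's pod spec;
# the Secret would first be created with:
#   kubectl -n clearml create secret generic agent-kubeconfig --from-file=config=$HOME/.kube/config
volumes:
  - name: kubeconfig
    secret:
      secretName: agent-kubeconfig
containers:
  - name: clearml-agent              # adjust to the real container name
    volumeMounts:
      - name: kubeconfig
        mountPath: /root/.kube       # yields /root/.kube/config; adjust if the image is non-root
    env:
      - name: KUBECONFIG
        value: /root/.kube/config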

  
  
Posted 2 years ago

Effectively, kubectl commands don't work from within the agent pod; I'll try to figure out why

  
  
Posted 2 years ago

but it's just a quick guess, not sure if I'm right

  
  
Posted 2 years ago

thanks for the help!

  
  
Posted 2 years ago

you can work around the issue by mounting the kubeconfig, but I guess the underlying issue still needs to be investigated

  
  
Posted 2 years ago

I suggest exec'ing into the agent pod and trying some kubectl command, like a simple kubectl get pod
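Something like this (namespace and pod name are placeholders):

# open a shell in the agent pod
kubectl -n clearml exec -it <clearml-agent-pod> -- /bin/bash   # or /bin/sh

# inside the pod, see whether the apiserver answers at all
kubectl get pods
kubectl cluster-info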

  
  
Posted 2 years ago

yep, but this is not how it should work with the in-pod method

  
  
Posted 2 years ago

you'll probably see it's not capable of doing it, and that it's related to the k8s config

  
  
Posted 2 years ago

ah I see, I'll give it a try then

  
  
Posted 2 years ago

that's what I wanted to ask: while the proper networking gets set up (I don't manage the cluster),
can I do tests using the .kube/config?

  
  
Posted 2 years ago

I don't think it's related to how the agent talks with the apiserver or fileserver. It's more about the fact that kubectl inside the agent pod cannot contact the Kubernetes apiserver

  
  
Posted 2 years ago

if it returns a 503, it's not the network but something on top of it

  
  
Posted 2 years ago

btw it’s an infra issue

  
  
Posted 2 years ago

I guess yes but honestly I’m not sure you will get the right results

  
  
Posted 2 years ago

OK, I'll report that, and see if we can get it fixed

  
  
Posted 2 years ago

just my two cents

  
  
Posted 2 years ago

accessing the apiserver from a pod doesn't require a kubeconfig

  
  
Posted 2 years ago

because kubectl inside a pod uses the in-pod (in-cluster) method

  
  
Posted 2 years ago

because while I can run kubectl commands from within the agent pod, ClearML doesn't seem to pick up the right value:

2022-08-05 12:09:47 task 29f1645fbe1a4bb29898b1e71a8b1489 pulled from 51f5309bfb1940acb514d64931ffddb9 by worker k8s-agent-cpu
2022-08-05 12:12:59 Running kubectl encountered an error: Unable to connect to the server: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
2022-08-05 15:15:07 task 29f1645fbe1a4bb29898b1e71a8b1489 pulled from 51f5309bfb1940acb514d64931ffddb9 by worker k8s-agent-cpu
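A few hedged checks from inside the pod to see which apiserver endpoint is actually being used (nothing here is ClearML-specific):

# the in-cluster endpoint kubectl and client libraries fall back to
env | grep KUBERNETES_SERVICE

# the server a mounted/explicit kubeconfig points at, if any
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'

# whether the in-cluster apiserver name resolves at all (if getent exists in the image)
getent hosts kubernetes.default.svc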

  
  
Posted 2 years ago

# Point to the internal API server hostname
APISERVER=https://kubernetes.default.svc

# Path to ServiceAccount token
SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount

# Read this Pod's namespace
NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)

# Read the ServiceAccount bearer token
TOKEN=$(cat ${SERVICEACCOUNT}/token)

# Reference the internal certificate authority (CA)
CACERT=${SERVICEACCOUNT}/ca.crt

# Explore the API with TOKEN
curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X GET ${APISERVER}/api
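For what it's worth: with healthy in-pod access that curl should come back with a small JSON document listing the API versions, so a 503 or a timeout here points at whatever sits between the pod and the apiserver rather than at ClearML itself.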

  
  
Posted 2 years ago

OK, thanks a lot, I'll try to get the networking thing sorted (and then I'm sure I'll have many more doubts 😂)

  
  
Posted 2 years ago

exactly

  
  
Posted 2 years ago

yes, the curl returned a 503 error

  
  
Posted 2 years ago

probably that URL is not accessible from your pod

  
  
Posted 2 years ago

Thanks for pitching in JuicyFox94. For the connectivity, I used the "public" names for the various servers
(e.g. we set clearml.internal.domain.name, clearml-apiserver.internal.domain.name and clearml-apiserver.internal.domain.name).

So in the agent values.yaml I set these parameters to the actual domain names:

# -- Reference to Api server url
apiServerUrlReference: " "

# -- Reference to File server url
fileServerUrlReference: " "

# -- Reference to Web server url
webServerUrlReference: " "

Should I try using the internal Kubernetes domain names instead, like
clearml-fileserver.clearml.svc.cluster.local?
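If the server chart uses the usual service naming, the internal names would look roughly like this; release name, namespace and ports are assumptions based on ClearML defaults, so verify them against kubectl get svc -n clearml:

# sketch only: assumes a "clearml" release in the "clearml" namespace and default ports
apiServerUrlReference: "http://clearml-apiserver.clearml.svc.cluster.local:8008"
fileServerUrlReference: "http://clearml-fileserver.clearml.svc.cluster.local:8081"
webServerUrlReference: "http://clearml-webserver.clearml.svc.cluster.local:8080"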

  
  
Posted 2 years ago