Answered

Hi folks, I did a deployment of ClearML using the K8s Helm chart, and I set up the agent using K8s glue.

I ran a task locally, then went to the UI, cloned the experiment, and scheduled it in the default queue.
After doing this, I see that the experiment is queued in the "k8s_scheduler" queue and stays in a Pending state.

Any idea what might be causing the issue?

  
  
Posted 2 years ago

Answers 31


The way I understand it is that the K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent)

SarcasticSquirrel56 Good start. When you say you see the Task in the "k8s_scheduler" queue, did you originally enqueue it to "default"?

  
  
Posted 2 years ago

So I assume ClearML moves them from one queue to the other?

  
  
Posted 2 years ago

AgitatedDove14 I used the default configuration from the Helm chart for the k8s glue.
The way I understand it is that the K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent).

  
  
Posted 2 years ago

Unfortunately I can't get info from the cluster.

  
  
Posted 2 years ago

In the ClearML UI it stays in a Pending state.

  
  
Posted 2 years ago

I can see the outputs from Argo, so I know whether some resource has been created, but I can't inspect the full logs.
The ones I have available are all records similar to:
No tasks in queue 80247f703053470fa60718b4dff7a576
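For reference, a queue ID like the one in that log line can be resolved to its name with the ClearML Python API client. A minimal sketch, assuming valid server credentials are already configured (the queue ID below is simply the one from the log):

```python
from clearml.backend_api.session.client import APIClient

# Assumes ~/clearml.conf (or the CLEARML_API_* environment variables) holds valid credentials.
client = APIClient()

# Map the queue ID seen in the k8s glue log back to a human-readable queue name.
for queue in client.queues.get_all():
    if queue.id == "80247f703053470fa60718b4dff7a576":
        print(queue.name)
```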

  
  
Posted 2 years ago

Hi Martin, I'll try to get the logs on Monday, though the K8s configuration doesn't "scare" me; I can solve that with my colleagues.
But I'll share it if it helps debug the issue.

  
  
Posted 2 years ago

SarcasticSquirrel56 I think that can be quite easily added - can you please open a GitHub issue in https://github.com/allegroai/clearml-helm-charts ?

  
  
Posted 2 years ago

Yes, in the GitHub repo README there is a brief description of the k8s glue, but that's it.
Pretty confusing that neither services nor scheduled tasks work out of the box, to be honest.

  
  
Posted 2 years ago

sounds good 🙂

  
  
Posted 2 years ago

This is good news, it means the k8s glue created a k8s job and pushed the Task into the "k8s_scheduler" queue for visibility (i.e. it is now up to the k8s job to launch the pod).
Can you check on the Task Info tab what the status/message is? (It should reflect the k8s pod status.)
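If the UI is inconvenient, the same fields can be read with the ClearML Python SDK. A minimal sketch, assuming valid credentials are configured and the task ID is copied from the experiment page (the ID below is a placeholder):

```python
from clearml import Task

# "your_task_id" is a placeholder; copy the real ID from the experiment page in the UI.
task = Task.get_task(task_id="your_task_id")

print("status:        ", task.get_status())
# status_message / status_reason mirror the STATUS MESSAGE / STATUS REASON
# fields shown on the Task Info tab (they should reflect the k8s pod state).
print("status message:", task.data.status_message)
print("status reason: ", task.data.status_reason)
```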

  
  
Posted 2 years ago

StickyLizard47 apologies for https://github.com/allegroai/clearml-server/issues/140 not being followed up (it probably slipped through the cracks on the backend team's side, I can see the 1.5 release happened in parallel). Let me make sure it is followed up.
SarcasticSquirrel56 specifically, did you also spin up a clearml-k8s glue? Or are the agents statically allocated in the Helm chart?

  
  
Posted 2 years ago

is no agent listening to the "k8s_scheduler"

There should not be one; this queue is purely "virtual", so users understand that the k8s cluster is spinning up their pod (sometimes that takes time, imagine EKS etc., it is just for visibility).

unfortunately I can't get info from the cluster

You should be able to see the pod in the cluster, no?!
What does the Task Info panel say? Can you share a screenshot?
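For anyone who does have cluster access, checking whether the glue actually created a pod for the task can be done with the Kubernetes Python client. A minimal sketch; the "clearml" namespace is an assumption, adjust it to wherever the chart was installed:

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

# List pods in the namespace where the ClearML chart was deployed ("clearml" is assumed here).
for pod in v1.list_namespaced_pod(namespace="clearml").items:
    print(pod.metadata.name, pod.status.phase)
```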

  
  
Posted 2 years ago

Okay, this points to an issue with the k8s glue, I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue?
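A minimal sketch for pulling that log with the Kubernetes Python client, for whoever does have cluster access (the pod name prefix and the namespace are assumptions, match them to the actual deployment):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Find the k8s glue agent pod by name prefix ("clearml-k8sagent" is assumed) and dump its log.
for pod in v1.list_namespaced_pod(namespace="clearml").items:
    if pod.metadata.name.startswith("clearml-k8sagent"):
        print(v1.read_namespaced_pod_log(name=pod.metadata.name, namespace="clearml"))
```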

  
  
Posted 2 years ago

I see it in the UI...

  
  
Posted 2 years ago

(image attached)

  
  
Posted 2 years ago

Martin I told you I can't access the resources in the cluster unfortunately

  
  
Posted 2 years ago

And I see that it is moved to the k8s_scheduler one instead (though I see that in the "default" queue I do have jobs)

  
  
Posted 2 years ago

Yes, I added it to the "default" queue (which is the one used in the config file for the k8s glue agent).

  
  
Posted 2 years ago

After trying Gaspard's changes to the Helm chart values, I now see that a pod for the agentservice is also deployed.
Some of the logs point to a misconfiguration on my side (the fact that it can't access resources externally);
others I don't understand:
Err:1 bionic InRelease
  Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out
Err:2 bionic-updates InRelease
  Unable to connect to archive.ubuntu.com:http:
Err:3 bionic-backports InRelease
  Unable to connect to archive.ubuntu.com:http:
Err:4 bionic-security InRelease
  Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out
  Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out
Reading package lists...
W: Failed to fetch  Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out
W: Failed to fetch  Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch  Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch  Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package curl
E: Unable to locate package python3-pip
E: Unable to locate package git
/bin/sh: 1: python3: not found
/bin/sh: 1: python3: not found
/bin/sh: 1: clearml-agent: not found

  
  
Posted 2 years ago

I think it's because the proxy env vars are not passed to the container (I thought they were the same as the extraArgs from the agentservice, but it doesn't look like that's the case).
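One quick way to confirm that suspicion is to print the proxy-related environment variables from inside the container (e.g. from a small script or an interactive shell in the pod). A minimal sketch:

```python
import os

# If these all print None inside the agent container, the proxy settings are not being injected.
for var in ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY", "http_proxy", "https_proxy", "no_proxy"):
    print(f"{var}={os.environ.get(var)}")
```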

  
  
Posted 2 years ago

Thanks, adding environment variables to the agentservice solved it. But for the agentgroup agent I can't see any obvious way to inject environment variables; the Helm chart template doesn't seem to provide a way to pass custom environment variables to the pod.

  
  
Posted 2 years ago

Pretty confusing that neither services

StickyLizard47 basically this is how a services queue agent should be spun up:
https://github.com/allegroai/clearml-server/blob/9b108740da21f25407bd2c59583ca1c86f8e1faa/docker/docker-compose.yml#L123
When spinning it up on a k8s cluster, this is a bit more complicated, as it needs to work with the clearml-k8s-glue.
See here how to spin it up on k8s:
https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue

  
  
Posted 2 years ago

But as Gaspard was saying, with the default settings there is no agent listening to the "k8s_scheduler" queue.

  
  
Posted 2 years ago

Click on the "k8s_scheduler" queue, then on the right-hand side you should see your Task. Click on it to open the Task page. There, click on the "Info" tab and look for "STATUS MESSAGE" and "STATUS REASON". What do you have there?

  
  
Posted 2 years ago

If you find out more or get an official explanation, please keep me posted 🙂

  
  
Posted 2 years ago

By the way, after fixing the agentservice issue and having the pod configured correctly, I now see an error in the agentgroup-cpu pod, because it says that the token is not the correct one:

http://:8081 http://:8080

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b00438d0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/

WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043a20>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/

WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043c18>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/

WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043d68>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/

WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043eb8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:

Using environment access key CLEARML_API_ACCESS_KEY=ENP39EQM4SLACGD5FXB7

Using environment secret key CLEARML_API_SECRET_KEY=********

clearml_agent: ERROR: Failed getting token (error 401 from ): Unauthorized (invalid credentials) (failed to locate provided credentials)

  
  
Posted 2 years ago

think it's because the proxy env var are not passed to the container ...

Yes, this seems correct; the errors point to a network issue, i.e. the container does not seem to be able to connect to the clearml-server.
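A quick connectivity check from inside the container can confirm this. A minimal sketch using the server's debug.ping endpoint; the CLEARML_API_HOST default below is only a placeholder, use whatever API server URL the agent is actually configured with:

```python
import os
import requests

# The default is a placeholder; the real value comes from the agent's configuration.
api_server = os.environ.get("CLEARML_API_HOST", "http://clearml-apiserver:8008")

try:
    # debug.ping is a lightweight health-check endpoint exposed by the ClearML API server.
    resp = requests.get(f"{api_server}/debug.ping", timeout=10)
    print(resp.status_code, resp.text)
except requests.RequestException as exc:
    print("Cannot reach the ClearML API server:", exc)
```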

  
  
Posted 2 years ago

I will!

  
  
Posted 2 years ago

Martin I told you I can't access the resources in the cluster unfortunately

😞
So it seems there is some misconfiguration of the k8s glue: we can see it can "talk" to the clearml-server, but it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod and better understand the issue.
wdyt?

  
  
Posted 2 years ago