The way I understand it is that K8s glue agent is enabled by default (and I do see a Deployment for
clearml-k8sagent
SarcasticSquirrel56
Good start, when you say you see the Task in ""k8s_scheduler" queue, originally did you enqueue it to "default" ?
so I assume clearml moves them from one queue to the other?
AgitatedDove14 I used the default configuration from the helm chart for the k8s glue.
The way I understand it is that K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent
unfortunately I can't get info from the cluster
In the ClearML ui it stays in a Pending state
I can see the outputs from argo, so I know if some resource has been created but I can't inspect the full logs,
the ones I have available are all records similar toNo tasks in queue 80247f703053470fa60718b4dff7a576
Hi Martin, I'll try to get the logs on Monday, though the K8s configuration doesn't "scare" me, I can solve that with my colleagues.
But I'll share it if it helps debug the issue
SarcasticSquirrel56 I think that can be quite easily added - can you please open a GitHub issue in https://github.com/allegroai/clearml-helm-charts ?
Yes, on the GitHub repo readme there is a brief description of the k8s-glue, but that's it.
Pretty confusing that neither services
nor scheduled
work out of the box to be honest
This is good news, that means the k8s glue created a k8s job and pushed the Task into the "k8s_scheduler" queue, for visibility (i.e. it is now the k8s job to launch the pod).
Can you check on the Task Info tab what is the status/message ? (it should reflect the k8s pod status)
StickyLizard47 apologies for the https://github.com/allegroai/clearml-server/issues/140 not being followed (probably slipped through the cracks of backend guys, I can see the 1.5 release happened in parallel). Let me make sure it is followed.
SarcasticSquirrel56 specifically, did you also spin a clearml-k8s glue? or are the agents statically allocated on the helm chart?
is no agent listening to the "k8s_scheduler"
There should not be one, this is purely "virtual" , so users understand the k8s cluster is spinning their pod (sometimes it takes time, imagine EKS etc. , just visibility)
unfortunately I can't get info from the cluster
You should be able the pod in the cluster no?!
What's the Task Info panel say, can you share a screen shot ?
okay this points to an issue with the k8s glue, I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue ?
Martin I told you I can't access the resources in the cluster unfortunately
And I see that it is moved to the k8s_scheduler one instead (though I see that in the "default" queue I do have jobs)
Yes, I add it to "default" queue (which is the one used in the config file for the k8 glue agent)
After trying Gaspard changes to the helm chart values, I do now see that also a pod for the agentservice is deployed,
And some of the logs point to a misconfigurations on my side (the fact it can't access resources externally),
some others I don't understand:Err:1
bionic InRelease Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out Err:2
bionic-updates InRelease Unable to connect to archive.ubuntu.com:http: Err:3
bionic-backports InRelease Unable to connect to archive.ubuntu.com:http: Err:4
bionic-security InRelease Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out Reading package lists... W: Failed to fetch
Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out W: Failed to fetch
Unable to connect to archive.ubuntu.com:http: W: Failed to fetch
Unable to connect to archive.ubuntu.com:http: W: Failed to fetch
Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out W: Some index files failed to download. They have been ignored, or old ones used instead. Reading package lists... Building dependency tree... Reading state information... E: Unable to locate package curl E: Unable to locate package python3-pip E: Unable to locate package git /bin/sh: 1: python3: not found /bin/sh: 1: python3: not found /bin/sh: 1: clearml-agent: not found
I think it's because the proxy env var are not passed to the container (I thought they were the same as the extraArgs from the agentservice, but it doesn't look like that's the case)
Thanks, adding environment variables to the agentservice solved it, but for the agentgroup agent, I can't see any obvious way to inject environment variables. In the helm chart template I don't see any way to pass custom environment variables to the pod
Pretty confusing that neither
services
StickyLizard47 basically this is how a services queue agent should be spinned:
https://github.com/allegroai/clearml-server/blob/9b108740da21f25407bd2c59583ca1c86f8e1faa/docker/docker-compose.yml#L123
When spinning on a k8s cluster, this is a bit more complicated, as it needs to work with the clearml-k8s-glue.
See here how to spin it on k8s
https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
But as Gaspard was saying, with the default config there is no agent listening to the "k8s_scheduler" queue with the default settings
Click on the "k8s_schedule" queue, then on the right hand side, you should see your Task, click on it, it will open the Task page. There click on the "Info" Tab, there look for "STATUS MESSAGE" and "STATUS REASON". What do you have there?
If you find out more or get an official explaination, please keep me posted 🙂
By the way, after fixing the agentservice issue, and having the pod configured correctly, now I see an error in the agentgroup-cpu pod, because it says that the token is not the correct one:
http://:8081 http://:8080
`
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b00438d0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043a20>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043c18>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043d68>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043eb8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:
Using environment access key CLEARML_API_ACCESS_KEY=ENP39EQM4SLACGD5FXB7
Using environment secret key CLEARML_API_SECRET_KEY=********
clearml_agent: ERROR: Failed getting token (error 401 from ): Unauthorized (invalid credentials) (failed to locate provided credentials) `
think it's because the proxy env var are not passed to the container ...
Yes this seems correct, the errors point to a network issues, i.e. the container does not seem to be able to connect to the clearml-server
Martin I told you I can't access the resources in the cluster unfortunately
😞
so it seems there is some misconfiguration of the k8s glue, because we can see it can "talk" to the clearml-server, but it seems it fails to actually create the k8s pod/job. I would start with debugging the k8s glue (not the services agents). Regardless, I think the next step is to get a log of the k8s glue pod, and better understand the issue.
wdyt?