Hi Martin, I'll try to get the logs on Monday, though the K8s configuration doesn't "scare" me, I can solve that with my colleagues.
But I'll share it if it helps debug the issue
unfortunately I can't get info from the cluster
so I assume clearml moves them from one queue to the other?
AgitatedDove14 I used the default configuration from the helm chart for the k8s glue.
The way I understand it is that K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent)
But as Gaspard was saying, with the default config there is no agent listening to the "k8s_scheduler" queue
I can see the outputs from argo, so I know if some resource has been created, but I can't inspect the full logs;
the ones I have available are all records similar to: No tasks in queue 80247f703053470fa60718b4dff7a576
This is good news, that means the k8s glue created a k8s job and pushed the Task into the "k8s_scheduler" queue for visibility (i.e. it is now up to k8s to launch the pod).
Can you check on the Task Info tab what the status/message is? (it should reflect the k8s pod status)
If you find out more or get an official explanation, please keep me posted 🙂
By the way, after fixing the agentservice issue, and having the pod configured correctly, now I see an error in the agentgroup-cpu pod, because it says that the token is not the correct one:
http://:8081 http://:8080
`
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b00438d0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043a20>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043c18>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043d68>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fa4b0043eb8>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/clearml-agent/
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead:
Using environment access key CLEARML_API_ACCESS_KEY=ENP39EQM4SLACGD5FXB7
Using environment secret key CLEARML_API_SECRET_KEY=********
clearml_agent: ERROR: Failed getting token (error 401 from ): Unauthorized (invalid credentials) (failed to locate provided credentials) `
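The log above mixes two different failures: pip timing out while reaching pypi.org, and the API server rejecting the credentials with a 401. A minimal sketch for telling them apart when scanning agent pod logs. The function name `classify_agent_log` and the category labels are made up for illustration, and the patterns are just the strings seen in the log above, not an official list of clearml-agent messages.

```python
import re

# Patterns taken from the log lines quoted above; labels are illustrative only.
PATTERNS = {
    "network_timeout": re.compile(r"ConnectTimeoutError|connection timed out", re.IGNORECASE),
    "invalid_credentials": re.compile(r"error 401|Unauthorized"),
}


def classify_agent_log(text):
    """Count the log lines that match each known failure pattern."""
    counts = {label: 0 for label in PATTERNS}
    for line in text.splitlines():
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


if __name__ == "__main__":
    # Two lines abbreviated from the log above, "(...)" marks the parts cut out.
    sample = "\n".join([
        "WARNING: Retrying (...) after connection broken by 'ConnectTimeoutError(...)': /simple/clearml-agent/",
        "clearml_agent: ERROR: Failed getting token (error 401 from ): Unauthorized (invalid credentials)",
    ])
    print(classify_agent_log(sample))
```

If both counters are non-zero, as here, the two problems are probably worth chasing separately: outbound network access for the pod on one side, and the access/secret key pair on the other.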
Martin I told you I can't access the resources in the cluster unfortunately
Yes, on the GitHub repo readme there is a brief description of the k8s-glue, but that's it.
Pretty confusing that neither services nor scheduled work out of the box, to be honest
so I assume clearml moves them from one queue to the other?
Correct. When it creates the k8s job and launches it on the cluster it moves it into the queue.
Can you see it on your k8s cluster (meaning the job/pod)?
The way I understand it is that K8s glue agent is enabled by default (and I do see a Deployment for clearml-k8sagent)
SarcasticSquirrel56 Good start. When you say you see the Task in the "k8s_scheduler" queue, did you originally enqueue it to "default"?
And I see that it is moved to the k8s_scheduler one instead (though I see that in the "default" queue I do have jobs)
Pretty confusing that neither services
StickyLizard47 basically this is how a services queue agent should be spun up:
https://github.com/allegroai/clearml-server/blob/9b108740da21f25407bd2c59583ca1c86f8e1faa/docker/docker-compose.yml#L123
When spinning on a k8s cluster, this is a bit more complicated, as it needs to work with the clearml-k8s-glue.
See here how to spin it on k8s
https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
Thanks, adding environment variables to the agentservice solved it, but for the agentgroup agent, I can't see any obvious way to inject environment variables. In the helm chart template I don't see any way to pass custom environment variables to the pod
okay this points to an issue with the k8s glue, I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue ?
After trying Gaspard's changes to the helm chart values, I do now see that a pod for the agentservice is also deployed,
and some of the logs point to a misconfiguration on my side (the fact it can't access resources externally),
some others I don't understand:
Err:1 bionic InRelease
  Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out
Err:2 bionic-updates InRelease
  Unable to connect to archive.ubuntu.com:http:
Err:3 bionic-backports InRelease
  Unable to connect to archive.ubuntu.com:http:
Err:4 bionic-security InRelease
  Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out
  Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out
Reading package lists...
W: Failed to fetch
  Could not connect to archive.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to archive.ubuntu.com:80 (185.125.190.39), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to archive.ubuntu.com:80 (91.189.91.39), connection timed out
W: Failed to fetch
  Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch
  Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch
  Could not connect to security.ubuntu.com:80 (185.125.190.36), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.38), connection timed out
  Could not connect to security.ubuntu.com:80 (91.189.91.39), connection timed out
  Could not connect to security.ubuntu.com:80 (185.125.190.39), connection timed out
W: Some index files failed to download. They have been ignored, or old ones used instead.
Reading package lists...
Building dependency tree...
Reading state information...
E: Unable to locate package curl
E: Unable to locate package python3-pip
E: Unable to locate package git
/bin/sh: 1: python3: not found
/bin/sh: 1: python3: not found
/bin/sh: 1: clearml-agent: not found
think it's because the proxy env var are not passed to the container ...
Yes, this seems correct. The errors point to a network issue, i.e. the container does not seem to be able to connect to the clearml-server
In the ClearML ui it stays in a Pending state
I think it's because the proxy env vars are not passed to the container (I thought they were the same as the extraArgs from the agentservice, but it doesn't look like that's the case)
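If the missing proxy variables are indeed the cause, the agent container would need `env` entries for them in its pod spec. A minimal sketch that turns the proxy variables of the current environment into the `name`/`value` list a Kubernetes container spec expects. The helper name `proxy_env_entries` is hypothetical, and how (or whether) the helm chart lets you inject such a list is not covered here.

```python
import os

# Conventional proxy variable names; tools differ on upper vs lower case,
# so both spellings are checked.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY",
              "http_proxy", "https_proxy", "no_proxy")


def proxy_env_entries(environ=None):
    """Return Kubernetes-style env entries for the proxy variables that are set."""
    environ = os.environ if environ is None else environ
    return [
        {"name": name, "value": environ[name]}
        for name in PROXY_VARS
        if environ.get(name)  # skip unset and empty variables
    ]


if __name__ == "__main__":
    import json
    # Prints a JSON list that could be pasted under a container's `env:` key.
    print(json.dumps(proxy_env_entries(), indent=2))
```

Running it on a machine where the proxy works gives a quick list to compare against what the failing pod actually receives.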
Yes, I add it to the "default" queue (which is the one used in the config file for the k8s glue agent)
StickyLizard47 apologies for the https://github.com/allegroai/clearml-server/issues/140 not being followed (probably slipped through the cracks of backend guys, I can see the 1.5 release happened in parallel). Let me make sure it is followed.
SarcasticSquirrel56 specifically, did you also spin a clearml-k8s glue? or are the agents statically allocated on the helm chart?
Click on the "k8s_scheduler" queue, then on the right-hand side you should see your Task. Click on it and it will open the Task page. There, click on the "Info" tab and look for "STATUS MESSAGE" and "STATUS REASON". What do you have there?
is no agent listening to the "k8s_scheduler"
There should not be one, this is purely "virtual", so users understand the k8s cluster is spinning up their pod (sometimes it takes time, imagine EKS etc., it is just for visibility)
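To make that flow concrete, here is a toy in-memory model: the glue pulls a Task from the queue it listens on, creates a k8s job for it, and then parks the Task in "k8s_scheduler" purely for visibility, so no agent ever needs to listen there. This only illustrates the behaviour described in this thread, it is not how the clearml-k8s-glue is implemented, and `ToyGlue` and its methods are invented names.

```python
from collections import deque


class ToyGlue:
    """Toy stand-in for the k8s glue; not the real clearml-agent code."""

    def __init__(self, listen_queue="default", visibility_queue="k8s_scheduler"):
        self.queues = {listen_queue: deque(), visibility_queue: deque()}
        self.listen_queue = listen_queue
        self.visibility_queue = visibility_queue
        self.jobs = []  # names of the (pretend) k8s jobs created so far

    def enqueue(self, task_id):
        """A user enqueues a Task into the queue the glue listens on."""
        self.queues[self.listen_queue].append(task_id)

    def poll_once(self):
        """Pull one Task, create a job for it, park it in the visibility queue."""
        if not self.queues[self.listen_queue]:
            return None  # nothing queued, nothing to launch
        task_id = self.queues[self.listen_queue].popleft()
        self.jobs.append(f"job-{task_id}")
        self.queues[self.visibility_queue].append(task_id)
        return task_id


if __name__ == "__main__":
    glue = ToyGlue()
    glue.enqueue("task-1")
    glue.poll_once()
    print({name: list(q) for name, q in glue.queues.items()}, glue.jobs)
```

In this model a Task sitting in "k8s_scheduler" means a job already exists for it, so a Task stuck in Pending points at the job or pod rather than at a missing agent.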
unfortunately I can't get info from the cluster
You should be able to see the pod in the cluster, no?!
What does the Task Info panel say? Can you share a screenshot?
SarcasticSquirrel56 I think that can be quite easily added - can you please open a GitHub issue in https://github.com/allegroai/clearml-helm-charts ?