I'm trying to use the K8s-glue agent. To do so, I followed these steps:
- Created the clearml namespace
- Created secret from template:
apiVersion: v1
kind: Secret
metadata:
  name: k8s-glue-pod-template
stringData:
  pod_template.yml: |
    apiVersion: v1
    metadata:
      namespace: clearml
    spec:
      containers:
        - resources:
            limits:
              cpu: 1000m
              memory: 4G
            requests:
              cpu: 1000m
              memory: 4G
      restartPolicy: Never
- Created a service account that is allowed to manage pods in the clearml namespace:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: clearml-service-account
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-manager-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-manager-rolebinding
subjects:
  - kind: ServiceAccount
    name: clearml-service-account
    namespace: clearml
roleRef:
  kind: Role
  name: pod-manager-role
  apiGroup: rbac.authorization.k8s.io
- Deployed the agent pod:
apiVersion: v1
kind: Pod
metadata:
  name: k8s-glue
spec:
  serviceAccountName: "clearml-service-account"
  containers:
    - name: k8s-glue-container
      image: allegroai/clearml-agent-k8s:base-1.21
      imagePullPolicy: Always
      command: [
        "/bin/bash",
        "-c",
        "source /root/.bashrc && /root/entrypoint.sh"
      ]
      volumeMounts:
        - name: pod-template
          mountPath: /root/template
      env:
        - name: CLEARML_API_HOST
          value: "
            " # Custom-port!
        - name: CLEARML_WEB_HOST
          value: "
            "
        - name: CLEARML_FILES_HOST
          value: "
            "
        # - name: K8S_GLUE_MAX_PODS
        #   value: "2"
        - name: K8S_GLUE_QUEUE
          value: "k8s-glue"
        - name: K8S_GLUE_EXTRA_ARGS
          value: "--template-yaml /root/template/pod_template.yml"
        - name: CLEARML_API_ACCESS_KEY
          value: "***"
        - name: CLEARML_API_SECRET_KEY
          value: "***"
        - name: CLEARML_WORKER_ID
          value: "k8s-glue-agent"
        - name: CLEARML_AGENT_UPDATE_REPO
          value: ""
        - name: FORCE_CLEARML_AGENT_REPO
          value: ""
        - name: CLEARML_DOCKER_IMAGE
          value: "ubuntu:22.04"
  volumes:
    - name: pod-template
      secret:
        secretName: k8s-glue-pod-template
After pushing the first experiment, I get this error:
Ex: Expecting value: line 1 column 1 (char 0)
Failed deleting completed/failed pods for ns clearml: Command '['bash', '-c', 'kubectl delete pod -l=CLEARML=agent-74b23a8f --namespace=clearml --field-selector=status.phase!=Pending,status.phase!=Running --output name']' returned non-zero exit status 127.
Even after dequeuing the experiment and leaving the queue empty, the error keeps looping.
What can be done here?
Nothing in the UI helps. The console only shows:
task a9e29945e78c43b28a9d8d1fcb2f088f pulled from 4a6c8de54dbe4fb0ae7f979611637a01 by worker k8s-glue-agent
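One guess on my side, not verified: exit status 127 is what bash returns when it cannot find a command, so kubectl may simply be missing from the agent container or from its PATH. The "Expecting value: line 1 column 1 (char 0)" message is also what Python's JSON parser raises on empty input, which would fit if the agent tries to parse the (empty) output of a kubectl call as JSON; that part is purely an assumption, since I have not read the agent source. A minimal sketch that reproduces both messages outside the cluster, using a made-up command name in place of kubectl:

```python
import json
import subprocess

# Hypothetical command name, chosen only because it should not exist on PATH.
# It stands in for kubectl inside a container where kubectl is not installed.
result = subprocess.run(
    ["bash", "-c", "definitely-not-a-real-command-xyz get pods"],
    capture_output=True,
    text=True,
)

# bash reports "command not found" with exit status 127.
exit_code = result.returncode

# Parsing the empty stdout as JSON fails with the same message as in the log.
try:
    json.loads(result.stdout)
    json_error = None
except json.JSONDecodeError as err:
    json_error = str(err)

print(exit_code)
print(json_error)
```

If this reasoning holds, running kubectl by hand inside the k8s-glue pod should fail in the same way.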