Hi there, our team started using ClearML a few months ago and we've recently deployed an AWS EKS k8s cluster with the hopes of deploying a clearml-agent. I've been able to install the agent on the cluster using:

Answered

Hi there,

Our team started using clearml a few months ago and we've recently deployed an AWS EKS k8s cluster with the hopes of deploying a clearml-agent. I've been able to install the agent on the cluster using:

helm install clearml-agent-gpu allegroai/clearml-agent \
    --set clearml.agentk8sglueKey=<removed> \
    --set clearml.agentk8sglueSecret=<removed> \
    --set agentk8sglue.defaultContainerImage="python:3.11-bullseye" \
    --set agentk8sglue.nodeSelector.nodegroup="clearml-agent" \
    --set agentk8sglue.queue="gpu-queue" \
    --set agentk8sglue.basePodTemplate.nodeSelector.nodegroup="clearml-gpu" \
    --set agentk8sglue.basePodTemplate.env[0].name=CLEARML_AGENT_GIT_USER \
    --set agentk8sglue.basePodTemplate.env[0].value=username \
    --set agentk8sglue.basePodTemplate.env[1].name=CLEARML_AGENT_GIT_PASS \
    --set agentk8sglue.basePodTemplate.env[1].valueFrom.secretKeyRef.name=git-password \
    --set agentk8sglue.basePodTemplate.env[1].valueFrom.secretKeyRef.key=git-password \
    --set clearml.clearmlConfig="
    agent {
      package_manager: {
        type: poetry;
        poetry_version: 1.8.2
      }
      force_git_ssh_protocol: false;
      disable_requirements_auto_install: true
    }"

I'm currently encountering two issues:

1. --set agentk8sglue.defaultContainerImage="python:3.11-bullseye" does not seem to change the container that gets used. Here are the logs from the pod:

Executing task id [9296dbfe38384daf958911e9155a8bca]:
repository = git@gitlab.com:<company_git_repo>.git
branch = <branch_name>
version_num = f0052fa186cab812a4aa07c05e088d466eb41ff7
tag =
docker_cmd = ubuntu:18.04
entry_point = main.py
working_dir = scope-ml/scope_mllib/training

Python executable with version '3.11' requested by the Task, not found in path, using '/usr/bin/python3' (v3.6.9) instead

It still tries to use 'ubuntu:18.04', am I doing this correctly?
2. I've created a k8s secret with the gitlab personal access token, but it seems like it is still unable to git pull the repo that is needed. Here are the logs from the pod:

cloning: git@gitlab.com:<company_git_repo>.git
Using user/pass credentials - replacing ssh url 'git@gitlab.com:<company_git_repo>.git' with https url '<company_git_repo>.git'
pulling git
Using SSH credentials - replacing https url '<company_git_repo>.git' with ssh url '<company_git_repo>.git'
fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
git pull failed: Command '['git', 'fetch', '--all', '--tags', '--recurse-submodules']' returned non-zero exit status 1.
Repository cloning failed: Command '['git', 'fetch', '--all', '--tags', '--recurse-submodules']' returned non-zero exit status 1.
Task failed: stopping task (4) exception

Any assistance would be much appreciated thanks!!
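The first line of the clone log above describes an SSH remote being rewritten to an HTTPS one so that user/password credentials can be used. A minimal sketch of that kind of rewrite, assuming the common `git@host:path` form; this only illustrates the transformation and is not clearml-agent's implementation. The function name `ssh_to_https` and the example repository path are hypothetical.

```python
import re

# Matches the scp-like SSH form used in the log: git@<host>:<path>
_SSH_URL = re.compile(r"^git@(?P<host>[^:]+):(?P<path>.+)$")


def ssh_to_https(url: str) -> str:
    """Rewrite 'git@host:path' to 'https://host/path'; return other URLs unchanged."""
    match = _SSH_URL.match(url)
    if match is None:
        return url
    return f"https://{match.group('host')}/{match.group('path')}"


# Hypothetical repository path, used only for illustration.
print(ssh_to_https("git@gitlab.com:group/project.git"))
```

With an HTTPS remote, git needs a username and password (or token) from somewhere other than a terminal prompt, which is the situation the "terminal prompts disabled" line in the log refers to.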

  				
Posted 6 months ago by AlertReindeer55

Answers 5

AlertReindeer55 hi! Were you able to fix the second issue?

  				
Posted 4 months ago by VexedWoodpecker50

CostlyOstrich36, my container section is completely empty and unspecified.

The only place I can see "ubuntu:18.04" being specified is in the clearml-agent helm chart defaults, but the whole point of me running --set agentk8sglue.defaultContainerImage="python:3.11-bullseye" is that it's supposed to override that default.

  				
Posted 6 months ago by AlertReindeer55

AlertReindeer55, I think what SuccessfulKoala55 means is that you can also set the docker image at the experiment level. If you go to the "EXECUTION" tab of the experiment, you might see an image in the container section.

  				
Posted 6 months ago by CostlyOstrich36

SuccessfulKoala55, where or how is this specified? We are not setting this image anywhere. We are trying to override it with a different image, "python:3.11-bullseye". If my current way of overriding is incorrect, what is the correct way to do it?

  				
Posted 6 months ago by AlertReindeer55

Hi AlertReindeer55,
This:

Executing task id [9296dbfe38384daf958911e9155a8bca]:
repository = git@gitlab.com:<company_git_repo>.git
branch = <branch_name>
version_num = f0052fa186cab812a4aa07c05e088d466eb41ff7
tag =
docker_cmd = ubuntu:18.04
entry_point = main.py
working_dir = scope-ml/scope_mllib/training

This basically says the agent found the ubuntu:18.04 image specified on the task itself, which will always override any default container setting.
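To make the precedence concrete, here is a minimal sketch of the rule as described in this answer: an image set on the task wins, and the agent's default is used only when the task has none. This is an illustration of the rule, not the agent's actual code, and the function name `resolve_container_image` is hypothetical.

```python
from typing import Optional


def resolve_container_image(task_image: Optional[str], agent_default: str) -> str:
    """Pick the image to run: a non-empty task-level image overrides the default."""
    if task_image and task_image.strip():
        return task_image.strip()
    return agent_default


# Task carries an image, so the default from the Helm values is ignored.
print(resolve_container_image("ubuntu:18.04", "python:3.11-bullseye"))
# Task has no image, so the default applies.
print(resolve_container_image("", "python:3.11-bullseye"))
```

Under this reading, the value passed via agentk8sglue.defaultContainerImage only takes effect once the task's own container field is empty, so the value reported as docker_cmd in the log is the one to change or clear.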

  				
Posted 6 months ago by SuccessfulKoala55
