Answered
Hello All, I Installed Self-Hosted Server And Queue (Consumes 1 GPU) On Kubernetes. I Have An Issue Regarding GPU Monitoring. I Checked The Process Is Using The GPU In The Pod, But GPU Usage Is Not Being Displayed On The Workers & Queues Dashboard, Whereas CPU Usage Is

Hello all,
I installed the self-hosted server and a queue (consumes 1 GPU) on Kubernetes.
I have an issue regarding GPU monitoring.
I checked that the process is using the GPU in the pod, but GPU usage is not being displayed on the WORKERS & QUEUES dashboard, whereas CPU usage is. What is wrong?
image

  
  
Posted one year ago

Answers 40


This is the clearml-agent Helm chart values.yaml file I used for the installation.
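For context, a hedged sketch of how such an install is usually done, assuming the public clearml-helm-charts repository; the release name, namespace and values file path are placeholders, and the actual values.yaml from this thread is not reproduced here:

# Hedged sketch: repo URL and chart name are taken from the public ClearML
# Helm charts; release name, namespace and values file are placeholders.
helm repo add clearml https://clearml.github.io/clearml-helm-charts
helm repo update
helm install clearml-agent clearml/clearml-agent -n clearml --create-namespace -f values.yaml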

  
  
Posted one year ago

Are there other people experiencing the same issue as me?

  
  
Posted one year ago

I set CLEARML_AGENT_UPDATE_VERSION=1.5.3rc2 in agentk8sglue.basePodTemplate.env, as I mentioned
image
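For reference, a hedged sketch of what that setting could look like as a Helm values override, using the agentk8sglue.basePodTemplate.env path mentioned above and the standard Kubernetes env-list layout; the release and namespace names are placeholders:

# Hedged example: write a values override pinning the agent version, then
# apply it to the existing clearml-agent release (names are placeholders).
cat > agent-version-override.yaml <<'EOF'
agentk8sglue:
  basePodTemplate:
    env:
      - name: CLEARML_AGENT_UPDATE_VERSION
        value: "1.5.3rc2"
EOF
helm upgrade clearml-agent clearml/clearml-agent -n clearml --reuse-values -f agent-version-override.yaml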

  
  
Posted one year ago

You can do that by passing the CLEARML_AGENT_UPDATE_VERSION=1.5.3rc2 env var

  
  
Posted one year ago

@<1524922424720625664:profile|TartLeopard58> the agent running the task is v1.5.2 (as shown in the log), so the whole point is lost - we need to see the v1.5.3rc2 or v1.5.3rc3 running there... how did you set up the helm chart for the new agent?

  
  
Posted one year ago

It looks like the log was truncated - can you please get the full log from the agent running on the cloud machine?

  
  
Posted one year ago

Can you share the k8s logs of the pod running the agent? (not the task)
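For reference, something along these lines pulls that log; the namespace and pod name are placeholders for the actual deployment:

# Hedged example: namespace and pod name are placeholders.
kubectl -n clearml get pods
kubectl -n clearml logs <agent-pod-name> > agent-pod.log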

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> yes. It only occurs when running on the cloud. It’s fine when running on-premises.

  
  
Posted one year ago

is that the working scenario?

  
  
Posted one year ago

because clearml-agent is not installed in my GKE cluster

  
  
Posted one year ago

on-premises using an agent?

  
  
Posted one year ago

@<1524922424720625664:profile|TartLeopard58> did you include the task log after setting the agent to v1.5.3rc2?

  
  
Posted one year ago

image

  
  
Posted one year ago

but it's running, isn't it?

  
  
Posted one year ago

Hi again 😊 @<1523701087100473344:profile|SuccessfulKoala55> sure!
image
image

  
  
Posted one year ago

For more info, I set CLEARML_AGENT_UPDATE_VERSION=1.5.3rc2 in agentk8sglue.basePodTemplate.env

  
  
Posted one year ago

Oh, it’s not an issue with EKS. We had the same issue on an on-premises cluster too (clearml-agent is installed). Could it be caused by the installed clearml-agent?

  
  
Posted one year ago

Here is the log when executing with --foreground, but is there any difference?

  
  
Posted one year ago

Nope, just running clearml-agent daemon --queue shelley

  
  
Posted one year ago

This was so we could see the agent log

  
  
Posted one year ago

Yes

  
  
Posted one year ago

It also shows on the project detail page.
image

  
  
Posted one year ago

Also, can you run it with --foreground and send the log again?
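For reference, the foreground run could look like this inside the pod; the queue name is taken from this thread and the output file path is just an example:

# Run the agent in the foreground and keep the full output in a file.
clearml-agent daemon --queue shelley --foreground 2>&1 | tee /tmp/agent-foreground.log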

  
  
Posted one year ago

I tried the suggestion you mentioned, but it’s the same, and it doesn’t seem to be an AMI issue. The same problem occurs even in an on-premises environment.

  
  
Posted one year ago

Try using K8S_GLUE_POD_AGENT_INSTALL_ARGS=1.5.3rc2
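A hedged way to try that without re-rendering the chart is to inject the variable into the running agent deployment directly; the deployment name and namespace below are guesses, not values confirmed in this thread:

# Hypothetical example: deployment name and namespace are assumptions.
kubectl -n clearml set env deployment/clearml-agent K8S_GLUE_POD_AGENT_INSTALL_ARGS=1.5.3rc2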

  
  
Posted one year ago

any updates?..

  
  
Posted one year ago

I ran clearml-agent manually in a GPU-enabled pod using the command clearml-agent daemon --queue shelley,
and this doesn’t show GPU usage either, the same as when I run the task remotely.

and here is the log:

agent.worker_id =
agent.worker_name = shelley-gpu-pod
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >= '3.10'
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.venvs_dir = /root/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.clearml/pip-cache
agent.docker_apt_cache = /root/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.default_python = 3.8
agent.cuda_version = 110
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.http.default_method = post
api.api_server = ""
api.files_server = ""
api.web_server = ""
api.credentials.access_key = ""
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
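One hedged sanity check worth running inside that pod: GPU reporting generally requires the NVIDIA driver tooling to be visible to the container the agent runs in (this is an assumption about the reporting path, not something confirmed in this thread):

# Hedged check inside the agent pod: if these fail, GPU metrics cannot be read.
nvidia-smi
ls /dev/nvidia*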

  
  
Posted one year ago

Is that the whole log?

  
  
Posted one year ago

Here are the agent and task log files!

  
  
Posted one year ago

It is working on the on-premises machine (I can see GPU usage on the WORKERS & QUEUES dashboard), but it is not working on the cloud pod.

  
  
Posted one year ago