
Hello all,
I installed a self-hosted server and a queue (consumes 1 GPU) on Kubernetes.
I have an issue with GPU monitoring.
I checked that the process is using the GPU in the pod, but GPU usage is not displayed on the WORKERS & QUEUES dashboard, whereas CPU usage is. What is wrong?
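A quick way to narrow this down is to confirm, from inside the pod, that the NVIDIA driver itself reports utilization. The sketch below is not part of ClearML and does not reproduce how the agent collects its metrics; it only assumes that nvidia-smi is on the PATH in the pod. The helper names parse_gpu_csv and query_gpus are made up for illustration.

```python
import shutil
import subprocess

# Fields requested from nvidia-smi; all four are standard --query-gpu properties.
QUERY = "index,utilization.gpu,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts.

    Values are kept as strings because the driver may report
    placeholders such as "[N/A]" instead of numbers.
    """
    gpus = []
    for line in text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip anything that is not a 4-column data row
        index, util, mem_used, mem_total = fields
        gpus.append(
            {
                "index": index,
                "util_pct": util,
                "mem_used_mib": mem_used,
                "mem_total_mib": mem_total,
            }
        )
    return gpus


def query_gpus():
    """Return parsed GPU rows, or None if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi is not available or failed in this container")
    else:
        for gpu in gpus:
            print(gpu)
```

If this prints sensible utilization values while the dashboard stays empty, the driver is visible in the pod and the problem is more likely in how the agent reads or reports GPU stats.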
image

  
  
Posted one year ago

Answers 40


Can you share the k8s logs of the pod running the agent? (not the task)

  
  
Posted one year ago

Is that the working scenario?

  
  
Posted one year ago

This was so we can see the agent log

  
  
Posted one year ago

It also shows on project detail page.
image

  
  
Posted one year ago

Try using K8S_GLUE_POD_AGENT_INSTALL_ARGS=1.5.3rc2

  
  
Posted one year ago

I ran clearml-agent manually in a GPU-enabled pod using the command clearml-agent daemon --queue shelley
and it doesn't show GPU usage, the same as when I run a task remotely.

Here is the log:

agent.worker_id =
agent.worker_name = shelley-gpu-pod
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < ‘3.10’
agent.package_manager.pip_version.1 = <22.3 ; python_version >= ‘3.10’
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.venvs_dir = /root/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.clearml/pip-cache
agent.docker_apt_cache = /root/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.default_python = 3.8
agent.cuda_version = 110
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.http.default_method = post
api.api_server = “”
api.files_server = “”
api.web_server = “”
api.credentials.access_key = “”
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> What is the task log? Do you mean the log of the pod provisioned by clearml-agent? Do you want me to share it?

  
  
Posted one year ago

root@shelley-gpu-pod:/# clearml-agent daemon --queue shelley2 --foreground
/usr/local/lib/python3.8/dist-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.0.2) or chardet (None)/charset_normalizer (3.1.0) doesn't match a supported version!
  warnings.warn(
Using environment access key CLEARML_API_ACCESS_KEY=“”
Using environment secret key CLEARML_API_SECRET_KEY=********
Current configuration (clearml_agent v1.5.2, location: None):

agent.worker_id =
agent.worker_name = shelley-gpu-pod
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < ‘3.10’
agent.package_manager.pip_version.1 = <22.3 ; python_version >= ‘3.10’
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.venvs_dir = /root/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.clearml/pip-cache
agent.docker_apt_cache = /root/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.default_python = 3.8
agent.cuda_version = 110
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.http.default_method = post
api.api_server = “”
api.files_server = “”
api.web_server = “”
api.credentials.access_key = “”
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Worker “shelley-gpu-pod:gpuGPU-207fb7da-a426-da69-7b06-3bc1ec482b7e” - Listening to queues:
+----------------------------------+----------+-------+
| id                               | name     | tags  |
+----------------------------------+----------+-------+
| 2b8bfe0dc9ae4a9f9d7bd0dd4f00ae16 | shelley2 |       |
+----------------------------------+----------+-------+

No tasks in queue 2b8bfe0dc9ae4a9f9d7bd0dd4f00ae16
No tasks in Queues, sleeping for 5.0 seconds

  
  
Posted one year ago

Hi @<1524922424720625664:profile|TartLeopard58>, how is the agent running? Is it running in k8s as well?

  
  
Posted one year ago

@<1524922424720625664:profile|TartLeopard58> this might be related to the specific AMI/docker image in which the agent is running... Can you use agent version v1.5.3rc2?

  
  
Posted one year ago

@<1523701087100473344:profile|SuccessfulKoala55> I realized that this is not a cloud vs. on-premise issue. It works well on GKE but not on EKS. Here is the log when I run the "clearml-agent daemon --queue ~" command on EKS:

root@shelley-gpu-pod:/# clearml-agent daemon --queue shelley3
/usr/local/lib/python3.8/dist-packages/requests/__init__.py:109: RequestsDependencyWarning: urllib3 (2.0.1) or chardet (None)/charset_normalizer (3.1.0) doesn't match a supported version!
  warnings.warn(
Using environment access key CLEARML_API_ACCESS_KEY=“”
Using environment secret key CLEARML_API_SECRET_KEY=********
Current configuration (clearml_agent v1.5.2, location: None):

agent.worker_id =
agent.worker_name = shelley-gpu-pod
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < ‘3.10’
agent.package_manager.pip_version.1 = <22.3 ; python_version >= ‘3.10’
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.venvs_dir = /root/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = ~/.clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.clearml/pip-cache
agent.docker_apt_cache = /root/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /clearml_agent_cache
agent.docker_internal_mounts.apt_cache = /var/cache/apt/archives
agent.docker_internal_mounts.ssh_folder = ~/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /.ssh
agent.docker_internal_mounts.pip_cache = /root/.cache/pip
agent.docker_internal_mounts.poetry_cache = /root/.cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /root/.clearml/vcs-cache
agent.docker_internal_mounts.venv_build = ~/.clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /root/.clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script =
agent.disable_task_docker_override = false
agent.default_python = 3.8
agent.cuda_version = 110
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.http.default_method = post
api.api_server = “”
api.files_server = “”
api.web_server = “”
api.credentials.access_key = “”
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Worker “shelley-gpu-pod:gpuGPU-a8a68c42-d19b-c677-5fd3-889bdce415fb” - Listening to queues:
+----------------------------------+----------+-------+
| id                               | name     | tags  |
+----------------------------------+----------+-------+
| 1a63b1506e1d4ba4b6ca290a63eceb6b | shelley3 |       |
+----------------------------------+----------+-------+

Running CLEARML-AGENT daemon in background mode, writing stdout/stderr to /tmp/.clearml_agent_daemon_outj1su2mo5.txt

  
  
Posted one year ago

Can you share it?

  
  
Posted one year ago

Sure! This one?

  
  
Posted one year ago

I tried using K8S_GLUE_POD_AGENT_INSTALL_ARGS=1.5.3rc2 instead of CLEARML_AGENT_UPDATE_VERSION=1.5.3rc2, but the result is the same. It still doesn't read GPU usage. 🥲
image

  
  
Posted one year ago

The pod log is too long. Would it be OK if I upload the pod log file here?

  
  
Posted one year ago

Are there other people experiencing the same issue as me?

  
  
Posted one year ago

Any updates?

  
  
Posted one year ago

Here are the agent and task log files.

  
  
Posted one year ago

It works on an on-premise machine (I can see GPU usage on the WORKERS & QUEUES dashboard), but it does not work on a cloud pod.

  
  
Posted one year ago

Nope, I'm just running "clearml-agent daemon --queue shelley".

  
  
Posted one year ago

Also, can you run it with --foreground and send the log again?

  
  
Posted one year ago

You can do that by passing the CLEARML_AGENT_UPDATE_VERSION=1.5.3rc2 env var

  
  
Posted one year ago

@<1524922424720625664:profile|TartLeopard58> the agent running the task is v1.5.2 (as shown in the log), so the whole point is lost - we need to see the v1.5.3rc2 or v1.5.3rc3 running there... how did you set up the helm chart for the new agent?
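To confirm which agent version actually started, the banner line in the log ("Current configuration (clearml_agent v1.5.2, location: None):") can be checked directly. Below is a minimal sketch in plain Python that uses no ClearML API; it assumes the banner keeps the format shown in the logs above, and the function name agent_version is made up for illustration.

```python
import re

# Matches the version in a banner such as:
#   Current configuration (clearml_agent v1.5.2, location: None):
# Handles plain releases (1.5.2) and release candidates (1.5.3rc2).
VERSION_RE = re.compile(r"clearml_agent v(\d+(?:\.\d+)*(?:rc\d+)?)")


def agent_version(log_text):
    """Return the first agent version found in the log text, or None."""
    match = VERSION_RE.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    banner = "Current configuration (clearml_agent v1.5.2, location: None):"
    found = agent_version(banner)
    expected = "1.5.3rc2"
    if found != expected:
        print(f"agent is running {found}, expected {expected}")
```

Running this against the task pod log shows whether the upgrade setting took effect before any GPU reporting results are compared.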

  
  
Posted one year ago

Hi again 😊 @<1523701087100473344:profile|SuccessfulKoala55> sure!
image
image

  
  
Posted one year ago

Is that the whole log?

  
  
Posted one year ago

Because clearml-agent is not installed in my GKE cluster.

  
  
Posted one year ago

For more info, I set CLEARML_AGENT_UPDATE_VERSION=1.5.3rc2 in agentk8sglue.basePodTemplate.env

  
  
Posted one year ago

But it's running, isn't it?

  
  
Posted one year ago

Yes

  
  
Posted one year ago

On-premises, using an agent?

  
  
Posted one year ago