okay that's good, that means the agent could run it.
Now it is a matter of matching the TF with cuda (and there is no easy solution for that). Basically I htink that what you need is "nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04"
docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash
worked, I could run in it nvidia-smi
and see gpu 0
BTW: the above error is a mismatch between the TF and the docker, TF is looking for cuda 10, and the docker contains cuda 11
These both point to nvidia docker runtime installation issue.
I'm assuming that in both cases you cannot run the docker manually as well, which is essentially what the agent will have to do ...
tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.0 causes not using the GPUs because of missing libs.
the error your are citing happens when running clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda
ClumsyElephant70 the odd thing is the error here:docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash
)?
One more thing: The dockerized version is still not working as I want it to. If I use any specific docker image like docker: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04
on a host machine with NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3
I always get a similar error as above where a lib is missing. If I use the example from http://clear.ml clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda
I always get this error docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
AgitatedDove14 I created a new clean venv and freshly installed the clearml-agent
under python / pip 3.8 and now it is working again. Still don't know what caused this issue. Thank you very much for helping!
okay this seems like a broken pip install python3.6
Can you verify it fails on another folder (maybe it's a permissions thing, for example if you run in docker mode, then the permissions will be root, as the docker is creating those folders)
python3.6 -m virtualenv /home/tobias_vitt/.clearml/venvs-builds/3.6
returns StopIteration:
ClumsyElephant70
Can you manually run the same command ?['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']
Basically:python3.6 -m virtualenv /home/user/.clearml/venvs-builds/3.6'
` 2021-05-06 13:46:34.032391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:a1:00.0 name: NVIDIA Quadro RTX 8000 computeCapability: 7.5
coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s
2021-05-06 13:46:34.032496: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-05-06 13:46:34.032593: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-05-06 13:46:34.032660: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-05-06 13:46:34.045898: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-05-06 13:46:34.049645: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-05-06 13:46:34.072485: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-05-06 13:46:34.072783: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-05-06 13:46:34.072973: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-05-06 13:46:34.073003: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at for how to download and setup the required libraries for your platform.
Skipping registering GPU devices... `
When using clearml-agent daemon --queue default --docker
it is running. In this case I always had some issues when adding the --gpu
flag.
yes, this one is running in venv and not docker, because I had some issues with cuda and docker. The virtualenv==20.4.6
in the requirements.txt. I think it broke after installing clearml-serving
in the same env.
ClumsyElephant70
Could it be virtualenv package is not installed on the host machine ?
(From the log it seems you are running in venv mode, is that correct?)
` Process failed, exit code 1task ab1a90dacb9042eea8e4a6a16640d7f4 pulled from 8f06b6b160c14a3591d791c1885b309e by worker test:gpu1
Running task 'ab1a90dacb9042eea8e4a6a16640d7f4'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.kbkz1n40.txt', '/tmp/.clearml_agent_out.kbkz1n40.txt'
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.3e6l7juj.cfg):
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
agent.worker_id = test:gpu1
agent.worker_name = test
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/user/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/user/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/user/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/user/.clearml/pip-cache
agent.docker_apt_cache = /home/user/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.git_user = *****
agent.default_python = 3.6
agent.cuda_version = 113
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://***:8008
api.web_server = http://***:8080
api.files_server = http://***:8081
api.credentials.access_key = ***
api.host = http://***:8008
Executing task id [ab1a90dacb9042eea8e4a6a16640d7f4]:
repository = https://***/clear-ml-test
branch = ***
version_num = b863ee781ee2e29121f8f5045c4bde6709702fba
tag =
docker_cmd = None
entry_point = stage_hello_world.py
working_dir = .
StopIteration:
clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.
DONE: Running task 'ab1a90dacb9042eea8e4a6a16640d7f4', exit status 1 `
Hi AgitatedDove14 , I get an error when running a task on my worker. I have looked into /home/user/.clearml/venvs-builds
but it is empty. Any idea why this happens? I actually don’t know what I changed to cause this issue… I’m running clearml-agent v1.0.0
clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.
TenseOstrich47 it's based on free "index" so the first index not in used will be captured, but if you remove agents, then the order will change e.g. you take down worker #1 , the next worker you spin will be #1 becuase it is not taken)
are the envs named after the worker enumeration? e.g. venv-bulds-0 is linked to worker 0?
ohhh ok. so I can actually remove this if those workers are no longer in use
TenseOstrich47 every agent instance has its own venv copy. Obviously every new experiment will remove the old venv and create a new one. Make sense?