Hey All. Quick Question About The

Answered

Hey all. Quick question about the ~/.clearml directory on the clearml-agent. It seems to be filling up disk storage quite quickly
440M /home/ubuntu/.clearml/venvs-builds.4 440M /home/ubuntu/.clearml/venvs-builds.3 441M /home/ubuntu/.clearml/venvs-builds.1Is this expected behaviour?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					TenseOstrich47
				
					0
					 × 1

Votes Newest

Answers 25

okay that's good, that means the agent could run it.
Now it is a matter of matching the TF with cuda (and there is no easy solution for that). Basically I htink that what you need is "nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04"

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash worked, I could run in it nvidia-smi and see gpu 0

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

BTW: the above error is a mismatch between the TF and the docker, TF is looking for cuda 10, and the docker contains cuda 11

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

These both point to nvidia docker runtime installation issue.
I'm assuming that in both cases you cannot run the docker manually as well, which is essentially what the agent will have to do ...

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.0 causes not using the GPUs because of missing libs.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

the error your are citing happens when running clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

ClumsyElephant70 the odd thing is the error here:
docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash )?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

One more thing: The dockerized version is still not working as I want it to. If I use any specific docker image like docker: nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 on a host machine with NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 I always get a similar error as above where a lib is missing. If I use the example from http://clear.ml clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda I always get this error docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

AgitatedDove14 I created a new clean venv and freshly installed the clearml-agent under python / pip 3.8 and now it is working again. Still don't know what caused this issue. Thank you very much for helping!

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

okay this seems like a broken pip install python3.6
Can you verify it fails on another folder (maybe it's a permissions thing, for example if you run in docker mode, then the permissions will be root, as the docker is creating those folders)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

without content

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

python3.6 -m virtualenv /home/tobias_vitt/.clearml/venvs-builds/3.6 returns StopIteration:

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

ClumsyElephant70
Can you manually run the same command ?
['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']Basically:
python3.6 -m virtualenv /home/user/.clearml/venvs-builds/3.6'

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

` 2021-05-06 13:46:34.032391: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:

pciBusID: 0000:a1:00.0 name: NVIDIA Quadro RTX 8000 computeCapability: 7.5

coreClock: 1.77GHz coreCount: 72 deviceMemorySize: 47.46GiB deviceMemoryBandwidth: 625.94GiB/s

2021-05-06 13:46:34.032496: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-05-06 13:46:34.032593: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublas.so.11'; dlerror: libcublas.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-05-06 13:46:34.032660: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcublasLt.so.11'; dlerror: libcublasLt.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-05-06 13:46:34.045898: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10

2021-05-06 13:46:34.049645: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10

2021-05-06 13:46:34.072485: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10

2021-05-06 13:46:34.072783: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusparse.so.11'; dlerror: libcusparse.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-05-06 13:46:34.072973: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64

2021-05-06 13:46:34.073003: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at for how to download and setup the required libraries for your platform.

Skipping registering GPU devices... `

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

When using clearml-agent daemon --queue default --docker it is running. In this case I always had some issues when adding the --gpu flag.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

yes, this one is running in venv and not docker, because I had some issues with cuda and docker. The virtualenv==20.4.6 in the requirements.txt. I think it broke after installing clearml-serving in the same env.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

ClumsyElephant70
Could it be virtualenv package is not installed on the host machine ?
(From the log it seems you are running in venv mode, is that correct?)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

` Process failed, exit code 1task ab1a90dacb9042eea8e4a6a16640d7f4 pulled from 8f06b6b160c14a3591d791c1885b309e by worker test:gpu1
Running task 'ab1a90dacb9042eea8e4a6a16640d7f4'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.kbkz1n40.txt', '/tmp/.clearml_agent_out.kbkz1n40.txt'
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.3e6l7juj.cfg):

sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
agent.worker_id = test:gpu1
agent.worker_name = test
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.package_manager.torch_nightly = false
agent.venvs_dir = /home/user/.clearml/venvs-builds
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /home/user/.clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /home/user/.clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /home/user/.clearml/pip-cache
agent.docker_apt_cache = /home/user/.clearml/apt-cache
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.git_user = *****
agent.default_python = 3.6
agent.cuda_version = 113
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://***:8008
api.web_server = http://***:8080
api.files_server = http://***:8081
api.credentials.access_key = ***
api.host = http://***:8008

Executing task id [ab1a90dacb9042eea8e4a6a16640d7f4]:
repository = https://***/clear-ml-test
branch = ***
version_num = b863ee781ee2e29121f8f5045c4bde6709702fba
tag =
docker_cmd = None
entry_point = stage_hello_world.py
working_dir = .

StopIteration:

clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.

DONE: Running task 'ab1a90dacb9042eea8e4a6a16640d7f4', exit status 1 `

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

Can you send the full log ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hi AgitatedDove14 , I get an error when running a task on my worker. I have looked into /home/user/.clearml/venvs-builds but it is empty. Any idea why this happens? I actually don’t know what I changed to cause this issue… I’m running clearml-agent v1.0.0

clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ClumsyElephant70
				
					0
					 × 1

TenseOstrich47 it's based on free "index" so the first index not in used will be captured, but if you remove agents, then the order will change e.g. you take down worker #1 , the next worker you spin will be #1 becuase it is not taken)

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

are the envs named after the worker enumeration? e.g. venv-bulds-0 is linked to worker 0?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					TenseOstrich47
				
					0
					 × 1

ohhh ok. so I can actually remove this if those workers are no longer in use

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					TenseOstrich47
				
					0
					 × 1

TenseOstrich47 every agent instance has its own venv copy. Obviously every new experiment will remove the old venv and create a new one. Make sense?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Write your answer

2K Views

25 Answers

4 years ago

2 years ago