Hi, We Have Clearml On K8 Setup. Using The Below, We Run Dynamic Pods On The Cluster.


Hi, we have ClearML set up on k8s. Using the command below, we run dynamic pods on the cluster:
k8s_glue_example.py --queue glue_q
I pushed this code, https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py , to the glue_q queue above. It spun up multiple k8s pods for the experiments, but all of them exited with the error below:
2021-03-10 15:39:50 Collecting tensorboard-plugin-wit==1.8.0
Downloading tensorboard_plugin_wit-1.8.0-py3-none-any.whl (781 kB)
Collecting tensorflow==2.4.1
Downloading tensorflow-2.4.1-cp36-cp36m-manylinux2010_x86_64.whl (394.3 MB)
2021-03-10 15:47:37 User aborted: stopping task (3)
What could be the issue?

  				
Posted 4 years ago by DeliciousBluewhale87


Answers 11

Hi DeliciousBluewhale87

So now you don't have any failures, only a GPU usage issue? How about running the ClearML agent in docker mode? You can choose an NVIDIA Docker image, and all the CUDA installation and configuration will be part of the image.
What do you think?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)
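
For reference, docker mode is selected with the agent's `--docker` flag, as in the command quoted elsewhere in this thread. Below is a minimal sketch that assembles such a command line; the helper name `agent_docker_command` is hypothetical, and the queue and image values are simply the ones that appear in this thread (a CUDA 11.0 image is what gets suggested for TensorFlow 2.4.1).

```python
import shlex


def agent_docker_command(queue, image, foreground=True):
    """Build a clearml-agent command line that runs tasks inside `image`."""
    command = ["clearml-agent", "daemon", "--queue", queue, "--docker", image]
    if foreground:
        # Keep the agent attached to the terminal so its log is visible.
        command.append("--foreground")
    return command


# Values taken from this thread; swap in the image you actually want to test.
command = agent_docker_command("238_q", "nvidia/cuda:10.1-cudnn7-runtime")
print(shlex.join(command))
```

Keeping the image as a parameter makes it easy to retry the same queue with a different CUDA image.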

Try with a CUDA 11.0 image.

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

> So now you don't have any failures but gpu usage issue?

I didn't run hyper_parameter_optimizer.py. My thinking was that if there is already a problem with the base task, there is no use running the series of experiments.

> How about running the ClearML agent in docker mode?

Previously, we ran our clearml-agent on a bare-metal machine instead of in a docker setup, and there wasn't any issue. I haven't tried that with version 0.17.2, though.

  				
Posted 4 years ago by DeliciousBluewhale87

Let me run the clearml-agent outside the k8s system and get back to you.

  				
Posted 4 years ago by DeliciousBluewhale87

Could not load dynamic library 'libcupti.so.11.0'; dlerror: libcupti.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09:11:17.368793: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09:11:17.368810: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1415] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI could not be loaded or symbol could not be found.
It seems that it is unable to load the GPU as well. Hmm. Below is how I ran the agent:
clearml-agent daemon --queue 238_q --docker nvidia/cuda:10.1-cudnn7-runtime --force-current-version --foreground

  				
Posted 4 years ago by DeliciousBluewhale87
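
The log above asks for `libcupti.so.11.0`, while the agent was started with a CUDA 10.1 image. Here is a minimal sketch of spotting that kind of mismatch with the standard library only. The helper names are hypothetical, and the sketch assumes that the numeric suffix of the library name and the leading number in the image tag both reflect the CUDA version. That holds for the strings in this thread but is not guaranteed in general.

```python
import re


def requested_library_versions(log_text):
    """Map each versioned library TensorFlow failed to load to its version suffix."""
    pattern = re.compile(r"Could not load dynamic library '(lib\w+)\.so\.([\d.]+)'")
    return {name: version for name, version in pattern.findall(log_text)}


def image_cuda_version(image):
    """Extract the leading 'major.minor' from a tag such as nvidia/cuda:10.1-..."""
    match = re.search(r":(\d+\.\d+)", image)
    return match.group(1) if match else None


# Excerpt of the log and the image name, both copied from the post above.
log_text = (
    "Could not load dynamic library 'libcupti.so.11.0'; dlerror: libcupti.so.11.0: "
    "cannot open shared object file: No such file or directory"
)
image = "nvidia/cuda:10.1-cudnn7-runtime"

wanted = requested_library_versions(log_text)
provided = image_cuda_version(image)
for library, version in wanted.items():
    if version != provided:
        print(f"{library} wants CUDA {version}, image provides CUDA {provided}")
```

A mismatch reported here is only a hint. The library may also be missing for other reasons, such as an image variant that does not ship it.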

Hi TimelyPenguin76,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py. It seems that it didn't use the GPU; however, when I SSHed into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and ran nvidia-smi, it gave an output.

  				
Posted 4 years ago by DeliciousBluewhale87
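
A working `nvidia-smi` shows that the driver is visible inside the pod, but TensorFlow also has to open the CUDA toolkit libraries it was built against. The following is a rough sketch of checking that from inside the pod with Python's `ctypes`. The helper names are hypothetical. `libcupti.so.11.0` is taken from the error log in this thread, and `libcudart.so.11.0` is an assumption about what a CUDA 11.0 build of TensorFlow looks for.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


# libcupti.so.11.0 appears in the thread's log; libcudart.so.11.0 is an assumption.
LIBRARIES = ["libcupti.so.11.0", "libcudart.so.11.0"]


def report(libraries=LIBRARIES):
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in libraries}


if __name__ == "__main__":
    for name, found in report().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry is missing, the image or `LD_LIBRARY_PATH` in the pod is a likely place to look before digging into the training script.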

We have k8s on EC2 instances in the cloud. I'll try it there tomorrow and report back.

  				
Posted 4 years ago by DeliciousBluewhale87

👍

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

https://hub.docker.com/r/nvidia/cuda/tags?page=1&ordering=last_updated&name=11.0-

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)

👌

  				
Posted 4 years ago by DeliciousBluewhale87

Hi DeliciousBluewhale87

Can you share the version you are using? Did you get any other logs, maybe from the pod?

  				
Posted 4 years ago by TimelyPenguin76 (Administrator)
