Answered
Hi, We Have Clearml On K8 Setup. Using The Below, We Run Dynamic Pods On The Cluster.

Hi, we have ClearML on a k8s setup. Using the command below, we run dynamic pods on the cluster:

k8s_glue_example.py --queue glue_q

I pushed this code, https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py , to the glue_q queue above. It spun up multiple k8s pods for the experiments, but all of them exited with the error below:

2021-03-10 15:39:50 Collecting tensorboard-plugin-wit==1.8.0
Downloading tensorboard_plugin_wit-1.8.0-py3-none-any.whl (781 kB)
Collecting tensorflow==2.4.1
Downloading tensorflow-2.4.1-cp36-cp36m-manylinux2010_x86_64.whl (394.3 MB)
2021-03-10 15:47:37 User aborted: stopping task (3)

What could be the issue?
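For context, this is roughly how the optimizer example ends up targeting the glue queue, so the child experiments are what become pods (a minimal sketch based on the linked example; the template task ID, parameter range and optimizer class here are illustrative placeholders, not our exact settings):

from clearml import Task
from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

# Controller task for the optimization itself
task = Task.init(project_name='HPO example', task_name='HPO over glue_q',
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id='<template_task_id>',   # base experiment to clone (placeholder)
    hyper_parameters=[DiscreteParameterRange('General/batch_size', values=[64, 96, 128])],
    objective_metric_title='epoch_accuracy',
    objective_metric_series='epoch_accuracy',
    objective_metric_sign='max',
    optimizer_class=RandomSearch,
    execution_queue='glue_q',            # child tasks are enqueued here, so the k8s glue spins up the pods
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()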

  
  
Posted 3 years ago

Answers 11


So now you don't have any failures, but a GPU usage issue?

I didn't run hyper_parameter_optimizer.py, as I was thinking that if there is already a problem with the base experiment, there is no use in running the series of experiments.

How about running the ClearML agent in docker mode?

Previously, we had our clearml-agent run on the bare-metal machine instead of in Docker mode, and there wasn't any issue, though I haven't tried that with the 0.17.2 version.

  
  
Posted 3 years ago

Hi TimelyPenguin76,
Instead of running hyper_parameter_optimizer.py, I tried running base_template_keras_simple.py. It seems that it didn't use the GPU; however, when I ssh'd into the clearml-glueq-id-ffaf55c984ea4dbfb059387b983746ba:gpuall pod and ran nvidia-smi, it gave an output.
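In case it helps narrow things down, here is a quick check of what TensorFlow itself can see, as opposed to what nvidia-smi reports (a minimal sketch, assuming the TF 2.x API used by the base template):

import tensorflow as tf

# nvidia-smi showing the card only proves the driver is visible in the container;
# this shows whether TensorFlow actually registered a GPU device.
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_built_with_cuda())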

  
  
Posted 3 years ago

Let me run the clearml-agent outside the k8s system and get back to you.

  
  
Posted 3 years ago

Could not load dynamic library 'libcupti.so.11.0'; dlerror: libcupti.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09:11:17.368793: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcupti.so'; dlerror: libcupti.so: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2021-03-11 09:11:17.368810: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1415] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this) failed with error CUPTI could not be loaded or symbol could not be found.

It seems that it is unable to load the GPU as well. Hmm. Below is how I ran the agent:
clearml-agent daemon --queue 238_q --docker nvidia/cuda:10.1-cudnn7-runtime --force-current-version --foreground
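A small sketch that can be run inside the container to see whether libcupti is actually reachable on the loader path; my assumption is that CUPTI is not shipped in the CUDA runtime images, which would explain the warnings above:

import ctypes
import os

print('LD_LIBRARY_PATH =', os.environ.get('LD_LIBRARY_PATH'))
for name in ('libcupti.so.11.0', 'libcupti.so'):
    try:
        ctypes.CDLL(name)          # raises OSError if the library is not on the loader path
        print(name, 'loaded OK')
    except OSError as err:
        print(name, 'not loadable:', err)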

  
  
Posted 3 years ago

👍

  
  
Posted 3 years ago

👌

  
  
Posted 3 years ago

Hi DeliciousBluewhale87

So now you don't have any failures, but a GPU usage issue? How about running the ClearML agent in docker mode? You can choose an NVIDIA Docker image, and all the CUDA installation and configuration will be part of the image.
What do you think?
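If it helps, the image the agent uses in docker mode can also be set per experiment from the task code, so the CUDA stack comes from the container rather than the host (a minimal sketch; the image tag is only an example):

from clearml import Task

task = Task.init(project_name='examples', task_name='keras base template')
# Ask an agent running in docker mode to execute this task inside an NVIDIA CUDA image
task.set_base_docker('nvidia/cuda:11.0-cudnn8-runtime-ubuntu18.04')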

  
  
Posted 3 years ago

Hi DeliciousBluewhale87

Can you share the version you are using? Did you get any other logs, maybe from the pod?

  
  
Posted 3 years ago

We have k8s on EC2 instances in the cloud. I'll try it there tomorrow and report back.

  
  
Posted 3 years ago

Try with a CUDA 11.0 image.
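Presumably because the tensorflow==2.4.1 wheel from the log is built against CUDA 11.0; this can be checked from inside the environment (a quick sketch, assuming the TF 2.3+ API):

import tensorflow as tf

# Reports the CUDA/cuDNN versions this TensorFlow build expects, e.g. cuda_version '11.0' for TF 2.4.x
print(tf.sysconfig.get_build_info())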

  
  
Posted 3 years ago