FileNotFoundError: [Errno 2] No such file or directory: 'tritonserver': 'tritonserver'
This is odd.
Can you retry with the latest from GitHub?
pip install git+
AstonishingWorm64 can you share the full log (In the UI under Results/Console there is a download button)?
That sounds like an internal tritonserver error.
https://forums.developer.nvidia.com/t/provided-ptx-was-compiled-with-an-unsupported-toolchain-error-using-cub/168292
The latest image seems to require drivers on the host 460+
try this one:
https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel_20-12.html#rel_20-12
So instead of updating the GPU drivers, can we install a lower, compatible version of CUDA inside the docker for clearml-serving?
Also, when I checked the log file I found this:
agent.default_docker.image = nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
agent.enable_task_env = false
agent.git_user =
agent.default_python = 3.8
agent.cuda_version = 112
This might be a dumb question, but I'm confused about which CUDA version is being installed here: is it 10.1 (from the first line) or 11.2 (from the last line)?
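For context, the agent-style config encodes the CUDA version as the major and minor digits run together, so `112` means 11.2 and `101` means 10.1. A minimal sketch of that decoding (the helper name is mine, not part of ClearML):

```python
def decode_agent_cuda_version(value) -> str:
    """Decode an agent-style CUDA version: 112 -> '11.2', 101 -> '10.1'.

    The last digit is the minor version; everything before it is the major.
    (Hypothetical helper for illustration, not a ClearML API.)
    """
    digits = str(value)
    return f"{digits[:-1]}.{digits[-1]}"

print(decode_agent_cuda_version(112))   # -> 11.2
print(decode_agent_cuda_version("101"))  # -> 10.1
```

So in the config above, `agent.cuda_version = 112` refers to CUDA 11.2, while the docker image line refers to a CUDA 10.1 image.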
AstonishingWorm64 I found the issue.
clearml-serving assumes the agent is working in docker mode, as it has to have the triton docker (where the triton engine is installed).
Since you are running in venv mode, tritonserver is not installed, hence the error.
(I'll make sure we reply on the issue as well later)
I already shared the log from the UI; anyway, I'm sharing the log for a recently tried experiment, please find the attachment.
Tried installing the latest clearml-serving from git, but still no luck, the same error persists.
I have attached both the serving service and serving engine (triton) console logs from clearml-server, please have a look at them.
By default clearml-serving installs triton version 21.03; can we somehow override this to install some other version? I tried to configure it but could not find anything related to tritonserver in the clearml.conf file, so can you please guide me on this?
The server that I'm using has GPU Driver Version 455.23.05 and CUDA Version 11.1. Conda is also installed, and clearml-serving is installing CUDA version 10.1, for which the GPU drivers should be >= 418.39, so I guess a version mismatch is not the problem, and currently I can't update the GPU drivers since other processes are running.
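To double-check the driver/toolkit reasoning above, here is a minimal sketch. The minimum-driver numbers are approximate values from NVIDIA's CUDA release notes for the versions discussed in this thread (the 418.39 figure for CUDA 10.1 is quoted above); the function name is mine:

```python
# Approximate minimum Linux driver versions per CUDA toolkit, from NVIDIA's
# CUDA release notes (only the versions mentioned in this thread).
MIN_DRIVER = {
    "10.1": (418, 39),
    "11.1": (455, 23),
    "11.2": (460, 27),  # why the newer triton images want 460+ host drivers
}

def driver_supports_cuda(driver: str, cuda: str) -> bool:
    """Check whether a host driver is new enough for a CUDA toolkit version."""
    installed = tuple(int(part) for part in driver.split("."))
    required = MIN_DRIVER[cuda]
    return installed[: len(required)] >= required

# Driver 455.23.05 is fine for a CUDA 11.1 container, but too old for the
# CUDA 11.2 toolkit shipped in the triton 21.03 image:
print(driver_supports_cuda("455.23.05", "11.1"))  # -> True
print(driver_supports_cuda("455.23.05", "11.2"))  # -> False
```

This matches the "provided PTX was compiled with an unsupported toolchain" symptom: the container's CUDA toolkit is newer than what the host driver can run.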
And also I tried overriding the clearml.conf file and changed the default docker image by modifying the line below:
image: "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
to this:
image: "nvidia/cuda:11.1-cudnn8-runtime-ubuntu18.04"
But still the same error, the provided PTX was compiled with an unsupported toolchain, occurred while launching the triton engine.
Hi AstonishingWorm64
I think you are correct, there is no external interface to change the docker image.
Could you open a GitHub issue so we do not forget to add an interface for that ?
As a temp hack, you can manually clone the "triton serving engine" task and edit the container image (under the Execution tab).
wdyt?
This solved the tritonserver not found issue, but now a new error is occurring: UNAVAILABLE: Internal: unable to create stream: the provided PTX was compiled with an unsupported toolchain
Please check attached log file for complete console log.
And also I am facing an issue while initializing the serving server and triton engine using the below two commands:
clearml-serving triton --project "serving" --name "serving ex1"
clearml-serving triton --endpoint "inference" --model-project "serving" --model-name "exp_v1"
So after the second command I am seeing the below error:
Error: No projects found when searching for
DevOps
But when I clubbed these two commands into a single command like below, the error disappeared, so I went on and launched the service and engine. Did this blending of commands cause the above error?
clearml-serving triton --project "serving" --name "serving ex1" --endpoint "inference" --model-project "serving" --model-name "exp_v1"
Bottom line: the driver version on the host machine does not support the CUDA version you have in the docker container.
That is a good question. Usually the CUDA version is automatically detected, unless you override it with the conf file or an OS env variable. What's the setup? Are you using conda as the package manager? (conda actually installs CUDA drivers; if the original Task was executed on a machine with conda, it will take the CUDA version automatically, the reason being to match the CUDA/Torch/TF versions.)
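For illustration, one common way to auto-detect the CUDA version on a host is to parse the `nvidia-smi` banner. This is my own minimal sketch of that idea, not the actual agent implementation:

```python
import re
from typing import Optional

def cuda_version_from_smi(smi_output: str) -> Optional[str]:
    """Extract the CUDA version reported in an `nvidia-smi` banner, if any."""
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_output)
    return match.group(1) if match else None

# Sample banner line in the format nvidia-smi prints, using the driver/CUDA
# versions reported earlier in this thread:
banner = "| NVIDIA-SMI 455.23.05   Driver Version: 455.23.05   CUDA Version: 11.1 |"
print(cuda_version_from_smi(banner))  # -> 11.1
print(cuda_version_from_smi("no gpu here"))  # -> None
```

Note that `nvidia-smi` reports the highest CUDA version the installed driver supports, which can differ from the toolkit version inside a container.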
Hi AgitatedDove14 , thanks for the reply!
It's not the same issue that you just pointed to; in fact, the issue is raised after launching inference onto the queue using the below commands:
` clearml-serving triton --project "serving" --name "serving example"
clearml-serving triton --endpoint "keras_mnist" --model-project "examples" --model-name "Keras MNIST serve example - serving_model"
clearml-serving launch --queue default `
Hi AstonishingWorm64
Is this the same ?
https://github.com/allegroai/clearml-serving/issues/1
(I think it was fixed on the later branch, we are releasing 0.3.2 later today with a fix)
Can you try:
pip install git+