Hi All! I Recently Started Working With Clearml Serving. I Got This Example Working

Hi all! I recently started working with ClearML Serving. I got this example working https://github.com/allegroai/clearml-serving/tree/main/examples/pytorch and now want to serve my own model. The problem is that I always get the following error: clearml-serving-triton E0126 14:19:09.743189 35 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory . I serve my model by calling clearml-serving --id 433aa14db3f545ad852ddf746e24dcf0 model add --engine triton --endpoint "test_model_pytorch" --preprocess "preprocess.py" --name "test_model" --project "Body Position Detection" --input-size "[1, 64]" --input-name "INPUT__0" --input-type float32 --output-size "[1, 32]" --output-name "OUTPUT__0" --output-type float32 . Now I wonder why I have to specify a config.pbtxt, as the example does not require this, and I thought specifying the input and output name, size, etc. via the command line should be enough. Additionally, I do not know where to place the config.pbtxt file.
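For context, from the Triton documentation I understand the file it is looking for would contain something roughly like the following (my own sketch based on the sizes I passed on the command line, not a file generated by clearml-serving):

name: "test_model_pytorch"
platform: "pytorch_libtorch"
input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 1, 64 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1, 32 ]
  }
]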

  
  
Posted one year ago

Answers 22


Ok, so I killed all Docker containers (the proposal by ChatGPT did not work for me, but your commands did). The result is that we have one less warning: the warning clearml-serving-triton | Warning: more than one valid Controller Tasks found, using Task ID=4709b0b383a04bb1a033e99fd325dcbf seems to be solved. All remaining errors come up in the clearml-serving-triton service, and this is the log I get:

CLEARML_SERVING_TASK_ID=9309c20af9244d919b0f063642198c57
CLEARML_TRITON_POLL_FREQ=1.0
CLEARML_TRITON_METRIC_FREQ=1.0
CLEARML_TRITON_HELPER_ARGS=
CLEARML_EXTRA_PYTHON_PACKAGES=
clearml-serving - Nvidia Triton Engine Controller
ClearML Task: created new task id=ad7bd1d205a24f3086ad4cdc9a94017d
2023-01-27 08:15:50,264 - clearml.Task - INFO - No repository found, storing script code instead
WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
I0127 08:15:56.498773 34 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0127 08:15:56.498849 34 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0127 08:15:56.498856 34 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2023-01-27 08:15:56.868725: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0127 08:15:56.904341 34 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0127 08:15:56.904373 34 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0127 08:15:56.904380 34 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0127 08:15:56.904384 34 tensorflow.cc:2221] backend configuration:
{}
I0127 08:15:56.918633 34 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0127 08:15:56.918656 34 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0127 08:15:56.918659 34 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0127 08:15:56.918662 34 onnxruntime.cc:2446] backend configuration:
{}
I0127 08:15:56.935321 34 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0127 08:15:56.935344 34 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0127 08:15:56.935348 34 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
W0127 08:15:56.936061 34 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0127 08:15:56.937483 34 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0127 08:15:56.939850 34 server.cc:549] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0127 08:15:56.939906 34 server.cc:576] 
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
+-------------+-------------------------------------------------------------------------+--------+

I0127 08:15:56.940411 34 server.cc:619] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

Error: Failed to initialize NVML
W0127 08:15:56.945560 34 metrics.cc:571] DCGM unable to start: DCGM initialization error
I0127 08:15:56.946387 34 tritonserver.cc:2123] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.21.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | /models                                                                                                                                                                                      |
| model_control_mode               | MODE_POLL                                                                                                                                                                                    |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                    |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0127 08:15:56.950157 34 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
I0127 08:15:56.951006 34 http_server.cc:3242] Started HTTPService at 0.0.0.0:8000
I0127 08:15:56.992744 34 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002
E0127 08:23:57.017606 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
E0127 08:24:57.019191 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
E0127 08:25:57.019860 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
E0127 08:26:57.020321 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
E0127 08:27:57.021140 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
E0127 08:28:57.021939 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
E0127 08:29:57.022943 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
  
  
Posted one year ago

Ok, I have found the issue. 🙌 When I try to serve a model that is saved on Azure (generated by Task.init(..., output_uri='azure://...') ) I get the Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory error. A model that was saved on the ClearML server (generated by Task.init(..., output_uri=True) ) can be served without any problems.
For now I am not sure why this is the case, as saving the model on Azure works without any problems and I am using the same credentials for model serving provided in the example.env file. I will dig deeper into that.
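For completeness, this is roughly how the two setups differ on the training side (the task name and the Azure account/container path below are placeholders, not my real ones):

from clearml import Task

# Failing case: model checkpoints are uploaded to Azure blob storage.
# The storage account and container are placeholders.
task = Task.init(
    project_name="Body Position Detection",
    task_name="train body position model",
    output_uri="azure://myazureaccount.blob.core.windows.net/models",
)

# Working case: with output_uri=True the model is uploaded to the ClearML
# file server instead, and clearml-serving can serve it without problems.
# task = Task.init(..., output_uri=True)

Everything else in the training script stays the same.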

  
  
Posted one year ago

By the way, the example which worked for me in the beginning now also produces the same error: poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory . So there really seems to be something wrong with the Docker containers.

  
  
Posted one year ago

I got it working!! For now I am not sure what did the trick because I tried a bunch of different things. But I will try to reproduce it and come back to this thread for other users facing this problem. So big thanks for your help, @<1523701118159294464:profile|ExasperatedCrab78> !

  
  
Posted one year ago

Wow! Awesome to hear :D

  
  
Posted one year ago

Hi @<1523701118159294464:profile|ExasperatedCrab78> , I have a sad update on this issue. It does not seem to be completely solved yet. 😕 But I think I can at least describe it a bit better now:

  • Models which are located on the ClearML server (created by Task.init(..., output_uri=True) ) still run perfectly.
  • Models which are located on Azure blob storage cause different problems in different scenarios (which made me think we had resolved this issue):
  • When I start the Docker container, add a model from the ClearML server, and afterwards add a model located on Azure (on the same endpoint), I get no error and all my HTTP requests are answered properly.
  • When I start the Docker container with no model added and first add a model from Azure, I get this error: poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory .
  • When I start a Docker container where a model from Azure was already added before, I get this error:
clearml-serving-triton        | Updating local model folder: /models
clearml-serving-triton        | Error retrieving model ID ca186e8440b84049971a0b623df36783 []
clearml-serving-triton        | Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
clearml-serving-triton        | Traceback (most recent call last):
clearml-serving-triton        |   File "clearml_serving/engines/triton/triton_helper.py", line 540, in <module>
clearml-serving-triton        |     main()
clearml-serving-triton        |   File "clearml_serving/engines/triton/triton_helper.py", line 532, in main
clearml-serving-triton        |     helper.maintenance_daemon(
clearml-serving-triton        |   File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
clearml-serving-triton        |     raise ValueError("triton-server process ended with error code {}".format(error_code))
clearml-serving-triton        | ValueError: triton-server process ended with error code 1

Side note: In the meantime I also set up the Docker containers on a Linux server and get the same error as on my Windows computer with Docker Desktop.

I am not sure if this is really about passing the Azure credentials, because I feel like I have tried all the possibilities that are suggested online. For a final try I wrote my Azure account and storage key directly into the docker-compose-triton.yml with this syntax: AZURE_STORAGE_ACCOUNT: $(AZURE_STORAGE_ACCOUNT:-myazureaccount} At least this should work, or am I wrong? I would really appreciate hearing your thoughts on this.
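For reference, as far as I understand it, the standard Docker Compose interpolation syntax for these entries would be the following (the fallback value after :- is just a placeholder):

    AZURE_STORAGE_ACCOUNT: ${AZURE_STORAGE_ACCOUNT:-myazureaccount}
    AZURE_STORAGE_KEY: ${AZURE_STORAGE_KEY:-}

i.e. ${VAR:-default} with curly braces on both sides, rather than $(...).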

  
  
Posted one year ago

You're very welcome, thank you again for the great support. :)) I followed the instructions in the clearml-serving README on GitHub. There is one section called 'Optional: advanced setup - S3/GS/Azure access'. Maybe the syntax could be added there? I also saw the additional link about configuring storage access, but that page focuses on setting up clearml.conf and I was not sure how, and whether, I could transfer it to the Docker .env file.
Also, of course, it would be great to get more hints about the cause in the error message itself. But since I'm not technically into the serving services, I don't know to what extent this would be possible.

  
  
Posted one year ago

I got the last bit of my issue solved. I thought for a start it would be easier to provide the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_KEY in my example.env in plain text rather than reading them from my environment variables, because I was not sure about the syntax. It turns out the syntax is not AZURE_STORAGE_KEY="mystoragekey123" but AZURE_STORAGE_KEY=mystoragekey123 (no quotes). Same for AZURE_STORAGE_ACCOUNT . Also, the syntax for referencing my environment variables is just the same as in clearml.conf, so I now use AZURE_STORAGE_KEY=${AZURE_STORAGE_KEY} and it works just fine.
So I guess there is no need for a GitHub issue?
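For anyone else who runs into this, the relevant part of my example.env now looks roughly like this (the account name and key are placeholders):

AZURE_STORAGE_ACCOUNT=myazureaccount
AZURE_STORAGE_KEY=mystoragekey123

or, when forwarding the values from my shell environment:

AZURE_STORAGE_ACCOUNT=${AZURE_STORAGE_ACCOUNT}
AZURE_STORAGE_KEY=${AZURE_STORAGE_KEY}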

  
  
Posted one year ago

Hmm, I think we might have to make it clearer in the documentation then? What would have helped you before you figured it out? (Great job BTW, thanks for the updates on it :))

  
  
Posted one year ago

Wow, awesome! Really nice find! Would you mind compiling your findings into a GitHub issue? Then we can help you search better :) This info is enough to get us going, at least!

  
  
Posted one year ago

I think you are correct with your guess that the services were not shut down properly. I noticed that some services were still shown as running on the ClearML dashboard. I aborted all of them and at least got rid of the error ValueError: triton-server process ended with error code 1 . But the two errors you named are still there, and I also got these two warnings:
clearml-serving-triton | Warning: more than one valid Controller Tasks found, using Task ID=4709b0b383a04bb1a033e99fd325dcbf
clearml-serving-triton | WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
And the first warning is definitely right! I started a lot of service controllers by calling clearml-serving create --name "serving example" , as I thought I would need to in order to get a fresh startup. And honestly, I do not know how to shut them down. Even after restarting my computer they are still running. Could this be the problem? Do you have a solution for this warning?
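The only idea I have so far is to try marking the stray controller tasks as stopped via the ClearML Python API, roughly like this (untested on my side; the project name is a guess, and the task ID is just the one from the warning above):

from clearml import Task

# Guess: the controller tasks carry the name passed to `clearml-serving create`.
controllers = Task.get_tasks(project_name="DevOps", task_name="serving example")

keep_id = "4709b0b383a04bb1a033e99fd325dcbf"  # the controller referenced in the warning
for t in controllers:
    if t.id != keep_id and t.get_status() in ("in_progress", "queued"):
        print(f"Aborting stray controller task {t.id}")
        t.mark_stopped(force=True)  # mark the task as stopped on the server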
Again, regarding your advice: I am wondering why the GPU is being allocated at all, as I trained the model on the CPU and used docker-compose-triton.yml (not docker-compose-triton-gpu.yml) to start the Docker container. Nevertheless, my machine has a GPU installed which I used for some testing, so it works in general, but I am not sure about the Docker container.

  
  
Posted one year ago

Doing this might actually help with the previous issue as well, because when there are multiple docker containers running they might interfere with each other 🙂

  
  
Posted one year ago

Hey @<1526371965655322624:profile|NuttyCamel41> Thanks for coming back on this and sorry for the late reply. This looks like a bug indeed, especially because it seems to be working when the model comes from the ClearML servers.

Would you mind just copy-pasting this info into a GitHub issue on the clearml-serving repo? Then we can track the progress we make at fixing it 🙂

  
  
Posted one year ago

Thank you so much, sorry for the inconvenience and thank you for your patience! I've pushed it internally and we're looking for a patch 🙂

  
  
Posted one year ago

Hi @<1523701118159294464:profile|ExasperatedCrab78> , thanks for your answer. 🙂 Yes sure! I will create the issue right away.

  
  
Posted one year ago

Ok, I have some weird update... I shut down and restarted the Docker container just to get fresh logs, and now I am getting the following error message from clearml-serving-triton:
clearml-serving-triton | clearml-serving - Nvidia Triton Engine Controller
clearml-serving-triton | Warning: more than one valid Controller Tasks found, using Task ID=433aa14db3f545ad852ddf846e25dcf0
clearml-serving-triton | ClearML Task: overwriting (reusing) task id=350a5a919ff648148a3de4483878f52f
clearml-serving-triton | 2023-01-26 15:41:41,507 - clearml.Task - INFO - No repository found, storing script code instead
clearml-serving-triton | WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
clearml-serving-triton | I0126 15:41:48.077867 34 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
clearml-serving-triton | I0126 15:41:48.077927 34 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
clearml-serving-triton | I0126 15:41:48.077932 34 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
clearml-serving-triton | 2023-01-26 15:41:48.210347: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
clearml-serving-triton | I0126 15:41:48.239499 34 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
clearml-serving-triton | I0126 15:41:48.239547 34 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
clearml-serving-triton | I0126 15:41:48.239552 34 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
clearml-serving-triton | I0126 15:41:48.239554 34 tensorflow.cc:2221] backend configuration:
clearml-serving-triton | {}
clearml-serving-triton | I0126 15:41:48.252847 34 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
clearml-serving-triton | I0126 15:41:48.252884 34 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
clearml-serving-triton | I0126 15:41:48.252888 34 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
clearml-serving-triton | I0126 15:41:48.252891 34 onnxruntime.cc:2446] backend configuration:
clearml-serving-triton | {}
clearml-serving-triton | I0126 15:41:48.266838 34 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
clearml-serving-triton | I0126 15:41:48.266874 34 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
clearml-serving-triton | I0126 15:41:48.266878 34 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
clearml-serving-triton | W0126 15:41:48.266897 34 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
clearml-serving-triton | I0126 15:41:48.266909 34 cuda_memory_manager.cc:115] CUDA memory pool disabled
clearml-serving-triton | E0126 15:41:48.267022 34 model_repository_manager.cc:2064] Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory
clearml-serving-triton | I0126 15:41:48.267101 34 server.cc:549]
clearml-serving-triton | +------------------+------+
clearml-serving-triton | | Repository Agent | Path |
clearml-serving-triton | +------------------+------+
clearml-serving-triton | +------------------+------+
clearml-serving-triton |
clearml-serving-triton | I0126 15:41:48.267129 34 server.cc:576]
clearml-serving-triton | +-------------+-------------------------------------------------------------------------+--------+
clearml-serving-triton | | Backend     | Path                                                                    | Config |
clearml-serving-triton | +-------------+-------------------------------------------------------------------------+--------+
clearml-serving-triton | | pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
clearml-serving-triton | | tensorflow | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so | {} |
clearml-serving-triton | | onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {} |
clearml-serving-triton | | openvino | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {} |
clearml-serving-triton | +-------------+-------------------------------------------------------------------------+--------+
clearml-serving-triton |
clearml-serving-triton | I0126 15:41:48.267161 34 server.cc:619]
clearml-serving-triton | +-------+---------+--------+
clearml-serving-triton | | Model | Version | Status |
clearml-serving-triton | +-------+---------+--------+
clearml-serving-triton | +-------+---------+--------+
clearml-serving-triton |
clearml-serving-triton | Error: Failed to initialize NVML
clearml-serving-triton | W0126 15:41:48.268464 34 metrics.cc:571] DCGM unable to start: DCGM initialization error
clearml-serving-triton | I0126 15:41:48.268671 34 tritonserver.cc:2123]
clearml-serving-triton | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
clearml-serving-triton | | Option                           | Value |
clearml-serving-triton | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
clearml-serving-triton | | server_id                        | triton |
clearml-serving-triton | | server_version                   | 2.21.0 |
clearml-serving-triton | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
clearml-serving-triton | | model_repository_path[0]         | /models |
clearml-serving-triton | | model_control_mode               | MODE_POLL |
clearml-serving-triton | | strict_model_config              | 1 |
clearml-serving-triton | | rate_limit                       | OFF |
clearml-serving-triton | | pinned_memory_pool_byte_size     | 268435456 |
clearml-serving-triton | | response_cache_byte_size         | 0 |
clearml-serving-triton | | min_supported_compute_capability | 6.0 |
clearml-serving-triton | | strict_readiness                 | 1 |
clearml-serving-triton | | exit_timeout                     | 30 |
clearml-serving-triton | +----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
clearml-serving-triton |
clearml-serving-triton | I0126 15:41:48.268710 34 server.cc:250] Waiting for in-flight requests to complete.
clearml-serving-triton | I0126 15:41:48.268717 34 server.cc:266] Timeout 30: Found 0 model versions that have in-flight inferences
clearml-serving-triton | I0126 15:41:48.268722 34 server.cc:281] All models are stopped, unloading models
clearml-serving-triton | I0126 15:41:48.268727 34 server.cc:288] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
clearml-serving-triton | error: creating server: Internal - failed to load all models
clearml-serving-triton | ClearML results page:
clearml-serving-triton | configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='ec2c71ce833a4f91b8b29ed5ea68d6d4', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
clearml-serving-triton | String Triton Helper service
clearml-serving-triton | {'serving_id': 'ec2c71ce833a4f91b8b29ed5ea68d6d4', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
clearml-serving-triton |
clearml-serving-triton | Updating local model folder: /models
clearml-serving-triton | Error retrieving model ID bd4fdc00180642ddb73bfb3d377b05f1 []
clearml-serving-triton | Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
clearml-serving-triton | Traceback (most recent call last):
clearml-serving-triton | File "clearml_serving/engines/triton/triton_helper.py", line 540, in <module>
clearml-serving-triton | main()
clearml-serving-triton | File "clearml_serving/engines/triton/triton_helper.py", line 532, in main
clearml-serving-triton | helper.maintenance_daemon(
clearml-serving-triton | File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
clearml-serving-triton | raise ValueError("triton-server process ended with error code {}".format(error_code))
clearml-serving-triton | ValueError: triton-server process ended with error code 1
clearml-serving-triton exited with code 1

  
  
Posted one year ago

I can see 2 kinds of errors:
Error: Failed to initialize NVML and Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
These 2 lines make me think something went wrong with the GPU itself. Chances are you won't be able to run nvidia-smi, so this looks like a non-ClearML issue 🙂 It might be that Triton hogs the GPU memory if it is not properly shut down (double Ctrl-C). It says the driver version is not correct? Could that be?

Second:
You still get the Poll failed for model directory 'test_model_pytorch': failed to open text file for read /models/test_model_pytorch/config.pbtxt: No such file or directory error though 😞

  
  
Posted one year ago

Hi ExasperatedCrab78 , thanks for your answer! In fact, I used your recommended format for passing the input and output size before and changed it during my debugging process. I have just tried again but got the same error message.
Also thanks for the hint to check the log for warnings; I will do this in a moment.

  
  
Posted one year ago

What might also help is to look inside the Triton Docker container while it's running. You can check the example; there should be a pbtxt file in there. Just to double-check that it is also in your own folder.
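Something like this should do it (the container name is a guess on my side, check docker ps for the actual one):

docker exec -it clearml-serving-triton ls -R /models

Each model directory in there should contain a config.pbtxt next to the serialized model file.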

  
  
Posted one year ago

I'll update you once I have more!

  
  
Posted one year ago

Yes, with Docker, auto-starting containers is definitely a thing 🙂 We set the containers to restart automatically (a reboot will do that too), so that when a container crashes it immediately restarts, say in a production environment.

So the best thing to do there is to use docker ps to get all running containers and then kill them using docker kill <container_id> . ChatGPT tells me this command should kill all currently running containers:
docker rm -f $(docker ps -aq)
And I think it is correct 🙂

  
  
Posted one year ago

Hi NuttyCamel41 !

Your suspicion is correct: there should be no need to specify the config.pbtxt manually; normally this file is generated automatically from the information you provide on the command line.

It might be somehow silently failing to parse your CLI input to correctly build the config.pbtxt . One difference I see immediately is that you opted for the "[1, 64]" notation instead of the 1 64 notation from the example. It might be worth a try to change the input format. Also, please look carefully for any ClearML warnings or error logs when running the CLI command.
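For example, something along these lines (untested, just your original command with the sizes written in the example's notation):

clearml-serving --id 433aa14db3f545ad852ddf746e24dcf0 model add --engine triton --endpoint "test_model_pytorch" --preprocess "preprocess.py" --name "test_model" --project "Body Position Detection" --input-size 1 64 --input-name "INPUT__0" --input-type float32 --output-size 1 32 --output-name "OUTPUT__0" --output-type float32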

  
  
Posted one year ago