Hi guys, I am trying to upload and serve a pre-existing 3rd-party PyTorch model inside my ClearML cluster. However, after following the sequence of operations suggested by the official docs (and later even by GPT o3), I am getting errors which I cannot solve.
AgitatedDove14 Please correct me if I am wrong: are you currently proposing the following sequence:
- On the device that hosts the ClearML server, I should have my model file in any directory.
- Then, I should upload it directly to the ClearML model repository as an `OutputModel`?
Because today I tried to upload the model using the following script:
```
from clearml import Task, OutputModel
# Step 1: Initialize a Task
task = Task.init(project_name="LogSentinel", task_name="Upload and register Output DeepLog Model from .69 locally")
# Step 2: Specify the local path to the model file
weights_filename = "/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"
# Step 3: Create a new OutputModel and upload the weights
output_model = OutputModel(task=task, name="Output deeplog_bilstm")
output_model.set_upload_destination("file:///home/<username>/models/")
uploaded_uri = output_model.update_weights(weights_filename=weights_filename)
# Step 4: Publish the model
output_model.publish()
print(f"Model successfully registered. Uploaded URI: {uploaded_uri}")```
The model was registered with the following output:
```
python register_model.py
ClearML Task: created new task id=87619de0726d4b10afa13529b3789ffa
ClearML results page:
2025-02-23 00:51:49,203 - clearml.Task - INFO - No repository found, storing script code instead
2025-02-23 00:51:49,738 - clearml.Task - INFO - Completed model upload to file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth
Model successfully registered. Uploaded URI: file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
```
- Afterwards, I checked the status and ID of the newly uploaded model in the model repository, as in the screenshot below.
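Something like this should also show the registered model ID and URI from the SDK side (just a minimal sketch, assuming the same project and model name as in the upload script above):
```
from clearml import Model

# Look up the registered model by project and name (values taken from the upload script above)
models = Model.query_models(project_name="LogSentinel", model_name="Output deeplog_bilstm")
for m in models:
    # m.id is the ID passed to `clearml-serving model add`; m.url is where the weights were uploaded
    print(m.id, m.name, m.url)
```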
- Then I shut down all clearml-serving dockers via `docker compose down --remove-orphans`, deleted all inference and serving tasks in the ClearML web UI, and checked that they disappeared from `clearml-serving list` too.
- Afterwards, I created a new clearml-serving service and copied its ID to the ENV file.
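For reference, these are roughly the commands I mean for that step (the service name is just an example, and `CLEARML_SERVING_TASK_ID` is the variable name from the clearml-serving docker example env file; adjust paths if yours differ):
```
# create a fresh serving service and note the task ID it prints
clearml-serving create --name "LogSentinel serving"

# point docker compose at that service by putting the ID into the env file
# (CLEARML_SERVING_TASK_ID is what the clearml-serving compose files read)
echo 'CLEARML_SERVING_TASK_ID=a26d8de575f34211ab9ed553a4b70c75' >> docker/example.env
```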
- Then, I added the model endpoint with `clearml-serving model add`, using the model ID I copied from the web UI:
```
clearml-serving model add --endpoint deepl_query --engine triton --model-id b43dbf85bcc0493688be8cd13c9d5e71 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Serving service Task a26d8de575f34211ab9ed553a4b70c75, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Updating serving service
```
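To double-check that the endpoint actually landed on the serving service, something like this should list it (using the serving service ID shown in the output above):
```
# list the model endpoints registered on the serving service
clearml-serving --id a26d8de575f34211ab9ed553a4b70c75 model list
```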
- Finally, I started the `clearml-serving-triton-gpu` docker container.
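For completeness, this is the kind of command I mean by that last step (file names as in the clearml-serving `docker/` directory; adjust if yours differ):
```
cd clearml-serving/docker
docker compose --env-file example.env -f docker-compose-triton-gpu.yml up -d
```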
And I am still getting the Triton error that it fails to retrieve the model ID, even though the model ID is the same as in the model repository and ClearML moved the model file to the target destination URI from the script above, so it should be in place:
```
2025-02-23 00:51:44
ClearML Task: overwriting (reusing) task id=33e6ebd811b041e489065b7f9877f8a9
2025-02-22 23:51:44,077 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page:
2025-02-23 00:51:44
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='a26d8de575f34211ab9ed553a4b70c75', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
String Triton Helper service
{'serving_id': 'a26d8de575f34211ab9ed553a4b70c75', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
Updating local model folder: /models
Error retrieving model ID b43dbf85bcc0493688be8cd13c9d5e71 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-23 00:51:45
Traceback (most recent call last):
File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module>
main()
File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
helper.maintenance_daemon(
File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1
```
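For what it's worth, here is a quick sketch of how one could check what Triton actually sees in its model repository, and whether the `file://` destination from the upload script is even visible inside the container (the container name here is an assumption based on the clearml-serving triton compose file, not something from my logs):
```
# list what ended up in the Triton model repository inside the serving container
# ("clearml-serving-triton" is an assumed container name -- check `docker ps` for the real one)
docker exec -it clearml-serving-triton ls -R /models

# the weights were uploaded to a file:// URI on the host, so also check
# whether that host path is visible from inside the container at all
docker exec -it clearml-serving-triton ls /home/<username>/models/
```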