Hi guys, I am trying to upload and serve a pre-existing 3rd-party PyTorch model inside my ClearML cluster. However, after following the sequence of operations suggested by the official docs (and later even by GPT o3), I am getting errors which I cannot solve.
AgitatedDove14 Please correct me if I am wrong: are you currently proposing the following sequence:
- On the device that hosts the ClearML server, I should have my model file in any directory.
- Then, I should upload it directly to the ClearML model repository as an `OutputModel`?
Because today I tried to upload the model using the following script:
```
from clearml import Task, OutputModel
# Step 1: Initialize a Task
task = Task.init(project_name="LogSentinel", task_name="Upload and register Output DeepLog Model from .69 locally")
# Step 2: Specify the local path to the model file
weights_filename = "/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"
# Step 3: Create a new OutputModel and upload the weights
output_model = OutputModel(task=task, name="Output deeplog_bilstm")
output_model.set_upload_destination("file:///home/<username>/models/")
uploaded_uri = output_model.update_weights(weights_filename=weights_filename)
# Step 4: Publish the model
output_model.publish()
print(f"Model successfully registered. Uploaded URI: {uploaded_uri}")```
The model was registered with the following output:
```
python register_model.py
ClearML Task: created new task id=87619de0726d4b10afa13529b3789ffa
ClearML results page:
2025-02-23 00:51:49,203 - clearml.Task - INFO - No repository found, storing script code instead
2025-02-23 00:51:49,738 - clearml.Task - INFO - Completed model upload to file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth
Model successfully registered. Uploaded URI: file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
```
- Afterwards, I checked the status and ID of the newly uploaded model in the model repository, as in the screenshot below.
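Something like this should also show the registered model ID and URI from the SDK side (just a minimal sketch, assuming the same project and model name as in the upload script above):
```
from clearml import Model

# Look up the registered model by project and name (values taken from the upload script above)
models = Model.query_models(project_name="LogSentinel", model_name="Output deeplog_bilstm")
for m in models:
    # m.id is the ID passed to `clearml-serving model add`; m.url is where the weights were uploaded
    print(m.id, m.name, m.url)
```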
- Then I shut down all clearml-serving dockers via `docker compose down --remove-orphans`, deleted all inference and serving tasks in the ClearML web UI, and checked that they disappeared from `clearml-serving list` too.
- Afterwards, I created a new clearml-serving service and copied its ID to the ENV file.
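For reference, these are roughly the commands I mean for that step (the service name is just an example, and `CLEARML_SERVING_TASK_ID` is the variable name from the clearml-serving docker example env file; adjust paths if yours differ):
```
# create a fresh serving service and note the task ID it prints
clearml-serving create --name "LogSentinel serving"

# point docker compose at that service by putting the ID into the env file
# (CLEARML_SERVING_TASK_ID is what the clearml-serving compose files read)
echo 'CLEARML_SERVING_TASK_ID=a26d8de575f34211ab9ed553a4b70c75' >> docker/example.env
```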
- Then, I added the model endpoint with `clearml-serving model add`, using the model ID I copied from the web UI:
```
clearml-serving model add --endpoint deepl_query --engine triton --model-id b43dbf85bcc0493688be8cd13c9d5e71 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Serving service Task a26d8de575f34211ab9ed553a4b70c75, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Updating serving service
```
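To double-check that the endpoint actually landed on the serving service, something like this should list it (using the serving service ID shown in the output above):
```
# list the model endpoints registered on the serving service
clearml-serving --id a26d8de575f34211ab9ed553a4b70c75 model list
```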
- Finally, I started the `clearml-serving-triton-gpu` docker container.
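For completeness, this is the kind of command I mean by that last step (file names as in the clearml-serving `docker/` directory; adjust if yours differ):
```
cd clearml-serving/docker
docker compose --env-file example.env -f docker-compose-triton-gpu.yml up -d
```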
And I am still getting the Triton error that it fails to retrieve the model ID, even though the model ID is the same as in the model repository and ClearML moved the model file to the target destination URI from the script above, so it should be in place:
```
2025-02-23 00:51:44
ClearML Task: overwriting (reusing) task id=33e6ebd811b041e489065b7f9877f8a9
2025-02-22 23:51:44,077 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page:
2025-02-23 00:51:44
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='a26d8de575f34211ab9ed553a4b70c75', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
String Triton Helper service
{'serving_id': 'a26d8de575f34211ab9ed553a4b70c75', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
Updating local model folder: /models
Error retrieving model ID b43dbf85bcc0493688be8cd13c9d5e71 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-23 00:51:45
Traceback (most recent call last):
File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module>
main()
File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
helper.maintenance_daemon(
File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1
```
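For what it's worth, here is a quick sketch of how one could check what Triton actually sees in its model repository, and whether the `file://` destination from the upload script is even visible inside the container (the container name here is an assumption based on the clearml-serving triton compose file, not something from my logs):
```
# list what ended up in the Triton model repository inside the serving container
# ("clearml-serving-triton" is an assumed container name -- check `docker ps` for the real one)
docker exec -it clearml-serving-triton ls -R /models

# the weights were uploaded to a file:// URI on the host, so also check
# whether that host path is visible from inside the container at all
docker exec -it clearml-serving-triton ls /home/<username>/models/
```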