Hi Guys, I Am Trying to Upload and Serve a Pre-Existing Third-Party PyTorch Model Inside My ClearML Cluster. However, After Proceeding with the Sequence of Operations Suggested by the Official Docs and Later Even GPT o3, I Am Having Errors Which I Cannot Solve.

Hi guys, I am trying to upload and serve a pre-existing third-party PyTorch model inside my ClearML cluster. However, after proceeding with the sequence of operations suggested by the official docs (and later even by GPT o3), I am getting errors I cannot solve. What I already did:

My infrastructure: I have two Linux PCs connected to each other. One hosts the ClearML server (x.x.x.69); the other (x.x.x.68) holds the model file itself (.pth) and has clearml-serving installed. The idea is to use the first as an orchestrator and the second as a GPU worker.

What I did after the connection between them was successfully tested:

  1. On my future worker (.68), I ran the following script:
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register remote model from .68')

model_file_path = "file:///10.14.158.68/home/lab-usr/logsentinel/deeplog-bestloss.pth" 

model = InputModel.import_model(
    name="deeplog_bilstm",
    weights_url=model_file_path,
    project="LogSentinel",
    framework="pytorch"
)

task.connect(model)

The model record appeared in the Model Registry in the ClearML web UI on the orchestrator (x.x.x.69), with the same model path as in the code above.
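As a sanity check (a minimal sketch — the model ID is a placeholder for the one shown in the registry, and this assumes the script runs on the machine that should consume the model), the registered URI can be verified to actually resolve:

from clearml import InputModel

# Placeholder ID: replace with the model ID from the ClearML registry.
model = InputModel(model_id="<model-id>")
print("Registered URI:", model.url)

# get_local_copy() resolves/downloads the weights; it fails if the
# registered URI is a file:// path that does not exist on this machine.
print("Local copy at:", model.get_local_copy())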

  2. Then, I created the clearml-serving instance:
clearml-serving create --name deeplog-inference-test --project LogSentinel

  3. Then, I added a new model to clearml-serving:
clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

And it appeared in the clearml-serving model list output.

  4. However, when I checked whether inference was actually running, I saw the serving instance failing over and over, with the following recurring output in the web UI console for the deeplog-inference-test serving instance:
2025-02-19 03:46:34
ClearML Task: overwriting (reusing) task id=a9e120f2784a4a028103a2227eae6eae
2025-02-19 02:46:34,515 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: 

2025-02-19 03:46:34
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='19a97c40f2114f138eeb7c11a49e64cf', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
String Triton Helper service
{'serving_id': '19a97c40f2114f138eeb7c11a49e64cf', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
Updating local model folder: /models
Error retrieving model ID 0c6a1c24067a49a0ac09c7e42c215b05 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-19 03:46:35
Traceback (most recent call last):
  File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module>
    main()
  File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
    helper.maintenance_daemon(
  File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
    raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1

What am I doing wrong or missing here, and how should I fix it?
Many thanks in advance!

  
  
Posted one month ago

16 Answers


AgitatedDove14 Please correct me if I am wrong: are you currently proposing the following sequence?

  • On the device that hosts the ClearML server, I should have my model file in some local directory.
  • Then, I should upload it directly to the ClearML model repository as an OutputModel?

Because today I tried to upload the model using the following script:

from clearml import Task, OutputModel

# Step 1: Initialize a Task
task = Task.init(project_name="LogSentinel", task_name="Upload and register Output DeepLog Model from .69 locally" 

# Step 2: Specify the local path to the model file
weights_filename = "/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"  

# Step 3: Create a new OutputModel and upload the weights
output_model = OutputModel(task=task, name="Output deeplog_bilstm")
output_model.set_upload_destination("file:///home/<username>/models/")
uploaded_uri = output_model.update_weights(weights_filename=weights_filename)

# Step 4: Publish the model
output_model.publish()
print(f"Model successfully registered. Uploaded URI: {uploaded_uri}")```

The model was registered with the following output:

python register_model.py

ClearML Task: created new task id=87619de0726d4b10afa13529b3789ffa

ClearML results page: 


2025-02-23 00:51:49,203 - clearml.Task - INFO - No repository found, storing script code instead

2025-02-23 00:51:49,738 - clearml.Task - INFO - Completed model upload to file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth

Model successfully registered. Uploaded URI: file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring

  • Afterwards, I checked the status and ID of the newly uploaded model in the model repository, as in the screenshot below.
  • Then I shut down all clearml-serving Docker containers via docker compose down --remove-orphans, deleted all inference and serving tasks in the ClearML web UI, and checked that they disappeared from the clearml-serving list too.
  • Afterwards, I created a new clearml-serving service and copied its ID to the .env file.
  • Then, I ran clearml-serving model add with the model ID I copied from the web UI:
clearml-serving model add --endpoint deepl_query --engine triton --model-id b43dbf85bcc0493688be8cd13c9d5e71 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Serving service Task a26d8de575f34211ab9ed553a4b70c75, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Updating serving service

  • Finally, I started the clearml-serving-triton-gpu Docker container.

And I am still getting the Triton error that it fails to retrieve the model ID — even though the model ID is the same as in the model repository, and ClearML moved the model file to the target destination URI from the script above, so the file should be in place:

2025-02-23 00:51:44
ClearML Task: overwriting (reusing) task id=33e6ebd811b041e489065b7f9877f8a9

2025-02-22 23:51:44,077 - clearml.Task - INFO - No repository found, storing script code instead

ClearML results page: 


2025-02-23 00:51:44
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='a26d8de575f34211ab9ed553a4b70c75', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)

String Triton Helper service
{'serving_id': 'a26d8de575f34211ab9ed553a4b70c75', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}

Updating local model folder: /models
Error retrieving model ID b43dbf85bcc0493688be8cd13c9d5e71 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-23 00:51:45
Traceback (most recent call last):
  File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module> 
main()
  File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
    helper.maintenance_daemon(
  File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
    raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1

image
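For what it's worth, a variant I am considering (a sketch only, assuming the default ClearML fileserver on the server machine at port 8081): pointing set_upload_destination at the fileserver URL instead of a local file:// path, so the registered URI is downloadable from any machine:

from clearml import Task, OutputModel

task = Task.init(project_name="LogSentinel", task_name="Register model via fileserver")

output_model = OutputModel(task=task, name="Output deeplog_bilstm")
# Upload to the ClearML fileserver (assumed default port 8081 on x.x.x.69)
# instead of a local file:// folder, so serving containers on other
# machines can actually download the weights.
output_model.set_upload_destination("http://x.x.x.69:8081")
uploaded_uri = output_model.update_weights(
    weights_filename="/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"
)
output_model.publish()
print(uploaded_uri)  # should now start with http://, not file://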

  
  
Posted one month ago

Hi AgitatedDove14 , I don't remember it well, as I initially installed ClearML about half a year ago, but as far as I remember I didn't preconfigure any specific queue.
HOWEVER, the first thing I did in the web UI was accidentally delete the "default" queue. Later, when my ClearML agents began to fail due to its absence, I had to use the API to create another queue named "my_default_queue" with the "default" system tag — that fixed it.

Here are the details from the INFO tab of the task in the screenshot:

ARCHIVED: No
CHANGED AT: Feb 20 2025 22:39
LAST ITERATION: N/A
STATUS MESSAGE: N/A
STATUS REASON: N/A
CREATED AT: Nov 25 2024 2:19
STARTED AT: Nov 25 2024 2:24
LAST UPDATED AT: Feb 21 2025 8:40
COMPLETED AT: N/A
RUN TIME: 88:06d
QUEUE: my_default_queue
WORKER: lab03:gpuall
PARENT TASK: N/A
PROJECT: LogSentinel
ID: 30fb54845e2345358a4701c117cb43b0
VERSION: 1.3.0
  
  
Posted one month ago

Ok, SuccessfulKoala55 , I was able to track down one of the incorrect parts of my serving setup:

  • PyTorch model inference requires a .env file and the clearml-serving-triton-gpu Docker container configured and running.
  • Configuring the .env file requires the clearml-serving service ID, which was created by clearml-serving create.
  • I have multiple services created via that command, as there is no command to remove the others, only to create additional ones.
  • I found the serving service (and its ID) that is actually bound to run models, and it behaves differently — no messages about failing to find models.
  • BUT INSTEAD: it fails on Kafka, which for some reason runs by default and waits for brokers, clients, etc. Nothing like that was discussed in the docs or the clearml-serving tutorial, so now I am even more confused, to be honest. I didn't create or configure any endpoints or connections to Kafka and related services — I didn't modify the contents of the clearml-serving-triton docker compose files at all, only the .env file (a sketch of it follows this list).
  • Also, when I did this and restarted the triton-serving container, the running inference tasks multiplied for some reason. Now I have many duplicates, which do not stop from the web UI, and there seems to be no way to remove them using the same web UI... They also seem misconfigured: they either have no endpoint or model attached, or they have a model, but the erratic one from 3 months ago. I listed earlier the only commands I used to create the serving services and add models to serving.
    Screenshots will be attached, as well as the logs.
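For reference, this is roughly what my .env looks like (a sketch — keys and hosts are placeholders; CLEARML_SERVING_TASK_ID is the variable the clearml-serving docker compose files read, and the ports are the ClearML server defaults):

# .env sketch — values are placeholders
CLEARML_WEB_HOST="http://x.x.x.69:8080"
CLEARML_API_HOST="http://x.x.x.69:8008"
CLEARML_FILES_HOST="http://x.x.x.69:8081"
CLEARML_API_ACCESS_KEY="<access key>"
CLEARML_API_SECRET_KEY="<secret key>"
CLEARML_SERVING_TASK_ID="<ID printed by clearml-serving create>"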

Serving task (the one with the globe icon in the UI):

INFO Executing: ['docker', 'run', '-t', '-e', 'CLEARML_WORKER_ID=lab03:gpuall', '-e', 'CLEARML_DOCKER_IMAGE=', '-v', '/tmp/.clearml_agent.djxlonux.cfg:/root/clearml.conf', '-v', '/root/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/root/.clearml/pip-cache:/root/.cache/pip', '-v', '/root/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/root/.clearml/cache:/clearml_agent_cache', '-v', '/root/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', '', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent==0.17.1 ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring  --id 30fb54845e2345358a4701c117cb43b0']
1732496915556 lab03:gpuall DEBUG docker: invalid reference format.
See 'docker run --help'.

What did I do wrong, and why did restarting the clearml-serving-triton docker compose stack produce even more service tasks? :D
image

  
  
Posted one month ago

SuccessfulKoala55 Also, there's one more thing that is bugging me: I have my model files on a remote host in the same LAN (the .68 machine), and I am trying to push them to the model storage of the ClearML server (the .69 machine).

But as far as I understand, I must provide either a URL or a local path to the model file in order for the ClearML SDK to send it to the server machine. So I provide the absolute local path on my .68 device.

However, when I open the model storage on .69 and choose my uploaded model, it gives me a file:/// link, which is the LOCAL path to the file on .68 — there are no such folders on .69. So I don't understand where it actually stores the models, or how it downloads them into storage...

Example :

  1. On .68 my model file lies in /home/username/modelfiles/model.pth .
    When I upload this via a Python script as an InputModel from .68 to .69, it shows no errors whatsoever.

  2. But on the .69 ClearML server, the model storage path looks like this: file:///home/username/modelfiles/model.pth .
    So, no remote IP of x.x.x.68 whatsoever.

  3. I tried to re-upload the model using the path x.x.x.68/home/username/modelfiles/model.pth , and it also didn't show any errors, giving file:///x.x.x.68/home/username/modelfiles/model.pth .

But which of them is actually correct and functioning, I don't know... Should I move my model file manually to the .69 machine where the ClearML server is?
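One thing I checked while writing this — a quick sketch showing how the file:/// form is parsed: with three slashes the host part of the URL is empty, so the IP becomes the first directory of a local path; file:// URLs never reach over the network:

from urllib.parse import urlparse

u = urlparse("file:///10.14.158.68/home/username/modelfiles/model.pth")
print(u.netloc)  # '' -> no remote host in the URL at all
print(u.path)    # '/10.14.158.68/home/...' -> treated as a local directory path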

  
  
Posted one month ago

Also, AgitatedDove14 , thank you very much for your advice regarding archiving — I did that: removed all current clearml-serving services, created a new one, attached its ID to the .env file, stopped all running serving containers, and then restarted the clearml-serving-triton-gpu container, adding a model file afterwards.

I don't see any docker run errors in the ClearML web UI task consoles now, but serving is still not able to locate the model file itself, even though that file is listed in the model repository — please take a look at the screenshots.
[six screenshots attached]

  
  
Posted one month ago

My model files are also there, just placed in some usual non-shared linux directory.

So this is the issue: how would the container get to these models? You either need to mount the folder into the container, or push them to the ClearML model repo with the OutputModel class. Does that make sense?
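If you go the mount route, it is roughly this shape (a sketch — the service name is taken from the triton compose file and the paths are illustrative, adjust to your layout):

# docker compose override sketch: bind-mount the host model directory
# into the serving container so local paths resolve inside it.
services:
  clearml-serving-triton:
    volumes:
      - /home/lab-usr/logsentinel:/home/lab-usr/logsentinel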

  
  
Posted one month ago

Hmm, I just noticed:

'--rm', '', 'bash'

This is odd — an extra argument is being passed as "empty text". How did that end up there? Could it be you did not provide any Docker image or default Docker container?

  
  
Posted one month ago

Also, I accidentally created multiple services via

clearml-serving create --name <> --project <>

, and cannot get rid of them.

Find them in the UI (go to All Projects, then put their IDs in the search bar) and archive / delete them.
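If the UI is awkward for this, the same cleanup can be scripted (a sketch — Task.get_tasks / mark_stopped / delete are standard SDK calls; the project and name filters are assumptions about how the services were created):

from clearml import Task

# Find leftover serving controller tasks (adjust the filters to match
# the names passed to `clearml-serving create`).
stale = Task.get_tasks(project_name="LogSentinel", task_name="deeplog-inference-test")
for t in stale:
    print("removing", t.id, t.name)
    t.mark_stopped()  # running tasks must be stopped before deletion
    t.delete()        # permanently removes the task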

So the part that is confusing to me is: None
Who / how is this Task running? Did you also set up a "services" queue (as part of the clearml-server installation)? What do you see under the "Info" tab?
image

  
  
Posted one month ago

Also, I tested the reachability of an endpoint with a curl query adapted from the example in the ClearML Serving tutorial: https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_tutorial ,
and it returns error 405: Method Not Allowed:

curl -X POST -H "accept: application/json" -H "Content-Type: application/json" -d '{"log_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}' 


<html>
<head><title>405 Not Allowed</title></head>
<body>
<center><h1>405 Not Allowed</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>

However, the initial clearml-serving setup guide ( https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_setup ), which I followed, gives slightly different instructions from the tutorial linked above:
the tutorial has docker build and run steps,
while the setup guide has only the clearml-serving service creation and docker compose up. And when trying to docker run using the full command from the link above, it fails with "unable to find image clearml-serving-inference locally; pull access denied for clearml-serving-inference, repository does not exist".

Yet, the default docker compose up from the clearml-serving/docker directory somehow runs a clearml-serving-inference container too. And it still doesn't accept the curl requests to the endpoints.
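For comparison, the request shape I believe the tutorial intends (a sketch — host, port, and path assume the default clearml-serving-inference container listening on 8080 and the deepl_query endpoint added earlier):

curl -X POST "http://x.x.x.68:8080/serve/deepl_query" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d '{"log_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}'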

  
  
Posted one month ago

but now serving is not able to locate the model file itself,

From your screenshot the file seems to be in a local folder somewhere ("file://"); it should be on the file server or in object storage. How did it get there? How is the file server configured?
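For reference, the bit to check is the files_server entry in clearml.conf on the machine that registers the models (a sketch — the host is a placeholder for your server's address):

# ~/clearml.conf (sketch)
api {
    # all three URLs must point at the ClearML server, not at localhost
    web_server: http://x.x.x.69:8080
    api_server: http://x.x.x.69:8008
    files_server: http://x.x.x.69:8081
}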

  
  
Posted one month ago

Hi PungentRobin32 , I think the issue is that it's trying to retrieve the wrong model ID

  
  
Posted one month ago

Can you share the output for the clearml-serving model add command?

  
  
Posted one month ago

Hi SuccessfulKoala55 , yeah, sure — please wait a sec, I will rerun the command. :)

Here's the command and output:

clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99


clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Warning: more than one valid Controller Tasks found, using Task ID=ccb7bafba16e416ba5590ca717f05de0
Serving service Task ccb7bafba16e416ba5590ca717f05de0, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=ce7bbe44e5dead79f03e9ca8e28d45a6
Warning: Model endpoint 'deepl_query' overwritten
Updating serving service

Note: I would gladly avoid Triton, as it requires parameters I don't even understand, but there seems to be no other option for running PyTorch or other neural-network models.
Also, GPT suggested that the model file itself needs some preprocessing to convert it from PTH to something called ONNX, but I have no idea what that is or whether it is actually needed.
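From what I gathered, the conversion would look roughly like this (a sketch — DeepLogStub below is a placeholder architecture; the real DeepLog BiLSTM class must be used instead, or load_state_dict will not match the checkpoint keys):

import torch
import torch.nn as nn

# Placeholder architecture standing in for the real DeepLog BiLSTM.
class DeepLogStub(nn.Module):
    def __init__(self, input_size=1, hidden=64, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden * 2, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

model = DeepLogStub()
# model.load_state_dict(torch.load("deeplog-bestloss.pth", map_location="cpu"))
model.eval()

dummy = torch.zeros(1, 7, 1)  # one 7x1 sequence, per --input-size 7 1
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["layer_0"],    # matches --input-name
    output_names=["layer_99"],  # matches --output-name
)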

  
  
Posted one month ago

Hi AgitatedDove14 , the host OS is Ubuntu; I connect there via SSH.

The Docker Compose in use is v2 (the one invoked as "docker compose" instead of the older "docker-compose").

I did not pass anything to or from Docker manually; I only used the commands from the official clearml-serving guide:

pip install clearml-serving

clearml-serving create --name deeplog-inference-test --project LogSentinel

git clone 


nano .env # here I added my ClearML URLs and credentials 

docker compose -f --env-file .env clearml-serving-triton-gpu.yml up -d

clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

The only thing I ever did to the clearml-serving containers afterwards was docker compose down and up again.
Also, I accidentally created multiple services via clearml-serving create --name <> --project <> , and cannot get rid of them.
They point either to the wrong model or to no model at all: I have only one model visible via clearml-serving model list — the one created by my command above — while in the web UI they point to nothing, or to an extremely old, different model file...
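Looking at the compose line above again: the -f flag there is followed by --env-file rather than by the YAML file. The form the clearml-serving docs use is along these lines (assuming the compose file lives under the repo's docker/ directory):

docker compose --env-file .env -f docker/docker-compose-triton-gpu.yml up -d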

  
  
Posted one month ago

AgitatedDove14 The ClearML server itself and all of its components (API server etc.) are on the x.x.x.69 machine.
Agents and serving are on the x.x.x.68 worker machine. My model files are also there, just placed in an ordinary non-shared Linux directory.
And I didn't do any specific configuration of the ClearML fileserver container — everything is at its defaults, without a single line changed except the IP address of the ClearML server.

I tried a couple of approaches to upload my pre-existing models into ClearML:

  1. To send them directly from .68 via the following script:
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register remote model from .68')

model_file_path = "file:///10.14.158.68/home/lab-usr/logsentinel/deeplog-bestloss.pth

model = InputModel.import_model(
    name="deeplog_bilstm",
    weights_url=model_file_path,
    project="LogSentinel",
    framework="pytorch"
)

task.connect(model)

It registers the model without any visible errors, and it appears in the model repository.

  2. To copy the model .pth file itself to the .69 machine, then run the script for LOCAL model file upload:
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register model')

model_file_path = "file:///home/lab-usr/logsentinel/deeplog-bestloss.pth

model = InputModel.import_model(name="deeplog_bilstm", weights_url=model_file_path, project="LogSentinel", framework="pytorch")

task.connect(model)

It registers it in model storage, also with no errors, but neither approach works when clearml-serving is directed to use the models via clearml-serving model add : Triton serving fails with an error that it cannot find the model file, and requests to the endpoint return "error 405 — Method Not Allowed".

  
  
Posted one month ago

Hi PungentRobin32

1732496915556 lab03:gpuall DEBUG docker: invalid reference format.

So it seems the docker command is incorrect?! The error you are seeing is the agent failing to spin up the Docker container. What's the OS of the host machine?

  
  
Posted one month ago