Hi Guys, I Am Trying to Upload and Serve a Pre-Existing Third-Party PyTorch Model Inside My ClearML Cluster. However, After Proceeding with the Sequence of Operations Suggested by the Official Docs and Later Even GPT o3, I Am Having Errors Which I Cannot Solve.

Hi guys, I am trying to upload and serve a pre-existing third-party PyTorch model inside my ClearML cluster. However, after proceeding with the sequence of operations suggested by the official docs (and later even by GPT o3), I am getting errors I cannot solve. What I already did:

My infrastructure: I have two Linux PCs connected to each other. One hosts the ClearML server (x.x.x.69); the other (x.x.x.68) holds the model file itself (.pth) and has clearml-serving installed. The idea is to use the first as an orchestrator and the second as a GPU worker.

What I did after the connection between them was successfully tested:

  1. On my future worker (.68), I ran the following script:
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register remote model from .68')

model_file_path = "file:///10.14.158.68/home/lab-usr/logsentinel/deeplog-bestloss.pth" 

model = InputModel.import_model(
    name="deeplog_bilstm",
    weights_url=model_file_path,
    project="LogSentinel",
    framework="pytorch"
)

task.connect(model)

The model record appeared in the Model Registry in the ClearML web UI on the orchestrator (x.x.x.69), with the same model path as in the code above.
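As a sanity check (a minimal sketch — the model ID is a placeholder for the one shown in the registry, and this assumes the script runs on the machine that should consume the model), the registered URI can be verified to actually resolve:

from clearml import InputModel

# Placeholder ID: replace with the model ID from the ClearML registry.
model = InputModel(model_id="<model-id>")
print("Registered URI:", model.url)

# get_local_copy() resolves/downloads the weights; it fails if the
# registered URI is a file:// path that does not exist on this machine.
print("Local copy at:", model.get_local_copy())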

  2. Then, I created the clearml-serving instance:
clearml-serving create --name deeplog-inference-test --project LogSentinel

  3. Then, I added a new model to clearml-serving:
clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

And it appeared in the clearml-serving model list output.

  4. However, when I checked whether inference was actually running, I saw the serving instance failing over and over, with the following recurring output in the web UI console for the deeplog-inference-test serving instance:
2025-02-19 03:46:34
ClearML Task: overwriting (reusing) task id=a9e120f2784a4a028103a2227eae6eae
2025-02-19 02:46:34,515 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page: 

2025-02-19 03:46:34
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='19a97c40f2114f138eeb7c11a49e64cf', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
String Triton Helper service
{'serving_id': '19a97c40f2114f138eeb7c11a49e64cf', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
Updating local model folder: /models
Error retrieving model ID 0c6a1c24067a49a0ac09c7e42c215b05 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-19 03:46:35
Traceback (most recent call last):
  File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module>
    main()
  File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
    helper.maintenance_daemon(
  File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
    raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1

What am I doing wrong or missing here, and how should I fix it?
Many thanks in advance!

  
  
Posted one month ago

16 Answers


AgitatedDove14 Please correct me if I am wrong: are you currently proposing the following sequence?

  • On the device that hosts the ClearML server, I should have my model file in some local directory.
  • Then, I should upload it directly to the ClearML model repository as an OutputModel?

Because today I tried to upload the model using the following script:

from clearml import Task, OutputModel

# Step 1: Initialize a Task
task = Task.init(project_name="LogSentinel", task_name="Upload and register Output DeepLog Model from .69 locally" 

# Step 2: Specify the local path to the model file
weights_filename = "/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"  

# Step 3: Create a new OutputModel and upload the weights
output_model = OutputModel(task=task, name="Output deeplog_bilstm")
output_model.set_upload_destination("file:///home/<username>/models/")
uploaded_uri = output_model.update_weights(weights_filename=weights_filename)

# Step 4: Publish the model
output_model.publish()
print(f"Model successfully registered. Uploaded URI: {uploaded_uri}")```

The model was registered with the following output:

python register_model.py

ClearML Task: created new task id=87619de0726d4b10afa13529b3789ffa

ClearML results page: 


2025-02-23 00:51:49,203 - clearml.Task - INFO - No repository found, storing script code instead

2025-02-23 00:51:49,738 - clearml.Task - INFO - Completed model upload to file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth

Model successfully registered. Uploaded URI: file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth

ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring

  • Afterwards, I checked the status and ID of the newly uploaded model in the model repository, as in the screenshot below.
  • Then I shut down all clearml-serving Docker containers via docker compose down --remove-orphans, deleted all inference and serving tasks in the ClearML web UI, and checked that they disappeared from the clearml-serving list too.
  • Afterwards, I created a new clearml-serving service and copied its ID to the .env file.
  • Then, I ran clearml-serving model add with the model ID I copied from the web UI:
clearml-serving model add --endpoint deepl_query --engine triton --model-id b43dbf85bcc0493688be8cd13c9d5e71 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Serving service Task a26d8de575f34211ab9ed553a4b70c75, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Updating serving service

  • Finally, I started the clearml-serving-triton-gpu Docker container.

And I am still getting the Triton error that it fails to retrieve the model ID — even though the model ID is the same as in the model repository, and ClearML moved the model file to the target destination URI from the script above, so the file should be in place:

2025-02-23 00:51:44
ClearML Task: overwriting (reusing) task id=33e6ebd811b041e489065b7f9877f8a9

2025-02-22 23:51:44,077 - clearml.Task - INFO - No repository found, storing script code instead

ClearML results page: 


2025-02-23 00:51:44
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='a26d8de575f34211ab9ed553a4b70c75', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)

String Triton Helper service
{'serving_id': 'a26d8de575f34211ab9ed553a4b70c75', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}

Updating local model folder: /models
Error retrieving model ID b43dbf85bcc0493688be8cd13c9d5e71 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-23 00:51:45
Traceback (most recent call last):
  File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module> 
main()
  File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
    helper.maintenance_daemon(
  File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
    raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1

image
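For what it's worth, a variant I am considering (a sketch only, assuming the default ClearML fileserver on the server machine at port 8081): pointing set_upload_destination at the fileserver URL instead of a local file:// path, so the registered URI is downloadable from any machine:

from clearml import Task, OutputModel

task = Task.init(project_name="LogSentinel", task_name="Register model via fileserver")

output_model = OutputModel(task=task, name="Output deeplog_bilstm")
# Upload to the ClearML fileserver (assumed default port 8081 on x.x.x.69)
# instead of a local file:// folder, so serving containers on other
# machines can actually download the weights.
output_model.set_upload_destination("http://x.x.x.69:8081")
uploaded_uri = output_model.update_weights(
    weights_filename="/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"
)
output_model.publish()
print(uploaded_uri)  # should now start with http://, not file://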

  
  
Posted one month ago

Hi AgitatedDove14 , I don't remember it well, as I initially installed ClearML about half a year ago, but as far as I remember I didn't preconfigure any specific queue.
HOWEVER, the first thing I did in the web UI was accidentally delete the "default" queue. Later, when my ClearML agents began to fail due to its absence, I had to use the API to create another queue named "my_default_queue" with the "default" system tag — that fixed it.

Here are the details from the INFO tab of the task in the screenshot:

ARCHIVED: No
CHANGED AT: Feb 20 2025 22:39
LAST ITERATION: N/A
STATUS MESSAGE: N/A
STATUS REASON: N/A
CREATED AT: Nov 25 2024 2:19
STARTED AT: Nov 25 2024 2:24
LAST UPDATED AT: Feb 21 2025 8:40
COMPLETED AT: N/A
RUN TIME: 88:06d
QUEUE: my_default_queue
WORKER: lab03:gpuall
PARENT TASK: N/A
PROJECT: LogSentinel
ID: 30fb54845e2345358a4701c117cb43b0
VERSION: 1.3.0
  
  
Posted one month ago

Ok, SuccessfulKoala55 , I was able to track down one of the incorrect parts of my serving setup:

  • PyTorch model inference requires a .env file and the clearml-serving-triton-gpu Docker container configured and running.
  • Configuring the .env file requires the clearml-serving service ID, which was created by clearml-serving create.
  • I have multiple services created via that command, as there is no command to remove the others, only to create additional ones.
  • I found the serving service (and its ID) that is actually bound to run models, and it behaves differently — no messages about failing to find models.
  • BUT INSTEAD: it fails on Kafka, which for some reason runs by default and waits for brokers, clients, etc. Nothing like that was discussed in the docs or the clearml-serving tutorial, so now I am even more confused, to be honest. I didn't create or configure any endpoints or connections to Kafka and related services — I didn't modify the contents of the clearml-serving-triton docker compose files at all, only the .env file (a sketch of it follows this list).
  • Also, when I did this and restarted the triton-serving container, the running inference tasks multiplied for some reason. Now I have many duplicates, which do not stop from the web UI, and there seems to be no way to remove them using the same web UI... They also seem misconfigured: they either have no endpoint or model attached, or they have a model, but the erratic one from 3 months ago. I listed earlier the only commands I used to create the serving services and add models to serving.
    Screenshots will be attached, as well as the logs.
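For reference, this is roughly what my .env looks like (a sketch — keys and hosts are placeholders; CLEARML_SERVING_TASK_ID is the variable the clearml-serving docker compose files read, and the ports are the ClearML server defaults):

# .env sketch — values are placeholders
CLEARML_WEB_HOST="http://x.x.x.69:8080"
CLEARML_API_HOST="http://x.x.x.69:8008"
CLEARML_FILES_HOST="http://x.x.x.69:8081"
CLEARML_API_ACCESS_KEY="<access key>"
CLEARML_API_SECRET_KEY="<secret key>"
CLEARML_SERVING_TASK_ID="<ID printed by clearml-serving create>"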

Serving task (the one with the globe icon in the UI):

INFO Executing: ['docker', 'run', '-t', '-e', 'CLEARML_WORKER_ID=lab03:gpuall', '-e', 'CLEARML_DOCKER_IMAGE=', '-v', '/tmp/.clearml_agent.djxlonux.cfg:/root/clearml.conf', '-v', '/root/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/root/.clearml/pip-cache:/root/.cache/pip', '-v', '/root/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/root/.clearml/cache:/clearml_agent_cache', '-v', '/root/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', '', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent==0.17.1 ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring  --id 30fb54845e2345358a4701c117cb43b0']
1732496915556 lab03:gpuall DEBUG docker: invalid reference format.
See 'docker run --help'.

What did I do wrong, and why did restarting the clearml-serving-triton docker compose stack produce even more service tasks? :D
image

  
  
Posted one month ago

SuccessfulKoala55 Also, there's one more thing that is bugging me: I have my model files on a remote host in the same LAN (the .68 machine), and I am trying to push them to the model storage of the ClearML server (the .69 machine).

But as far as I understand, I must provide either a URL or a local path to the model file in order for the ClearML SDK to send it to the server machine. So I provide the absolute local path on my .68 device.

However, when I open the model storage on .69 and choose my uploaded model, it gives me a file:/// link, which is the LOCAL path to the file on .68 — there are no such folders on .69. So I don't understand where it actually stores the models, or how it downloads them into storage...

Example :

  1. On .68 my model file lies in /home/username/modelfiles/model.pth .
    When I upload this via a Python script as an InputModel from .68 to .69, it shows no errors whatsoever.

  2. But on the .69 ClearML server, the model storage path looks like this: file:///home/username/modelfiles/model.pth .
    So, no remote IP of x.x.x.68 whatsoever.

  3. I tried to re-upload the model using the path x.x.x.68/home/username/modelfiles/model.pth , and it also didn't show any errors, giving file:///x.x.x.68/home/username/modelfiles/model.pth .

But which of them is actually correct and functioning, I don't know... Should I move my model file manually to the .69 machine where the ClearML server is?
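One thing I checked while writing this — a quick sketch showing how the file:/// form is parsed: with three slashes the host part of the URL is empty, so the IP becomes the first directory of a local path; file:// URLs never reach over the network:

from urllib.parse import urlparse

u = urlparse("file:///10.14.158.68/home/username/modelfiles/model.pth")
print(u.netloc)  # '' -> no remote host in the URL at all
print(u.path)    # '/10.14.158.68/home/...' -> treated as a local directory path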

  
  
Posted one month ago

Also, AgitatedDove14 , thank you very much for your advice regarding archiving — I did that: removed all current clearml-serving services, created a new one, attached its ID to the .env file, stopped all running serving containers, and then restarted the clearml-serving-triton-gpu container, adding a model file afterwards.

I don't see any docker run errors in the ClearML web UI task consoles now, but serving is still not able to locate the model file itself, even though that file is listed in the model repository — please take a look at the screenshots.
[six screenshots attached]

  
  
Posted one month ago

My model files are also there, just placed in some usual non-shared linux directory.

So this is the issue: how would the container get to these models? You either need to mount the folder into the container, or push them to the ClearML model repo with the OutputModel class. Does that make sense?
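If you go the mount route, it is roughly this shape (a sketch — the service name is taken from the triton compose file and the paths are illustrative, adjust to your layout):

# docker compose override sketch: bind-mount the host model directory
# into the serving container so local paths resolve inside it.
services:
  clearml-serving-triton:
    volumes:
      - /home/lab-usr/logsentinel:/home/lab-usr/logsentinel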

  
  
Posted one month ago

Hmm, I just noticed:

'--rm', '', 'bash'

This is odd — an extra argument is being passed as "empty text". How did that end up there? Could it be you did not provide any Docker image or default Docker container?

  
  
Posted one month ago

Also, I accidentally created multiple services via

clearml-serving create --name <> --project <>

, and cannot get rid of them.

Find them in the UI (go to All Projects, then put their IDs in the search bar) and archive / delete them.
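If the UI is awkward for this, the same cleanup can be scripted (a sketch — Task.get_tasks / mark_stopped / delete are standard SDK calls; the project and name filters are assumptions about how the services were created):

from clearml import Task

# Find leftover serving controller tasks (adjust the filters to match
# the names passed to `clearml-serving create`).
stale = Task.get_tasks(project_name="LogSentinel", task_name="deeplog-inference-test")
for t in stale:
    print("removing", t.id, t.name)
    t.mark_stopped()  # running tasks must be stopped before deletion
    t.delete()        # permanently removes the task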

So the part that is confusing to me is: None
Who / how is this Task running? Did you also set up a "services" queue (as part of the clearml-server installation)? What do you see under the "Info" tab?
image

  
  
Posted one month ago

Also, I tested the reachability of an endpoint with a curl query adapted from the example in the ClearML Serving tutorial: https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_tutorial ,
and it returns error 405: Method Not Allowed:

curl -X POST -H "accept: application/json" -H "Content-Type: application/json" -d '{"log_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}' 


<html>
<head><title>405 Not Allowed</title></head>
<body>
<center><h1>405 Not Allowed</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>

However, the initial clearml-serving setup guide ( https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_setup ), which I followed, gives slightly different instructions from the tutorial linked above:
the tutorial has docker build and run steps,
while the setup guide has only the clearml-serving service creation and docker compose up. And when trying to docker run using the full command from the link above, it fails with "unable to find image clearml-serving-inference locally; pull access denied for clearml-serving-inference, repository does not exist".

Yet, the default docker compose up from the clearml-serving/docker directory somehow runs a clearml-serving-inference container too. And it still doesn't accept the curl requests to the endpoints.
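For comparison, the request shape I believe the tutorial intends (a sketch — host, port, and path assume the default clearml-serving-inference container listening on 8080 and the deepl_query endpoint added earlier):

curl -X POST "http://x.x.x.68:8080/serve/deepl_query" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d '{"log_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}'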

  
  
Posted one month ago

but now serving is not able to locate the model file itself,

From your screenshot the file seems to be in a local folder somewhere ("file://"); it should be on the file server or in object storage. How did it get there? How is the file server configured?
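For reference, the bit to check is the files_server entry in clearml.conf on the machine that registers the models (a sketch — the host is a placeholder for your server's address):

# ~/clearml.conf (sketch)
api {
    # all three URLs must point at the ClearML server, not at localhost
    web_server: http://x.x.x.69:8080
    api_server: http://x.x.x.69:8008
    files_server: http://x.x.x.69:8081
}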

  
  
Posted one month ago

Hi PungentRobin32 , I think the issue is that it's trying to retrieve the wrong model ID

  
  
Posted one month ago

Can you share the output for the clearml-serving model add command?

  
  
Posted one month ago

Hi SuccessfulKoala55 , yeah, sure — please wait a sec, I will rerun the command. :)

Here's the command and output:

clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99


clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Warning: more than one valid Controller Tasks found, using Task ID=ccb7bafba16e416ba5590ca717f05de0
Serving service Task ccb7bafba16e416ba5590ca717f05de0, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=ce7bbe44e5dead79f03e9ca8e28d45a6
Warning: Model endpoint 'deepl_query' overwritten
Updating serving service

Note: I would gladly avoid Triton, as it requires parameters I don't even understand, but there seems to be no other option for running PyTorch or other neural-network models.
Also, GPT suggested that the model file itself needs some preprocessing to convert it from PTH to something called ONNX, but I have no idea what that is or whether it is actually needed.
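From what I gathered, the conversion would look roughly like this (a sketch — DeepLogStub below is a placeholder architecture; the real DeepLog BiLSTM class must be used instead, or load_state_dict will not match the checkpoint keys):

import torch
import torch.nn as nn

# Placeholder architecture standing in for the real DeepLog BiLSTM.
class DeepLogStub(nn.Module):
    def __init__(self, input_size=1, hidden=64, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden * 2, num_classes)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])

model = DeepLogStub()
# model.load_state_dict(torch.load("deeplog-bestloss.pth", map_location="cpu"))
model.eval()

dummy = torch.zeros(1, 7, 1)  # one 7x1 sequence, per --input-size 7 1
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["layer_0"],    # matches --input-name
    output_names=["layer_99"],  # matches --output-name
)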

  
  
Posted one month ago

Hi AgitatedDove14 , the host OS is Ubuntu; I connect there via SSH.

The Docker Compose in use is v2 (the one invoked as "docker compose" instead of the older "docker-compose").

I did not pass anything to or from Docker manually; I only used the commands from the official clearml-serving guide:

pip install clearml-serving

clearml-serving create --name deeplog-inference-test --project LogSentinel

git clone 


nano .env # here I added my ClearML URLs and credentials 

docker compose -f --env-file .env clearml-serving-triton-gpu.yml up -d

clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99

The only thing I ever did to the clearml-serving containers afterwards was docker compose down and up again.
Also, I accidentally created multiple services via clearml-serving create --name <> --project <> , and cannot get rid of them.
They point either to the wrong model or to no model at all: I have only one model visible via clearml-serving model list — the one created by my command above — while in the web UI they point to nothing, or to an extremely old, different model file...
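Looking at the compose line above again: the -f flag there is followed by --env-file rather than by the YAML file. The form the clearml-serving docs use is along these lines (assuming the compose file lives under the repo's docker/ directory):

docker compose --env-file .env -f docker/docker-compose-triton-gpu.yml up -d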

  
  
Posted one month ago

AgitatedDove14 The ClearML server itself and all of its components (API server etc.) are on the x.x.x.69 machine.
Agents and serving are on the x.x.x.68 worker machine. My model files are also there, just placed in an ordinary non-shared Linux directory.
And I didn't do any specific configuration of the ClearML fileserver container — everything is at its defaults, without a single line changed except the IP address of the ClearML server.

I tried a couple of approaches to upload my pre-existing models into ClearML:

  1. To send them directly from .68 via the following script:
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register remote model from .68')

model_file_path = "file:///10.14.158.68/home/lab-usr/logsentinel/deeplog-bestloss.pth

model = InputModel.import_model(
    name="deeplog_bilstm",
    weights_url=model_file_path,
    project="LogSentinel",
    framework="pytorch"
)

task.connect(model)

It registers the model without any visible errors, and it appears in the model repository.

  2. To copy the model .pth file itself to the .69 machine, then run the script for LOCAL model file upload:
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register model')

model_file_path = "file:///home/lab-usr/logsentinel/deeplog-bestloss.pth

model = InputModel.import_model(name="deeplog_bilstm", weights_url=model_file_path, project="LogSentinel", framework="pytorch")

task.connect(model)

It registers it in model storage, also with no errors, but neither approach works when clearml-serving is directed to use the models via clearml-serving model add : Triton serving fails with an error that it cannot find the model file, and requests to the endpoint return "error 405 — Method Not Allowed".

  
  
Posted one month ago

Hi PungentRobin32

1732496915556 lab03:gpuall DEBUG docker: invalid reference format.

So it seems the docker command is incorrect?! The error you are seeing is the agent failing to spin up the Docker container. What's the OS of the host machine?

  
  
Posted one month ago