AgitatedDove14 Please correct me if I am wrong: are you proposing the following sequence?
- On the device that hosts the ClearML server, I should have my model file in any directory.
- Then I should upload it to the ClearML model repository directly as an OutputModel?
Because today I tried to upload the model using the following script:
```
from clearml import Task, OutputModel

# Step 1: Initialize a Task
task = Task.init(project_name="LogSentinel", task_name="Upload and register Output DeepLog Model from .69 locally")

# Step 2: Specify the local path to the model file
weights_filename = "/home/<username>/logs/models/deeplog_bilstm/deeplog_bestloss.pth"

# Step 3: Create a new OutputModel and upload the weights
output_model = OutputModel(task=task, name="Output deeplog_bilstm")
output_model.set_upload_destination("file:///home/<username>/models/")
uploaded_uri = output_model.update_weights(weights_filename=weights_filename)

# Step 4: Publish the model
output_model.publish()
print(f"Model successfully registered. Uploaded URI: {uploaded_uri}")
```
The model was registered with the following output:
python register_model.py
ClearML Task: created new task id=87619de0726d4b10afa13529b3789ffa
ClearML results page:
2025-02-23 00:51:49,203 - clearml.Task - INFO - No repository found, storing script code instead
2025-02-23 00:51:49,738 - clearml.Task - INFO - Completed model upload to file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth
Model successfully registered. Uploaded URI: file:///home/<username>/models/LogSentinel/Upload and register Output DeepLog Model from .69.87619de0726d4b10afa13529b3789ffa/models/deeplog_bestloss.pth
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
- Afterwards I checked the status and ID of the newly uploaded model in the model repository, as shown in the screenshot below.
- Then I shut down all clearml-serving dockers via docker compose down --remove-orphans, deleted all inference and serving tasks in the ClearML webui, and checked that they also disappeared from clearml-serving list.
- Afterwards, I created a new clearml-serving service and copied its ID to the .env file.
- Then, I ran clearml-serving model add with the model ID I copied from the webui:
clearml-serving model add --endpoint deepl_query --engine triton --model-id b43dbf85bcc0493688be8cd13c9d5e71 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99
clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Serving service Task a26d8de575f34211ab9ed553a4b70c75, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Updating serving service
- Finally, I started the clearml-serving-triton-gpu docker.
And I am still getting the Triton error that it fails to retrieve the model ID, even though the model ID is the same as in the model repository, and ClearML moved the model file to the target destination URI from the script above, so it should be in place:
2025-02-23 00:51:44
ClearML Task: overwriting (reusing) task id=33e6ebd811b041e489065b7f9877f8a9
2025-02-22 23:51:44,077 - clearml.Task - INFO - No repository found, storing script code instead
ClearML results page:
2025-02-23 00:51:44
configuration args: Namespace(inference_task_id=None, metric_frequency=1.0, name='triton engine', project=None, serving_id='a26d8de575f34211ab9ed553a4b70c75', t_allow_grpc=None, t_buffer_manager_thread_count=None, t_cuda_memory_pool_byte_size=None, t_grpc_infer_allocation_pool_size=None, t_grpc_port=None, t_http_port=None, t_http_thread_count=None, t_log_verbose=None, t_min_supported_compute_capability=None, t_pinned_memory_pool_byte_size=None, update_frequency=1.0)
String Triton Helper service
{'serving_id': 'a26d8de575f34211ab9ed553a4b70c75', 'project': None, 'name': 'triton engine', 'update_frequency': 1.0, 'metric_frequency': 1.0, 'inference_task_id': None, 't_http_port': None, 't_http_thread_count': None, 't_allow_grpc': None, 't_grpc_port': None, 't_grpc_infer_allocation_pool_size': None, 't_pinned_memory_pool_byte_size': None, 't_cuda_memory_pool_byte_size': None, 't_min_supported_compute_capability': None, 't_buffer_manager_thread_count': None, 't_log_verbose': None}
Updating local model folder: /models
Error retrieving model ID b43dbf85bcc0493688be8cd13c9d5e71 []
Starting server: ['tritonserver', '--model-control-mode=poll', '--model-repository=/models', '--repository-poll-secs=60.0', '--metrics-port=8002', '--allow-metrics=true', '--allow-gpu-metrics=true']
2025-02-23 00:51:45
Traceback (most recent call last):
File "clearml_serving/engines/triton/triton_helper.py", line 588, in <module>
main()
File "clearml_serving/engines/triton/triton_helper.py", line 580, in main
helper.maintenance_daemon(
File "clearml_serving/engines/triton/triton_helper.py", line 274, in maintenance_daemon
raise ValueError("triton-server process ended with error code {}".format(error_code))
ValueError: triton-server process ended with error code 1
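For what it's worth, here is a minimal sketch of how the model ID from the error could be sanity-checked from the SDK side (not part of my original run; the Model class and its url property are standard clearml SDK, and the ID is the one Triton complains about):
```
from clearml import Model

# Sanity check: load the registered model by the same ID Triton fails to retrieve
m = Model(model_id="b43dbf85bcc0493688be8cd13c9d5e71")
print(m.name)  # expected: "Output deeplog_bilstm"
print(m.url)   # a file:// URI here is only reachable from the machine that wrote it
```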
Hi AgitatedDove14 , I don't remember it well since I initially installed ClearML about half a year ago, but as far as I remember, I didn't preconfigure any specific queue.
HOWEVER, one of the first things I did in the webui was accidentally delete the "default" queue. Later, when my ClearML agents began to fail due to its absence, I had to use the API to create another queue named "my_default_queue" with the "default" system tag - then it was fixed (see the sketch below).
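For completeness, this is roughly the API call I used to recreate the queue (reconstructed from memory; I'm assuming here that the queues.create endpoint accepts a system_tags argument):
```
from clearml.backend_api.session.client import APIClient

# Recreate the deleted default queue under a new name;
# the "default" system tag is assumed to be what the agents look for
client = APIClient()
queue = client.queues.create(name="my_default_queue", system_tags=["default"])
print(queue.id)
```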
Here are the logs from INFO tab of the task from screenshot:
ARCHIVED: No
CHANGED AT: Feb 20 2025 22:39
LAST ITERATION: N/A
STATUS MESSAGE: N/A
STATUS REASON: N/A
CREATED AT: Nov 25 2024 2:19
STARTED AT: Nov 25 2024 2:24
LAST UPDATED AT: Feb 21 2025 8:40
COMPLETED AT: N/A
RUN TIME: 88:06d
QUEUE: my_default_queue
WORKER: lab03:gpuall
PARENT TASK: N/A
PROJECT: LogSentinel
ID: 30fb54845e2345358a4701c117cb43b0
version: 1.3.0
Ok, SuccessfulKoala55 , I was able to find part of what was incorrect in my serving setup:
- PyTorch model inference requires me to have a .env file and the clearml-serving-triton-gpu docker configured and running.
- Configuring the .env file requires me to provide the clearml-serving service ID, which was created by clearml-serving create.
- I have multiple services created via that command, as there is no command to remove the others, only to create additional ones.
- I found the serving service (and its ID) that is actually bound to run the models, and it behaves differently - no messages about failing to find models.
- BUT INSTEAD it fails with Kafka, which for some reason is running by default and waiting for brokers, clients, etc. Nothing like that is discussed in the docs or the clearml-serving tutorial, so now I am even more confused, to be honest. I didn't create any specific endpoints or connections to Kafka and related services - I didn't modify the contents of the clearml-serving-triton docker compose files at all, only the .env file.
- Also, when I did this and restarted the triton-serving docker, the running inference tasks multiplied for some reason. Now I have many duplicates, which cannot be stopped from the webui, and there seems to be no way to remove them using the webui either... They also seem misconfigured: they either have no endpoint or model attached, or they have a model, but the wrong one from 3 months ago. I listed earlier the only commands I currently use to create the serving services and add models to serving.
Screenshots will be attached as well as the logs.
Serving task (the one with the globe icon in the UI):
INFO Executing: ['docker', 'run', '-t', '-e', 'CLEARML_WORKER_ID=lab03:gpuall', '-e', 'CLEARML_DOCKER_IMAGE=', '-v', '/tmp/.clearml_agent.djxlonux.cfg:/root/clearml.conf', '-v', '/root/.clearml/apt-cache:/var/cache/apt/archives', '-v', '/root/.clearml/pip-cache:/root/.cache/pip', '-v', '/root/.clearml/pip-download-cache:/root/.clearml/pip-download-cache', '-v', '/root/.clearml/cache:/clearml_agent_cache', '-v', '/root/.clearml/vcs-cache:/root/.clearml/vcs-cache', '--rm', '', 'bash', '-c', 'echo \'Binary::apt::APT::Keep-Downloaded-Packages "true";\' > /etc/apt/apt.conf.d/docker-clean ; chown -R root /root/.cache/pip ; export DEBIAN_FRONTEND=noninteractive ; apt-get update ; apt-get install -y git libsm6 libxext6 libxrender-dev libglib2.0-0 ; declare LOCAL_PYTHON ; for i in {10..5}; do which python3.$i && python3.$i -m pip --version && export LOCAL_PYTHON=$(which python3.$i) && break ; done ; [ ! -z $LOCAL_PYTHON ] || apt-get install -y python3-pip ; [ ! -z $LOCAL_PYTHON ] || export LOCAL_PYTHON=python3 ; $LOCAL_PYTHON -m pip install -U "pip<20.2" ; $LOCAL_PYTHON -m pip install -U clearml-agent==0.17.1 ; cp /root/clearml.conf /root/default_clearml.conf ; NVIDIA_VISIBLE_DEVICES=all $LOCAL_PYTHON -u -m clearml_agent execute --disable-monitoring --id 30fb54845e2345358a4701c117cb43b0']
1732496915556 lab03:gpuall DEBUG docker: invalid reference format.
See 'docker run --help'.
What did I do wrong, please, and why did the restart of the clearml-serving-triton docker compose produce even more service tasks? :D
SuccessfulKoala55 Also, there's one more thing that is bugging me: I have my model files on a remote host in the same LAN (the .68 machine), so I try to push them to the model storage of the ClearML server (the .69 machine).
But as far as I understand, I must provide either a URL or a local path to the model file for the ClearML SDK to send it to the server machine. So I provide the absolute local path on my .68 device.
However, when I open the model storage on .69 and choose my uploaded model, it gives me a file:/// link, which is the LOCAL path to the file on .68 - there are no such folders on .69. So I don't understand where it actually stores the models, or how it downloads them to the storage...
Example:
- On .68 my model file lies in /home/username/modelfiles/model.pth
- When I upload this via a python script as an InputModel from .68 to .69, it shows no errors whatsoever.
- But in the ClearML server model storage on .69 the path looks like this: file:///home/username/modelfiles/model.pth - so, no remote IP of x.x.x.68 whatsoever.
- I tried to re-upload the model using the path x.x.x.68/home/username/modelfiles/model.pth , and it also didn't show any errors, giving file:///x.x.x.68/home/username/modelfiles/model.pth
But which of them is actually correct and functioning, I don't know... Should I move my model file manually to the .69 machine, where the ClearML server is?
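If I understand it correctly, the file only actually leaves .68 if I point the upload destination at the .69 fileserver instead of using a file:// path - roughly like this sketch (the http://x.x.x.69:8081 destination is my assumption based on the default ClearML fileserver port):
```
from clearml import Task, OutputModel

# Sketch: upload the weights to the ClearML fileserver on .69 instead of
# registering a local file:// path that only exists on .68
task = Task.init(project_name="LogSentinel", task_name="Upload model to fileserver")
output_model = OutputModel(task=task, name="deeplog_bilstm", framework="pytorch")
output_model.set_upload_destination("http://x.x.x.69:8081")  # assumed fileserver URL/port
uploaded_uri = output_model.update_weights(
    weights_filename="/home/username/modelfiles/model.pth"
)
print(uploaded_uri)  # should now start with http:// rather than file://
```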
Also, AgitatedDove14 , thank you very much for your advice regarding archiving - I did that: removed all current clearml-serving services, created a new one, attached its ID to the .env file, stopped all running serving dockers, and then restarted the clearml-serving-triton-gpu docker, adding a model afterwards.
I don't see any docker run errors now in the ClearML webui task console, but serving is still not able to locate the model file itself, even though the file is listed in the model repository - please take a look at the screenshots.
My model files are also there, just placed in an ordinary non-shared Linux directory.
So this is the issue: how would the container get to these models? You either need to mount the folder into the container, or you push them to the ClearML model repo with the OutputModel class, does that make sense?
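Something along these lines should do it (just a sketch; with output_uri=True the weights are uploaded to whatever files server is configured in clearml.conf, so the serving containers can later download them):
```
from clearml import Task, OutputModel

# output_uri=True -> model weights get uploaded to the configured files server
task = Task.init(project_name="LogSentinel", task_name="register model", output_uri=True)
model = OutputModel(task=task, name="deeplog_bilstm", framework="pytorch")
model.update_weights(weights_filename="/path/to/deeplog_bestloss.pth")  # local path on the machine running this
```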
Hmm, I just noticed:
'--rm', '', 'bash'
This is odd - there is an extra argument passed as "empty text". How did that end up there? Could it be you did not provide any docker image or default docker container?
Also, I accidentally created multiple services via clearml-serving create --name <> --project <> , and cannot get rid of them.
Find them in the UI (you can go to All Projects, then put their IDs in the search bar) and archive / delete them.
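If you prefer doing it from code, roughly something like this (a sketch using the project/name from your clearml-serving create command; whether set_archived is available depends on your clearml SDK version):
```
from clearml import Task

# List the stray serving-service tasks; archiving is left commented out on purpose
for t in Task.get_tasks(project_name="LogSentinel", task_name="deeplog-inference-test"):
    print(t.id, t.name, t.status)
    # t.set_archived(True)  # uncomment to archive (recent SDK versions)
```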
So the part that is confusing to me is: None
Who is running this Task, and how? Did you also set up a "service" queue (as part of the clearml-server installation)? What do you see under the "Info" tab?
Also, I tested the reachability of an endpoint with a curl query adapted from the example in the clearml-serving tutorial: https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_tutorial , and it returns error 405: method not allowed:
curl -X POST -H "accept: application/json" -H "Content-Type: application/json" -d '{"log_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}'
<html>
<head><title>405 Not Allowed</title></head>
<body>
<center><h1>405 Not Allowed</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
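For reference, the request I'm trying to make looks roughly like this in Python (the host and the /serve/<endpoint> path on port 8080 are my assumptions from the clearml-serving defaults; I omitted the URL from the curl line above):
```
import requests

# Assumed default: the clearml-serving-inference container listens on port 8080
# and exposes registered endpoints under /serve/<endpoint-name>
url = "http://x.x.x.68:8080/serve/deepl_query"
payload = {"log_sequence": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
resp = requests.post(url, json=payload, timeout=10)
print(resp.status_code, resp.text)
```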
However, the initial clearml-serving setup guide ( https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_setup ), which I followed, gives slightly different instructions from the tutorial linked above: the tutorial has docker build and docker run steps, whereas the setup guide has only the clearml-serving service creation and docker compose up. When I try to docker run using the full command from the link above, it fails with "unable to find image clearml-serving-inference locally, pull access denied for clearml-serving-inference, repository does not exist".
Yet the default docker compose up from the clearml-serving/docker directory somehow runs a clearml-serving-inference container too. Still, it doesn't accept the curl requests to the endpoints.
but now serving is not able to locate the model file itself,
From your screenshot the file seems to be in a local folder somewhere ("file://"); it should be in the file server or in object storage. How did it get there? How is the file server configured?
Hi PungentRobin32 , I think the issue is that it's trying to retrieve the wrong model ID
Can you share the output for the clearml-serving model add command?
Hi, SuccessfulKoala55 Yeah, sure, please, wait a sec - I will rerun the command. :)
Here's the command and output:
clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99
clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
Warning: more than one valid Controller Tasks found, using Task ID=ccb7bafba16e416ba5590ca717f05de0
Serving service Task ccb7bafba16e416ba5590ca717f05de0, Adding Model endpoint '/deepl_query/'
Info: syncing model endpoint configuration, state hash=ce7bbe44e5dead79f03e9ca8e28d45a6
Warning: Model endpoint 'deepl_query' overwritten
Updating serving service
Note: I would gladly avoid Triton, as it requires parameters I don't even understand, but it seems there is no other option for running PyTorch or other neural network models.
Also, GPT suggested that there must be some preprocessing of the model file itself to convert it from PTH to something called ONNX, but I have no idea what that is or whether it is actually needed (a rough sketch of what I think it means is below).
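From what I could gather, the conversion would look roughly like this (only a sketch: the DeepLogBiLSTM class is a placeholder for my training code, and the input shape mirrors the --input-size 7 1 I pass to clearml-serving):
```
import torch

# Sketch: export the trained PyTorch weights to ONNX so Triton can serve them.
# DeepLogBiLSTM is a hypothetical stand-in for the actual model class used in training.
model = DeepLogBiLSTM()
model.load_state_dict(torch.load("deeplog_bestloss.pth", map_location="cpu"))
model.eval()

dummy_input = torch.randn(1, 7, 1)  # batch=1, seq_len=7, features=1
torch.onnx.export(
    model,
    dummy_input,
    "deeplog_bestloss.onnx",
    input_names=["layer_0"],
    output_names=["layer_99"],
)
```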
Hi, AgitatedDove14 , the host OS is Ubuntu; I connect there via ssh.
Docker Compose is version 2 (the one that uses "docker compose" instead of the older "docker-compose").
I did not pass anything to or from docker manually, only used the commands from the official clearml-serving guide:
pip install clearml-serving
clearml-serving create --name deeplog-inference-test --project LogSentinel
git clone
nano .env # here I added my ClearML URLs and credentials
docker compose --env-file .env -f clearml-serving-triton-gpu.yml up -d
clearml-serving model add --endpoint deepl_query --engine triton --model-id 8df30222595543d3a3ac55c9e5e2fb15 --input-size 7 1 --input-type float32 --output-size 6 --output-type float32 --input-name layer_0 --output-name layer_99
The only thing I ever did to the clearml-serving dockers afterwards was docker compose down and up again.
Also, I accidentally created multiple services via clearml-serving create --name <> --project <> , and cannot get rid of them.
And they point either to the wrong model or to no model at all: I see only one model via clearml-serving model list - the one created by my command above - yet in the webui they point to nothing or to another, extremely old model file...
AgitatedDove14 The ClearML server itself and all of its components (API server etc.) are on the x.x.x.69 machine.
Agents and serving are on the x.x.x.68 worker machine. My model files are also there, just placed in an ordinary non-shared Linux directory.
And I didn't do any specific configuration of the ClearML fileserver docker - everything is at its defaults, without a single line changed except the IP address of the ClearML server.
I tried a couple of approaches to upload my preexisting models into ClearML:
- Sending them directly from .68 via the following script:
```
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register remote model from .68')
model_file_path = "file:///10.14.158.68/home/lab-usr/logsentinel/deeplog-bestloss.pth"
model = InputModel.import_model(
    name="deeplog_bilstm",
    weights_url=model_file_path,
    project="LogSentinel",
    framework="pytorch"
)
task.connect(model)
```
It registers the model without any visible errors, and it appears in the model repository.
- Copying the model.pth file itself to the .69 machine, then running the script for LOCAL model file upload:
```
from clearml import Task, InputModel

task = Task.init(project_name='LogSentinel', task_name='Register model')
model_file_path = "file:///home/lab-usr/logsentinel/deeplog-bestloss.pth"
model = InputModel.import_model(name="deeplog_bilstm", weights_url=model_file_path, project="LogSentinel", framework="pytorch")
task.connect(model)
```
It registers it in the model storage, also without errors, but neither of them works when clearml-serving is directed to use them via clearml-serving model add: Triton serving fails with an error about not being able to find the model file, and requests to the endpoint return "error 405 - method not allowed".
1732496915556 lab03:gpuall DEBUG docker: invalid reference format.
So it seems like the docker command is incorrect?! The error you are seeing is the agent failing to spin up the docker container. What's the OS of the host machine?