I have an on-prem/free ClearML Server setup with custom S3 back-end storage. I'm trying out the clearml-serving capability and not sure what's failing. When I start the serving containers it can't retrieve the model:

I have an on-prem/free clearml-server setup with custom S3 back-end storage. I'm trying out the clearml-serving capability and not sure what's failing. When I start the serving containers it can't retrieve the model:
clearml-serving-triton | Updating local model folder: /models
clearml-serving-triton | Error retrieving model ID 8ff15f9955794520af8d4710a99b66a7 []
When I look at the clearml-server web UI, I can see the model, but maybe it's not being logged correctly?
ID is: 8ff15f9955794520af8d4710a99b66a7
I was expecting an s3:// path for the model URL, so maybe this means it's not being logged correctly?
CREATED AT: Oct 4 2022 14:59
UPDATED AT: Oct 4 2022 15:00
FRAMEWORK: PyTorch
STATUS: Published
MODEL URL: file:///tmp/tmpl_bxlmu2/model/data/model.pth
USER: DCID Admin
CREATING EXPERIMENT: DeepSig simple NN
ARCHIVED: No
PROJECT: DeepSatcom
DESCRIPTION: snapshot /tmp/tmpl_bxlmu2/model/data/model.pth stored
Created by task id: ac8d8624f6da4ce3aeb78533fa8d9d

  
  
Posted one year ago

Answers 21


When I start the serving containers it can't retrieve the model:

Hi BrightRabbit75
I think you need to pass the credentials for your S3 account to the clearml-serving containers.
Basically, just add AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your docker compose:

https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L110
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L83
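
For reference, the relevant part of those compose files looks roughly like this (a sketch, not verbatim; the variables are the standard AWS ones, and the values come from your environment or example.env):

    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}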

  
  
Posted one year ago

Yep, that's the reason it is failing. How did you train the model itself?

  
  
Posted one year ago

AgitatedDove14 - fantastic support! I was indeed missing the output_uri; I had evidently commented it out with a "FIXME - why is this here?" comment. So now I see the model on the S3 server and the web UI properly shows its path:
...I've removed the model from the serving instance to start fresh, and the clearml-serving docker containers all come up happy. However, when I run clearml-serving model add ... it uses the wrong URL (an https:// instead of an s3://), so it can't upload the preprocess.py file:
clearml-serving - CLI for launching ClearML serving engine
Serving service Task 2405af60fec342f680c41d8343a25319, Adding Model endpoint '/deep_satcom_test/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Warning: Found multiple Models for '{'project_name': 'DeepSatcom', 'model_name': None, 'tags': None, 'only_published': True, 'include_archived': False}', selecting id=0f516b61b45c48509df12f93389a29ae
2022-10-05 08:40:39,319 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: " "
There already was a preprocess.py file on the S3 server, so something must have worked correctly earlier. If I do model add without the --preprocess option, it completes fine.

  
  
Posted one year ago

Also - I don't think I'm saving the model correctly. It appears I need to convert it into TorchScript?

  
  
Posted one year ago

All this mess was caused by a docker bridge IP subnet clash with our VPN subnet. 😞
Thanks for all the help AgitatedDove14!
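
For anyone hitting the same clash: Docker carves bridge networks from a default address pool, so if that pool overlaps your VPN routes, container traffic to the S3 endpoint goes nowhere. One way to steer Docker to a non-clashing range (the pool below is illustrative) is /etc/docker/daemon.json, followed by a daemon restart:

    {
      "default-address-pools": [
        { "base": "10.210.0.0/16", "size": 24 }
      ]
    }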

  
  
Posted one year ago

I've added that to the example.env (same creds etc. as in the clearml.conf), and I can see the metrics/artifacts on the S3 server. I can't find any actual model files on the server though. However, I swear it worked once.

Next time online I'll attach to the containers and verify the AWS creds are present. I guess there's a way to request the model from the CLI/Python env that I can test as well.
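
For reference, a quick Python check like this should exercise the same S3 credentials the serving containers need (model ID is the one from the UI above):

    from clearml import Model

    # Fetch the registered model by ID and try to pull it locally; if the
    # S3 creds/endpoint are wrong, this fails the same way the container does.
    model = Model(model_id="8ff15f9955794520af8d4710a99b66a7")
    print(model.url)               # should be an s3:// URL once output_uri is set
    print(model.get_local_copy())  # downloads the weights using clearml.conf creds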

  
  
Posted one year ago

To auto-upload the model you have to tell ClearML to upload it somewhere, usually by passing output_uri to Task.init or by setting default_output_uri in the clearml.conf.
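
A minimal sketch of the training-side change (project/task names mirror this thread; the bucket path is illustrative):

    from clearml import Task

    # With output_uri set, model snapshots get uploaded to S3 instead of
    # being recorded as a local file:///tmp/... path.
    task = Task.init(
        project_name="DeepSatcom",
        task_name="DeepSig simple NN",
        output_uri="s3://my-bucket/models",  # or set default_output_uri in clearml.conf
    )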

  
  
Posted one year ago

clearml-serving --id ${SERVE_TASK_ID} model add --engine triton --preprocess "preprocess.py" --endpoint "deep_satcom_test" --project "DeepSatcom" --published --input-size 4 128 1 --input-name INPUT__0 --input-type float32 --output-size 2 --output-name OUTPUT__0 --output-type float32

  
  
Posted one year ago

it is using the wrong URL - an https:// instead of an s3:// so it can't upload the preprocess.py file

What do you see as the link in the UI for that specific model?
What is the full model add command?

  
  
Posted one year ago

I can't seem to figure out what the names should be from the PyTorch example - where did INPUT__0 come from?

This is actually the layer name in the model:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/examples/pytorch/train_pytorch_mnist.py#L24
which is just the default name PyTorch gives the layer:
https://discuss.pytorch.org/t/how-to-get-layer-names-in-a-network/134238
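
A quick way to see the default names for your own network (toy module here, not the real one):

    import torch.nn as nn

    # PyTorch assigns numeric/default names to unnamed submodules
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    for name, module in model.named_modules():
        print(name or "<root>", "->", module.__class__.__name__)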

it appears I need to convert it into TorchScript?

Yes, this is a Triton limitation (actually it is for the best; TorchScript is the optimized version of your model).
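
A sketch of the conversion, assuming a trained nn.Module and the 4x128x1 input shape from the model add command in this thread (the stand-in network is illustrative):

    import torch
    import torch.nn as nn

    # Stand-in for the real trained network; shapes match --input-size 4 128 1
    # and --output-size 2 from the CLI command above.
    model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 128 * 1, 2))
    model.eval()

    # Trace with a representative batch and save; Triton's PyTorch backend
    # serves TorchScript files, not pickled nn.Modules.
    example = torch.randn(1, 4, 128, 1)
    torch.jit.trace(model, example).save("model.pt")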

  
  
Posted one year ago

Okay, that makes sense. If this is the case, I'm assuming you have set the files server to point to your S3 bucket, is that correct?
Could it be you are missing the credentials for that? (It is trying to upload the preprocessing code there, so the clearml-serving container can pull it later.)

  
  
Posted one year ago

I can't find any actual model files on the server though.

What do you mean? Do you see the specific models in the web UI? Is the link valid?

  
  
Posted one year ago

In clearml.conf:
files_server: " "

  
  
Posted one year ago

"This is Not a an S3 endpoint... what is the files server you configured for it?

  
  
Posted one year ago

I'm having trouble just verifying the bucket via boto3 directly, so something else might be amiss causing this issue.
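
For reference, the direct check I'm attempting looks like this (endpoint matches my config below; key/secret redacted and the bucket name is illustrative):

    import boto3

    # Talk to the on-prem S3 gateway directly, bypassing ClearML, using the
    # same host/key/secret as the sdk.aws.s3 section of clearml.conf.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://e3-storage-grid-gateway.aero.org:443",
        aws_access_key_id="****",
        aws_secret_access_key="****",
    )
    print(s3.list_objects_v2(Bucket="my-bucket", MaxKeys=5))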

  
  
Posted one year ago

The model URL (shown above) looks invalid:
file:///tmp/tmpl_bxlmu2/model/data/model.pth

I was expecting something like s3://...

  
  
Posted one year ago

No, it's a real S3 server that we have on-prem.
aws {
    s3 {
        region: ""
        key: ""
        secret: ""
        use_credentials_chain: false
        credentials: [
            {
                host: "e3-storage-grid-gateway.aero.org:443"
                key: "****"
                secret: "****"
                multipart: false
                secure: true
            }
        ]
    }
    boto3 {
        pool_connections: 512
        max_multipart_concurrency: 16
    }
}
  
  
Posted one year ago

Yes - the files_server is set to s3://... and the credentials are in the clearml.conf in the sdk/aws/s3 section. I'm trying to debug my way through the code to see where it fails and see if I can tell what's wrong.

  
  
Posted one year ago

Ah, that's what I'm missing. Will test tomorrow. I should have started with the example instead of my existing experiment. Thanks AgitatedDove14 !

  
  
Posted one year ago

Is this like a local MinIO?
What do you have under the sdk/aws/s3 section?

  
  
Posted one year ago

I'm now stuck on the actual request. I took a guess on the Triton config with input/output names etc., so I think I'm doing something wrong there. I can't seem to figure out what the names should be from the PyTorch example - where did INPUT__0 come from?

  
  
Posted one year ago