When I start the serving containers it can't retrieve the model:
Hi BrightRabbit75
I think you need to pass the credentials for your S3 account to the clearml-serving containers
Basically just add AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY to your docker compose:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L110
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L83
I've added that to the example.env. Same creds/etc from the clearml.conf and I can see the metrics/artifacts on the S3 server. I can't find any actual model files on the server though. However, I swear it worked once.
Next time online I'll attach to the containers and verify the AWS creds are present. I guess there's a way to request the model from the cli/python env that I can test as well.
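A minimal sketch of that check, to run from a Python shell inside one of the serving containers. The helper names `missing_aws_vars` and `fetch_model` are made up for illustration and the model ID is whatever the web UI shows; `Model(...).get_local_copy()` is the ClearML SDK call assumed here for pulling the weights.

```python
import os

# The two variables the serving containers are expected to receive.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def fetch_model(model_id):
    """Download a registered model's weights file and return the local path.

    clearml is imported lazily so the credential check works without it.
    """
    from clearml import Model

    return Model(model_id=model_id).get_local_copy()


if __name__ == "__main__":
    print("missing credentials:", missing_aws_vars() or "none")
```

If `missing_aws_vars()` comes back empty and `fetch_model("<model id>")` still fails, the problem is more likely the model URL or the network than the credentials.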
I can't find any actual model files on the server though.
What do you mean? Do you see the specific models in the web UI? Is the link valid?
The model url (shown above) looks invalid:
file:///tmp/tmpl_bxlmu2/model/data/model.pth
I was expecting something like s3://...
Yep, that's the reason it is failing. How did you train the model itself?
To auto-upload the model you have to tell ClearML where to upload it, usually by passing output_uri to Task.init or by setting default_output_uri in clearml.conf.
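Roughly, that could look like the sketch below. The helper names, the task name, and the example URIs are placeholders for illustration; only `Task.init(..., output_uri=...)` comes from the advice above. The small `is_local_only` check is an extra, handy for spotting a `file://` model URL like the one in this thread.

```python
from urllib.parse import urlparse


def is_local_only(model_url):
    """True when the registered URL points at a local file, not remote storage."""
    return urlparse(model_url).scheme in ("", "file")


def init_task_with_upload(output_uri):
    """Start a task whose saved models are uploaded to output_uri.

    project_name and task_name are placeholders for illustration.
    """
    from clearml import Task  # lazy import: only needed when actually training

    return Task.init(
        project_name="DeepSatcom",
        task_name="train with upload",
        output_uri=output_uri,
    )
```

With `output_uri` set to something like `s3://<host>:<port>/<bucket>/<folder>`, models saved during training should show up in the web UI with an `s3://` URL instead of a temporary local path.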
Ah, that's what I'm missing. Will test tomorrow. I should have started with the example instead of my existing experiment. Thanks AgitatedDove14 !
AgitatedDove14 - fantastic support! I was indeed missing the output_uri, I evidently commented it out with a "FIXME - why is this here?" comment. So now I see the model on the S3 server and the Web UI properly shows its path:...
I've removed the model from the serving instance to start fresh and the clearml-serving docker containers all come up happy. However, when I run clearml-serving model add ... it is using the wrong URL (an https:// instead of an s3://), so it can't upload the preprocess.py file:
clearml-serving - CLI for launching ClearML serving engine
Serving service Task 2405af60fec342f680c41d8343a25319, Adding Model endpoint '/deep_satcom_test/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Warning: Found multiple Models for '{'project_name': 'DeepSatcom', 'model_name': None, 'tags': None, 'only_published': True, 'include_archived': False}', selecting id=0f516b61b45c48509df12f93389a29ae
2022-10-05 08:40:39,319 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: "
"
There already was a preprocess.py file on the S3 server so something must have worked correctly earlier. If I do model add without the --preprocess option, it completes fine.
clearml-serving --id ${SERVE_TASK_ID} model add --engine triton --preprocess "preprocess.py" --endpoint "deep_satcom_test" --project "DeepSatcom" --published --input-size 4 128 1 --input-name INPUT__0 --input-type float32 --output-size 2 --output-name OUTPUT__0 --output-type float32
Okay, that makes sense. If this is the case, I'm assuming you have set the files server to point to your S3 bucket, is that correct?
Could it be that you are missing the credentials for that? (It is trying to upload the preprocessing code there, so that the clearml-serving container will be able to pull it later.)
Yes - the files_server is set to s3://... and the credentials are in the clearml.conf in the sdk/aws/s3 section. I'm trying to debug my way through the code to see where it fails and see if I can tell what's wrong.
"
This is not an S3 endpoint... what is the files server you configured for it?
Is this like a local minio?
What do you have under the sdk/aws/s3 section?
No, it's a real S3 server that we have on-prem.
` aws {
    s3 {
        region: ""
        key: ""
        secret: ""
        use_credentials_chain: false
        credentials: [
            {
                host: "e3-storage-grid-gateway.aero.org:443"
                key: "****"
                secret: "****"
                multipart: false
                secure: true
            }
        ]
    }
    boto3 {
        pool_connections: 512
        max_multipart_concurrency: 16
    }
} `
I'm having trouble just verifying the bucket via boto3 directly so something else might be amiss causing this issue.
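For reference, a direct boto3 check against an on-prem endpoint might look like this sketch. The function names are hypothetical and the bucket name is not given in the thread; the host and the `secure` flag mirror the credentials entry above.

```python
def endpoint_url(host, secure=True):
    """Build the endpoint URL boto3 expects from a clearml.conf style host entry."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}"


def bucket_reachable(host, key, secret, bucket, secure=True):
    """Return True if a HEAD request on the bucket succeeds with these credentials."""
    import boto3  # lazy import so endpoint_url can be used without boto3
    from botocore.exceptions import BotoCoreError, ClientError

    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url(host, secure),
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )
    try:
        client.head_bucket(Bucket=bucket)
        return True
    except (BotoCoreError, ClientError) as exc:
        # Connection errors and permission errors both end up here.
        print(f"bucket check failed: {exc}")
        return False
```

A connection error here, rather than an access-denied error, points at networking between the machine and the endpoint and not at the keys.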
All this mess was caused by a docker bridge IP subnet clash with our VPN subnet. 😞
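For anyone hitting the same thing, a clash like this can be spotted with the standard `ipaddress` module. The subnets below are examples only: 172.17.0.0/16 is Docker's usual default bridge range, and the VPN ranges are invented, since the actual ones are not given in this thread.

```python
import ipaddress


def subnets_clash(docker_cidr, vpn_cidr):
    """True when the two CIDR ranges share at least one address."""
    docker_net = ipaddress.ip_network(docker_cidr, strict=False)
    vpn_net = ipaddress.ip_network(vpn_cidr, strict=False)
    return docker_net.overlaps(vpn_net)


if __name__ == "__main__":
    # Example values, not the real networks from this thread.
    print(subnets_clash("172.17.0.0/16", "172.17.8.0/24"))
```

The ranges Docker actually uses can be read from `docker network inspect <network name>`, and the VPN range from the routing table.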
Thanks for all the help AgitatedDove14 !
I'm now stuck on the actual request. I took a guess on the triton config with input/output names/etc. so I think I'm doing something wrong there. I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from?
Also - I don't think I'm saving the model correctly. It appears I need to convert it into TorchScript?
I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from
This is actually the layer name in the model:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/examples/pytorch/train_pytorch_mnist.py#L24
Which is just the default name PyTorch gives the layer
https://discuss.pytorch.org/t/how-to-get-layer-names-in-a-network/134238
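One hedged addition on the naming: as far as I understand Triton's PyTorch backend, a TorchScript file does not carry tensor names, so the serving config refers to tensors positionally as `<name>__<index>`. Under that assumption (worth checking against the Triton docs), the names for the --input-name and --output-name options can be generated like this; `triton_tensor_names` is a made-up helper.

```python
def triton_tensor_names(prefix, count):
    """Return positional names such as INPUT__0, INPUT__1 for `count` tensors."""
    return [f"{prefix}__{index}" for index in range(count)]


if __name__ == "__main__":
    print(triton_tensor_names("INPUT", 1))
    print(triton_tensor_names("OUTPUT", 1))
```

For a model with one input and one output this gives INPUT__0 and OUTPUT__0, matching the names used in the model add command earlier in the thread.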
It appears I need to convert it into TorchScript?
Yes, this is a Triton limitation (actually it is for the best, this is the optimized version of your model)
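A minimal sketch of that conversion using tracing. `export_torchscript` is a hypothetical helper; the model and the example input are up to you, and `torch.jit.script` may be the better choice if the model has data-dependent control flow.

```python
def export_torchscript(model, example_input, path):
    """Trace `model` with `example_input` and save the TorchScript file to `path`."""
    import torch  # lazy import: only needed when actually exporting

    model.eval()  # freeze dropout / batch-norm behaviour before tracing
    with torch.no_grad():
        traced = torch.jit.trace(model, example_input)
    traced.save(path)
    return path
```

For the endpoint above, the example input would presumably be something like `torch.zeros(1, 4, 128, 1)`, assuming --input-size 4 128 1 excludes the batch dimension; that shape is a guess, not something confirmed in this thread.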