I've added that to the example.env. Same creds/etc from the clearml.conf and I can see the metrics/artifacts on the S3 server. I can't find any actual model files on the server though. However, I swear it worked once.
Next time online I'll attach to the containers and verify the AWS creds are present. I guess there's a way to request the model from the cli/python env that I can test as well.
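For the "verify the AWS creds are present" step, a minimal sketch that could be run with the Python interpreter inside each container. It only checks the two variable names mentioned in this thread; the helper name `missing_aws_vars` is made up for illustration.

```python
import os

# The two variables the thread suggests adding to the docker compose environment.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env=None):
    """Return the required AWS variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]


missing = missing_aws_vars()
print("missing AWS variables:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the variables reached the container, not that the key and secret are valid for the bucket.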
"I can't find any actual model files on the server though."
What do you mean? Do you see the specific models in the web UI? Is the link valid?
The model url (shown above) looks invalid:
file:///tmp/tmpl_bxlmu2/model/data/model.pth
I was expecting something like s3://...
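A quick way to spot this symptom from the Python env, sketched under assumptions: `is_remote_model_url` is a hypothetical helper, and `registered_model_url` assumes the `clearml` package is installed and that the model id is one you copy from the web UI.

```python
from urllib.parse import urlparse


def is_remote_model_url(url):
    """True if the URL points at remote storage rather than a local file."""
    scheme = urlparse(url).scheme.lower()
    return scheme not in ("", "file")


def registered_model_url(model_id):
    """Look up the URL ClearML stored for a model (needs the clearml package)."""
    from clearml import Model  # imported lazily so the helper above works without it
    return Model(model_id=model_id).url


# The URL from this thread was never uploaded anywhere:
print(is_remote_model_url("file:///tmp/tmpl_bxlmu2/model/data/model.pth"))
```

A `file://` URL means the model was only registered with its local temp path, so the serving containers have nothing to download.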
Is this like a local minio?
What do you have under the sdk/aws/s3 section?
Yes - the files_server is set to s3://... and the credentials are in the clearml.conf in the sdk/aws/s3 section. I'm trying to debug my way through the code to see where it fails and see if I can tell what's wrong.
Ah, that's what I'm missing. Will test tomorrow. I should have started with the example instead of my existing experiment. Thanks AgitatedDove14 !
Okay, that makes sense. If that's the case, I'm assuming you have set the files server to point to your S3 bucket, is that correct?
Could it be that you are missing the credentials for it? (It is trying to upload the preprocessing code there, so that the clearml-serving container can pull it later.)
All this mess was caused by a docker bridge IP subnet clash with our VPN subnet. 😞
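For anyone hitting the same thing, a small sketch of checking for a subnet clash with the standard `ipaddress` module. 172.17.0.0/16 is Docker's usual default bridge range; the VPN ranges below are made-up examples, so substitute your own.

```python
import ipaddress


def subnets_clash(a, b):
    """True if the two CIDR ranges share at least one address."""
    net_a = ipaddress.ip_network(a, strict=False)
    net_b = ipaddress.ip_network(b, strict=False)
    return net_a.overlaps(net_b)


docker_bridge = "172.17.0.0/16"  # Docker's usual default bridge subnet
for vpn in ("172.17.5.0/24", "10.8.0.0/24"):  # hypothetical VPN subnets
    print(vpn, "clashes" if subnets_clash(docker_bridge, vpn) else "is fine")
```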
Thanks for all the help AgitatedDove14 !
I'm having trouble just verifying the bucket via boto3 directly so something else might be amiss causing this issue.
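A minimal sketch of that direct boto3 check, assuming `boto3` is installed and that the endpoint is built from the same host and secure values as in the clearml.conf. The function names and the bucket argument are placeholders, not anything ClearML provides.

```python
def endpoint_url(host, secure=True):
    """Build the endpoint URL from a clearml.conf style host and secure flag."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}"


def bucket_reachable(bucket, host, key, secret, secure=True):
    """Try a HEAD request on the bucket; return True if it succeeds."""
    import boto3  # third-party, imported lazily
    from botocore.exceptions import BotoCoreError, ClientError

    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url(host, secure),
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )
    try:
        client.head_bucket(Bucket=bucket)
        return True
    except (BotoCoreError, ClientError) as exc:
        # Connection errors and permission errors both end up here.
        print(f"bucket check failed: {exc}")
        return False
```

If this fails with a connection error from inside the container but works from the host, the problem is more likely networking than credentials.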
Yep, that's the reason it is failing. How did you train the model itself?
"I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from?"
This is actually the layer name in the model:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/examples/pytorch/train_pytorch_mnist.py#L24
Which is just the default name Pytorch gives the layer
https://discuss.pytorch.org/t/how-to-get-layer-names-in-a-network/134238
"It appears I need to convert it into TorchScript?"
Yes, this is a Triton limitation (actually it is for the best, this is the optimized version of your model)
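A rough sketch of the conversion using `torch.jit.trace`, assuming PyTorch is installed and the model can be traced with a single tensor input. It also assumes the `--input-size` values exclude the batch dimension, which is worth checking against your own model; both helper names are hypothetical.

```python
def example_input_shape(input_size, batch=1):
    """Prepend a batch dimension to the per-sample input size."""
    return (batch, *input_size)


def export_torchscript(model, input_size, path="model.pt"):
    """Trace the model with a dummy input and save it as TorchScript."""
    import torch  # third-party, imported lazily

    model.eval()
    example = torch.zeros(example_input_shape(input_size))
    traced = torch.jit.trace(model, example)
    traced.save(path)
    return path


# With the sizes used in this thread, the dummy input would have this shape:
print(example_input_shape([4, 128, 1]))
```

Tracing records one pass through the model, so models with data-dependent control flow may need `torch.jit.script` instead.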
I'm now stuck on the actual request. I took a guess on the triton config with input/output names/etc. so I think I'm doing something wrong there. I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from?
When I start the serving containers it can't retrieve the model:
Hi BrightRabbit75
I think you need to pass the credentials for your S3 account to the clearml-serving containers
Basically just add AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY to your docker compose:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L110
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L83
AgitatedDove14 - fantastic support! I was indeed missing the output_uri, I evidently commented it out with a "FIXME - why is this here?" comment. So now I see the model on the S3 server and the Web UI properly shows its path:...
I've removed the model from the serving instance to start fresh and the clearml-serving docker containers all come up happy. However, when I run clearml-serving model add ... it uses the wrong URL (an https:// instead of an s3://), so it can't upload the preprocess.py file:
clearml-serving - CLI for launching ClearML serving engine
Serving service Task 2405af60fec342f680c41d8343a25319, Adding Model endpoint '/deep_satcom_test/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Warning: Found multiple Models for '{'project_name': 'DeepSatcom', 'model_name': None, 'tags': None, 'only_published': True, 'include_archived': False}', selecting id=0f516b61b45c48509df12f93389a29ae
2022-10-05 08:40:39,319 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: "
"
There already was a preprocess.py file on the S3 server, so something must have worked correctly earlier. If I do model add without the --preprocess option, it completes fine.
Also - I don't think I'm saving the model correctly. It appears I need to convert it into TorchScript?
To auto upload the model you have to tell clearml to upload it somewhere, usually by passing output_uri to Task.init or setting the default_output_uri in the clearml.conf
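A minimal sketch of the `output_uri` route, assuming the `clearml` package is installed. The helper names, project/task names and bucket path are placeholders; the only real API used is `Task.init` with its `output_uri` argument.

```python
def task_init_kwargs(project, name, output_uri):
    """Collect Task.init arguments, rejecting a plain local path as output_uri."""
    if "://" not in output_uri:
        raise ValueError(f"expected a storage URI such as s3://..., got {output_uri!r}")
    return {"project_name": project, "task_name": name, "output_uri": output_uri}


def init_task(project, name, output_uri):
    """Start a ClearML task whose models are uploaded to output_uri."""
    from clearml import Task  # third-party, imported lazily

    return Task.init(**task_init_kwargs(project, name, output_uri))


# Hypothetical usage at the top of the training script:
# task = init_task("DeepSatcom", "train", "s3://my-s3-host:443/my-bucket")
```

With this in place, a model saved during training should show up in the web UI with a remote URL rather than a `file://` path.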
"
This is not an S3 endpoint... What is the files server you configured for it?
No, it's a real S3 server that we have on-prem.
` aws {
    s3 {
        region: ""
        key: ""
        secret: ""
        use_credentials_chain: false
        credentials: [
            {
                host: "e3-storage-grid-gateway.aero.org:443"
                key: "****"
                secret: "****"
                multipart: false
                secure: true
            }
        ]
    }
    boto3 {
        pool_connections: 512
        max_multipart_concurrency: 16
    }
} `
clearml-serving --id ${SERVE_TASK_ID} model add --engine triton --preprocess "preprocess.py" --endpoint "deep_satcom_test" --project "DeepSatcom" --published --input-size 4 128 1 --input-name INPUT__0 --input-type float32 --output-size 2 --output-name OUTPUT__0 --output-type float32