
AgitatedDove14 - fantastic support! I was indeed missing the output_uri; I had evidently commented it out with a "FIXME - why is this here?" comment. So now I see the model on the S3 server and the Web UI properly shows its path:...
I've removed the model from the serving instance to start fresh, and the clearml-serving docker containers all come up happy. However, when I run clearml-serving model add ...
it is using the wrong URL - an https:// instead of an s3:// so it can't upload the ...
Ah, that's what I'm missing. Will test tomorrow. I should have started with the example instead of my existing experiment. Thanks AgitatedDove14 !
I've added that to the example.env. Same creds/etc from the clearml.conf and I can see the metrics/artifacts on the S3 server. I can't find any actual model files on the server though. However, I swear it worked once.
Next time online I'll attach to the containers and verify the AWS creds are present. I guess there's a way to request the model from the cli/python env that I can test as well.
I'm now stuck on the actual request. I took a guess on the triton config with input/output names/etc. so I think I'm doing something wrong there. I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from?
All this mess was caused by a docker bridge IP subnet clash with our VPN subnet. 😞
Thanks for all the help AgitatedDove14 !
The model url (shown above) looks invalid:
file:///tmp/tmpl_bxlmu2/model/data/model.pth
I was expecting something like s3://...
Also - I don't think I'm saving the model correctly. It appears I need to convert it to TorchScript?
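A minimal sketch of what that conversion could look like, assuming the model can be traced with a single example tensor. `export_torchscript` and the default file name are hypothetical; `torch.jit.trace` and `.save()` are the standard PyTorch calls. torch is imported inside the function so the definition loads even where PyTorch is not installed.

```python
def export_torchscript(model, example_input, path="model.pt"):
    """Trace `model` with `example_input` and save the TorchScript file to `path`."""
    import torch  # lazy import: only needed when the export actually runs

    model.eval()  # switch off dropout / batch-norm updates before tracing
    with torch.no_grad():
        traced = torch.jit.trace(model, example_input)
    traced.save(path)  # the saved file is a TorchScript archive, not a plain state_dict
    return path
```

Usage would be something like export_torchscript(net, torch.randn(1, 4, 128, 1)), where the example shape is only a guess (a batch of one with the 4 128 1 input size). Models with data-dependent control flow may need torch.jit.script instead of tracing.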
No, it's a real S3 server that we have on-prem.
aws {
    s3 {
        region: ""
        key: ""
        secret: ""
        use_credentials_chain: false
        credentials: [
            {
                host: "e3-storage-grid-gateway.aero.org:443"
                key: "****"
                secret: "****"
                multipart: false
                secure: true
            }
        ]
    }
    boto3 {
...
Yes - the files_server is set to s3://... and the credentials are in the clearml.conf in the sdk/aws/s3 section. I'm trying to debug my way through the code to see where it fails and see if I can tell what's wrong.
I'm having trouble just verifying the bucket via boto3 directly so something else might be amiss causing this issue.
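A rough sketch of that direct boto3 check against an on-prem S3 endpoint, assuming the same host and secure flag as in the clearml.conf. `endpoint_url_from_host`, `bucket_is_reachable`, and any bucket name passed in are hypothetical; boto3 is imported lazily so the URL helper works without it.

```python
def endpoint_url_from_host(host: str, secure: bool = True) -> str:
    """Turn a clearml.conf-style "host:port" entry into a URL boto3 accepts."""
    scheme = "https" if secure else "http"
    return f"{scheme}://{host}"


def bucket_is_reachable(host: str, key: str, secret: str, bucket: str,
                        secure: bool = True) -> bool:
    """Return True if head_bucket succeeds for `bucket` on the given endpoint."""
    import boto3  # lazy import: only needed for the real network check
    from botocore.exceptions import BotoCoreError, ClientError

    client = boto3.client(
        "s3",
        endpoint_url=endpoint_url_from_host(host, secure),
        aws_access_key_id=key,
        aws_secret_access_key=secret,
    )
    try:
        client.head_bucket(Bucket=bucket)
        return True
    except (BotoCoreError, ClientError) as err:
        # Covers connection failures as well as 403/404 responses.
        print(f"bucket check failed: {err}")
        return False
```

If this fails with a connection error rather than a 403/404, the problem is probably networking and not the ClearML configuration.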
Check the subnets of your VPN machines and the clearml docker subnet. I've had issues where the VPN uses 172.*, which overlaps the Docker bridge network, so all responses from the Docker containers get routed internally and dropped, i.e. they never leave the bridge network to get back to the VPN machines.
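A quick way to test for that kind of clash, sketched with Python's standard ipaddress module. The CIDR values are assumptions for illustration: 172.17.0.0/16 is Docker's default bridge range, and the VPN ranges are made up; substitute the real ones from your VPN config and from docker network inspect.

```python
import ipaddress


def subnets_clash(vpn_cidr: str, docker_cidr: str) -> bool:
    """Return True if the two networks share any addresses."""
    vpn = ipaddress.ip_network(vpn_cidr, strict=False)
    docker = ipaddress.ip_network(docker_cidr, strict=False)
    return vpn.overlaps(docker)


if __name__ == "__main__":
    # Hypothetical VPN ranges checked against Docker's default bridge subnet.
    for vpn in ("172.16.0.0/12", "10.8.0.0/24"):
        print(vpn, "clashes with docker bridge:", subnets_clash(vpn, "172.17.0.0/16"))
```

If the ranges overlap, moving the Docker bridge or the compose network to a different subnet is usually easier than changing the VPN.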
Is that because I didn't list the bucket name in the clearml.conf?
Great! Now to tell our IT that I need more space on S3 🙂
Ok - it's the URL in the files_server that was wrong. It needs to be s3 and not https.
Nope - bucket_name in clearml.conf didn't work. Maybe default_uri somewhere?
No worries. I probably should have revisited the examples. Too much cutting/pasting on my part. Thanks so much for helping!
Also - I'm not specifying the URI when I create the Task
I added secure and region - didn't change the behavior.
Now to get clearml-data to use S3... 🙂
Same response. Should I change that in the fileserver section too?
Wait - adding the output_uri seems to work.
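For reference, a minimal sketch of passing output_uri when creating the Task, with a guard against the https:// vs s3:// mix-up. `init_task_with_s3_output` is a hypothetical helper and the task name and bucket path are placeholders; `Task.init(..., output_uri=...)` is the ClearML call. clearml is imported lazily so the scheme check runs without it.

```python
def init_task_with_s3_output(project_name: str, task_name: str, output_uri: str):
    """Create a ClearML Task whose models and artifacts upload to an s3:// location."""
    if not output_uri.startswith("s3://"):
        raise ValueError(f"expected an s3:// destination, got: {output_uri!r}")

    from clearml import Task  # lazy import: only needed when a task is really created

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        output_uri=output_uri,  # without this, the model path stays a local file:// URL
    )
```

Usage would look like init_task_with_s3_output("DeepSatcom", "train", "s3://<host>:<port>/<bucket>/models"), with the placeholders filled in for your server.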
clearml-serving --id ${SERVE_TASK_ID} model add --engine triton --preprocess "preprocess.py" --endpoint "deep_satcom_test" --project "DeepSatcom" --published --input-size 4 128 1 --input-name INPUT__0 --input-type float32 --output-size 2 --output-name OUTPUT__0 --output-type float32
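On the INPUT__0 / OUTPUT__0 names used in the command above: as far as I can tell they follow the Triton PyTorch backend convention of `<name>__<index>`, which exists because TorchScript models don't carry tensor names. A small sketch of that convention; `io_name` and `follows_convention` are hypothetical helpers, not part of clearml-serving or Triton.

```python
import re

# <name>, a double underscore, then the positional index of the tensor.
_NAME_PATTERN = re.compile(r"\w+__\d+")


def io_name(prefix: str, index: int) -> str:
    """Build a tensor name such as INPUT__0 for the index-th input or output."""
    return f"{prefix}__{index}"


def follows_convention(name: str) -> bool:
    """Return True if `name` looks like <name>__<index>."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    print(io_name("INPUT", 0), follows_convention(io_name("INPUT", 0)))
    print("input", follows_convention("input"))
```

The index is the position of the tensor in the model's forward() arguments or return values, so a model with one input and one output only needs INPUT__0 and OUTPUT__0.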