Check the subnets of your VPN machines and the ClearML Docker subnet. I've had issues where the VPN uses 172.*, which matches the Docker bridge network, so all responses from the Docker containers get routed internally and dropped; i.e., they never make it back out of the bridge network to the VPN machines.
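If that's the clash, a minimal sketch of moving Docker off the 172.* range via /etc/docker/daemon.json (the 10.x addresses are placeholders - pick ranges that don't overlap your VPN, then restart the Docker daemon):

```json
{
  "bip": "10.113.0.1/24",
  "default-address-pools": [
    { "base": "10.114.0.0/16", "size": 24 }
  ]
}
```

`bip` moves the default bridge itself; `default-address-pools` controls the subnets Docker hands out to user-defined/compose networks.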
Same response. Should I change that in the files_server section too?
Is that because I didn't list the bucket name in the clearml.conf?
Ok - it's the URL in the files_server that was wrong. It needs to be s3:// and not https://.
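For reference, the line I fixed in clearml.conf (bucket name is a placeholder for ours):

```
api {
    files_server: "s3://e3-storage-grid-gateway.aero.org:443/clearml-bucket"
}
```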
I've added that to the example.env. Same creds etc. as in the clearml.conf, and I can see the metrics/artifacts on the S3 server. I can't find any actual model files on the server though. However, I swear it worked once.
Next time online I'll attach to the containers and verify the AWS creds are present. I guess there's a way to request the model from the CLI/Python env that I can test as well.
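If it helps, this is roughly what I plan to test from the Python side (model ID is a placeholder copied from the Web UI; I'm assuming InputModel pulls through the creds in clearml.conf):

```python
from clearml import InputModel

# hypothetical model ID taken from the Web UI
model = InputModel(model_id="abc123")
print(model.url)  # should be an s3:// path if the upload went to S3

# download through the sdk.aws.s3 credentials in clearml.conf
local_path = model.get_local_copy()
print(local_path)
```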
Wait - adding the output_uri seems to work.
Also - I'm not specifying the URI when I create the Task
No worries. I probably should have revisited the examples. Too much cutting/pasting on my part. Thanks so much for helping!
Great! Now to tell our IT that I need more space on S3 🙂
Ah, that's what I'm missing. Will test tomorrow. I should have started with the example instead of my existing experiment. Thanks AgitatedDove14 !
AgitatedDove14 - fantastic support! I was indeed missing the output_uri; I evidently commented it out with a "FIXME - why is this here?" comment. So now I see the model on the S3 server and the Web UI properly shows its path:...
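For anyone else who trips on this, the fix was just passing output_uri when creating the Task (task name and bucket are placeholders):

```python
from clearml import Task

task = Task.init(
    project_name="DeepSatcom",
    task_name="train",  # placeholder
    # placeholder bucket on our on-prem S3 gateway
    output_uri="s3://e3-storage-grid-gateway.aero.org:443/clearml-bucket",
)
```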
I've removed the model from the serving instance to start fresh, and the clearml-serving docker containers all come up happy. However, when I run clearml-serving model add ...
it uses the wrong URL - an https:// instead of an s3:// - so it can't upload the ...
I added secure and region - didn't change the behavior.
I'm now stuck on the actual request. I took a guess at the Triton config with input/output names etc., so I think I'm doing something wrong there. I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from?
Also - I don't think I'm saving the model correctly. It appears it needs to be converted to TorchScript?
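Partially answering my own question: Triton's PyTorch backend loads TorchScript, and since TorchScript doesn't carry tensor names, the backend names them positionally as INPUT__0/OUTPUT__0 etc. A minimal conversion sketch (assuming `model` is the trained nn.Module and the shape from my serving command below):

```python
import torch

# assuming `model` is the trained torch.nn.Module
model.eval()

# example input matching --input-size 4 128 1, with a batch dim of 1 prepended
example = torch.randn(1, 4, 128, 1)

# trace to TorchScript so Triton's PyTorch backend can load it
scripted = torch.jit.trace(model, example)
scripted.save("model.pt")
```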
No, it's a real S3 server that we have on-prem.
```
aws {
    s3 {
        region: ""
        key: ""
        secret: ""
        use_credentials_chain: false
        credentials: [
            {
                host: "e3-storage-grid-gateway.aero.org:443"
                key: "****"
                secret: "****"
                multipart: false
                secure: true
            }
        ]
    }
    boto3 {
        ...
```
Now to get clearml-data to use S3... 🙂
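My rough plan, if I'm reading the docs right (names are placeholders, and the --storage flag is my assumption of how to point the upload at our gateway):

```
clearml-data create --project DeepSatcom --name satcom-dataset
clearml-data add --files ./data
clearml-data upload --storage s3://e3-storage-grid-gateway.aero.org:443/clearml-bucket
clearml-data close
```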
All this mess was caused by a Docker bridge IP subnet clash with our VPN subnet. 😞
Thanks for all the help AgitatedDove14 !
Yes - the files_server is set to s3://... and the credentials are in the clearml.conf in the sdk/aws/s3 section. I'm trying to debug my way through the code to see where it fails and see if I can tell what's wrong.
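The quickest probe I've found is to bypass the Task machinery and hit the storage layer directly (bucket/key are placeholders; this should use the sdk.aws.s3 creds from clearml.conf):

```python
from clearml import StorageManager

# placeholder bucket and key on our on-prem gateway
remote = StorageManager.upload_file(
    local_file="/tmp/hello.txt",
    remote_url="s3://e3-storage-grid-gateway.aero.org:443/clearml-bucket/debug/hello.txt",
)
print(remote)  # final URL if the upload succeeded
```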
```
clearml-serving --id ${SERVE_TASK_ID} model add --engine triton --preprocess "preprocess.py" --endpoint "deep_satcom_test" --project "DeepSatcom" --published --input-size 4 128 1 --input-name INPUT__0 --input-type float32 --output-size 2 --output-name OUTPUT__0 --output-type float32
```
The model URL (shown above) looks invalid:
```
file:///tmp/tmpl_bxlmu2/model/data/model.pth
```
I was expecting something like s3://...
I'm having trouble just verifying the bucket via boto3 directly so something else might be amiss causing this issue.
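For the record, the direct boto3 check I'm running (creds masked, bucket name is a placeholder):

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://e3-storage-grid-gateway.aero.org:443",  # our on-prem gateway
    aws_access_key_id="****",
    aws_secret_access_key="****",
)

# list a few keys just to prove connectivity and creds
resp = s3.list_objects_v2(Bucket="clearml-bucket", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```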
Nope - bucket_name in clearml.conf didn't work. Maybe default_uri somewhere?
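In case anyone else lands here: I believe the knob I was hunting for is sdk.development.default_output_uri in clearml.conf (my assumption, not yet verified on our setup):

```
sdk {
    development {
        # placeholder bucket; would make every Task upload its models here by default
        default_output_uri: "s3://e3-storage-grid-gateway.aero.org:443/clearml-bucket"
    }
}
```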