I have an on-prem/free ClearML Server setup with custom S3 back-end storage. I'm trying out the clearml-serving capability and not sure what's failing. When I start the serving containers it can't retrieve the model:

I have an on-prem/free clearml-server setup with custom S3 back-end storage. I'm trying out the clearml-serving capability and not sure what's failing. When I start the serving containers it can't retrieve the model:
clearml-serving-triton | Updating local model folder: /models
clearml-serving-triton | Error retrieving model ID 8ff15f9955794520af8d4710a99b66a7 []
When I look at the clearml-server web UI, I can see the model, but maybe it's not being logged correctly?
ID is: 8ff15f9955794520af8d4710a99b66a7
I was expecting an s3:// path for the model URL, so maybe this means it's not being logged correctly?
CREATED AT: Oct 4 2022 14:59
UPDATED AT: Oct 4 2022 15:00
FRAMEWORK: PyTorch
STATUS: Published
MODEL URL: file:///tmp/tmpl_bxlmu2/model/data/model.pth
USER: DCID Admin
CREATING EXPERIMENT: DeepSig simple NN
ARCHIVED: No
PROJECT: DeepSatcom
DESCRIPTION: snapshot /tmp/tmpl_bxlmu2/model/data/model.pth stored
Created by task id: ac8d8624f6da4ce3aeb78533fa8d9d

  
  
Posted one year ago

Answers 21


When I start the serving containers it can't retrieve the model:

Hi BrightRabbit75
I think you need to pass the credentials for your S3 account to the clearml-serving containers.
Basically, just add AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to your docker compose:

https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L110
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/docker/docker-compose-triton-gpu.yml#L83
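
For reference, the relevant part of those compose files looks roughly like this (a sketch, not verbatim; the variables are the standard AWS ones, and the values come from your environment or example.env):

    environment:
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-}
      AWS_DEFAULT_REGION: ${AWS_DEFAULT_REGION:-}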

  
  
Posted one year ago

Yep, that's the reason it is failing. How did you train the model itself?

  
  
Posted one year ago

AgitatedDove14 - fantastic support! I was indeed missing the output_uri; I had evidently commented it out with a "FIXME - why is this here?" comment. So now I see the model on the S3 server and the web UI properly shows its path:
...I've removed the model from the serving instance to start fresh, and the clearml-serving docker containers all come up happy. However, when I run clearml-serving model add ... it uses the wrong URL (an https:// instead of an s3://), so it can't upload the preprocess.py file:
clearml-serving - CLI for launching ClearML serving engine
Serving service Task 2405af60fec342f680c41d8343a25319, Adding Model endpoint '/deep_satcom_test/'
Info: syncing model endpoint configuration, state hash=d3290336c62c7fb0bc8eb4046b60bc7f
Warning: Found multiple Models for '{'project_name': 'DeepSatcom', 'model_name': None, 'tags': None, 'only_published': True, 'include_archived': False}', selecting id=0f516b61b45c48509df12f93389a29ae
2022-10-05 08:40:39,319 - clearml.storage - ERROR - Failed uploading: Could not connect to the endpoint URL: " "
There already was a preprocess.py file on the S3 server, so something must have worked correctly earlier. If I do model add without the --preprocess option, it completes fine.

  
  
Posted one year ago

Also - I don't think I'm saving the model correctly. It appears I need to convert it into TorchScript?

  
  
Posted one year ago

All this mess was caused by a docker bridge IP subnet clash with our VPN subnet. 😞
Thanks for all the help AgitatedDove14!
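
For anyone hitting the same clash: Docker carves bridge networks from a default address pool, so if that pool overlaps your VPN routes, container traffic to the S3 endpoint goes nowhere. One way to steer Docker to a non-clashing range (the pool below is illustrative) is /etc/docker/daemon.json, followed by a daemon restart:

    {
      "default-address-pools": [
        { "base": "10.210.0.0/16", "size": 24 }
      ]
    }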

  
  
Posted one year ago

I've added that to the example.env (same creds etc. as in the clearml.conf), and I can see the metrics/artifacts on the S3 server. I can't find any actual model files on the server though. However, I swear it worked once.

Next time online I'll attach to the containers and verify the AWS creds are present. I guess there's a way to request the model from the CLI/Python env that I can test as well.
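
For reference, a quick Python check like this should exercise the same S3 credentials the serving containers need (model ID is the one from the UI above):

    from clearml import Model

    # Fetch the registered model by ID and try to pull it locally; if the
    # S3 creds/endpoint are wrong, this fails the same way the container does.
    model = Model(model_id="8ff15f9955794520af8d4710a99b66a7")
    print(model.url)               # should be an s3:// URL once output_uri is set
    print(model.get_local_copy())  # downloads the weights using clearml.conf creds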

  
  
Posted one year ago

To auto-upload the model you have to tell ClearML to upload it somewhere, usually by passing output_uri to Task.init or by setting default_output_uri in the clearml.conf.
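
A minimal sketch of the training-side change (project/task names mirror this thread; the bucket path is illustrative):

    from clearml import Task

    # With output_uri set, model snapshots get uploaded to S3 instead of
    # being recorded as a local file:///tmp/... path.
    task = Task.init(
        project_name="DeepSatcom",
        task_name="DeepSig simple NN",
        output_uri="s3://my-bucket/models",  # or set default_output_uri in clearml.conf
    )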

  
  
Posted one year ago

clearml-serving --id ${SERVE_TASK_ID} model add --engine triton --preprocess "preprocess.py" --endpoint "deep_satcom_test" --project "DeepSatcom" --published --input-size 4 128 1 --input-name INPUT__0 --input-type float32 --output-size 2 --output-name OUTPUT__0 --output-type float32

  
  
Posted one year ago

it is using the wrong URL - an https:// instead of an s3:// so it can't upload the preprocess.py file

What do you see as the link in the UI for that specific model?
What is the full model add command?

  
  
Posted one year ago

I can't seem to figure out what the names should be from the PyTorch example - where did INPUT__0 come from?

This is actually the layer name in the model:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/examples/pytorch/train_pytorch_mnist.py#L24
which is just the default name PyTorch gives the layer:
https://discuss.pytorch.org/t/how-to-get-layer-names-in-a-network/134238
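
A quick way to see the default names for your own network (toy module here, not the real one):

    import torch.nn as nn

    # PyTorch assigns numeric/default names to unnamed submodules
    model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    for name, module in model.named_modules():
        print(name or "<root>", "->", module.__class__.__name__)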

it appears I need to convert it into TorchScript?

Yes, this is a Triton limitation (actually it is for the best; TorchScript is the optimized version of your model).
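
A sketch of the conversion, assuming a trained nn.Module and the 4x128x1 input shape from the model add command in this thread (the stand-in network is illustrative):

    import torch
    import torch.nn as nn

    # Stand-in for the real trained network; shapes match --input-size 4 128 1
    # and --output-size 2 from the CLI command above.
    model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 128 * 1, 2))
    model.eval()

    # Trace with a representative batch and save; Triton's PyTorch backend
    # serves TorchScript files, not pickled nn.Modules.
    example = torch.randn(1, 4, 128, 1)
    torch.jit.trace(model, example).save("model.pt")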

  
  
Posted one year ago

Okay, that makes sense. If this is the case, I'm assuming you have set the files server to point to your S3 bucket, is that correct?
Could it be you are missing the credentials for that? (It is trying to upload the preprocessing code there, so the clearml-serving container can pull it later.)

  
  
Posted one year ago

I can't find any actual model files on the server though.

What do you mean? Do you see the specific models in the web UI? Is the link valid?

  
  
Posted one year ago

In clearml.conf:
files_server: " "

  
  
Posted one year ago

"This is Not a an S3 endpoint... what is the files server you configured for it?

  
  
Posted one year ago

I'm having trouble just verifying the bucket via boto3 directly, so something else might be amiss causing this issue.
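
For reference, the direct check I'm attempting looks like this (endpoint matches my config below; key/secret redacted and the bucket name is illustrative):

    import boto3

    # Talk to the on-prem S3 gateway directly, bypassing ClearML, using the
    # same host/key/secret as the sdk.aws.s3 section of clearml.conf.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://e3-storage-grid-gateway.aero.org:443",
        aws_access_key_id="****",
        aws_secret_access_key="****",
    )
    print(s3.list_objects_v2(Bucket="my-bucket", MaxKeys=5))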

  
  
Posted one year ago

The model URL (shown above) looks invalid:
file:///tmp/tmpl_bxlmu2/model/data/model.pth

I was expecting something like s3://...

  
  
Posted one year ago

No, it's a real S3 server that we have on-prem.
aws {
    s3 {
        region: ""
        key: ""
        secret: ""
        use_credentials_chain: false
        credentials: [
            {
                host: "e3-storage-grid-gateway.aero.org:443"
                key: "****"
                secret: "****"
                multipart: false
                secure: true
            }
        ]
    }
    boto3 {
        pool_connections: 512
        max_multipart_concurrency: 16
    }
}
  
  
Posted one year ago

Yes - the files_server is set to s3://... and the credentials are in the clearml.conf in the sdk/aws/s3 section. I'm trying to debug my way through the code to see where it fails and see if I can tell what's wrong.

  
  
Posted one year ago

Ah, that's what I'm missing. Will test tomorrow. I should have started with the example instead of my existing experiment. Thanks AgitatedDove14 !

  
  
Posted one year ago

Is this like a local MinIO?
What do you have under the sdk/aws/s3 section?

  
  
Posted one year ago

I'm now stuck on the actual request. I took a guess on the Triton config with input/output names etc., so I think I'm doing something wrong there. I can't seem to figure out what the names should be from the PyTorch example - where did INPUT__0 come from?

  
  
Posted one year ago