Answered

Hello everyone! I'm encountering an issue when trying to deploy an endpoint for a large-sized model or get inference on a large dataset (both exceeding ~100MB). It seems that they can only be downloaded up to about 100MB. Is there a way to increase a timeout variable somewhere to address this problem?

Here's an example log of downloading the dataset:

2024-04-02 17:56:29,932 - clearml.storage - INFO - Downloading: 5.00MB / 483.18MB @ 2.94MBs from
2024-04-02 17:56:31,380 - clearml.storage - INFO - Downloading: 10.00MB / 483.18MB @ 3.45MBs from
2024-04-02 17:56:32,907 - clearml.storage - INFO - Downloading: 15.00MB / 483.18MB @ 3.27MBs from
2024-04-02 17:56:34,492 - clearml.storage - INFO - Downloading: 20.00MB / 483.18MB @ 3.16MBs from
2024-04-02 17:56:35,989 - clearml.storage - INFO - Downloading: 25.00MB / 483.18MB @ 3.34MBs from
2024-04-02 17:56:37,476 - clearml.storage - INFO - Downloading: 30.00MB / 483.18MB @ 3.36MBs from
2024-04-02 17:56:39,032 - clearml.storage - INFO - Downloading: 35.00MB / 483.18MB @ 3.21MBs from
2024-04-02 17:56:40,685 - clearml.storage - INFO - Downloading: 40.00MB / 483.18MB @ 3.03MBs from
2024-04-02 17:56:42,150 - clearml.storage - INFO - Downloading: 45.00MB / 483.18MB @ 3.41MBs from
2024-04-02 17:56:43,674 - clearml.storage - INFO - Downloading: 50.00MB / 483.18MB @ 3.28MBs from
2024-04-02 17:56:45,301 - clearml.storage - INFO - Downloading: 55.00MB / 483.18MB @ 3.07MBs from
2024-04-02 17:56:46,770 - clearml.storage - INFO - Downloading: 60.00MB / 483.18MB @ 3.40MBs from
2024-04-02 17:56:48,248 - clearml.storage - INFO - Downloading: 65.00MB / 483.18MB @ 3.38MBs from
2024-04-02 17:56:49,810 - clearml.storage - INFO - Downloading: 70.00MB / 483.18MB @ 3.20MBs from
2024-04-02 17:56:51,257 - clearml.storage - INFO - Downloading: 75.00MB / 483.18MB @ 3.46MBs from
2024-04-02 17:56:52,724 - clearml.storage - INFO - Downloading: 80.00MB / 483.18MB @ 3.41MBs from
2024-04-02 17:56:54,404 - clearml.storage - INFO - Downloading: 85.00MB / 483.18MB @ 2.98MBs from
2024-04-02 17:56:55,830 - clearml.storage - INFO - Downloading: 90.00MB / 483.18MB @ 3.51MBs from
2024-04-02 17:56:57,318 - clearml.storage - INFO - Downloading: 95.00MB / 483.18MB @ 3.36MBs from
2024-04-02 17:56:58,846 - clearml.storage - INFO - Downloading: 100.00MB / 483.18MB @ 3.27MBs from
2024-04-02 17:56:59,679 - clearml.storage - INFO - Downloaded 100.75 MB successfully from , saved to /root/.clearml/cache/storage_manager/datasets/d3609f172946c9c4bd22e31631bd42af.dataset.a09c036283be4cd7835d64ba874a212c.9qj9j2m_.zip
2024-04-02 17:56:59,681 - clearml - WARNING - Exception File is not a zip file
Failed extracting zip file /root/.clearml/cache/storage_manager/datasets/d3609f172946c9c4bd22e31631bd42af.dataset.a09c036283be4cd7835d64ba874a212c.9qj9j2m_.zip
  
  
Posted 7 months ago

14 Answers


As I installed ClearML using pip,

Where is clearml-serving running? Usually your configuration file is in ~/clearml.conf.
Notice that if it is not there, the defaults are being used, so just create a new one and add that line.
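
For reference, a minimal sketch of what that ~/clearml.conf could look like, containing only the timeout override discussed below (300 is an example value, in seconds):

    # minimal ~/clearml.conf, everything else falls back to defaults
    sdk {
      http {
        timeout {
          total: 300
        }
      }
    }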

  
  
Posted 7 months ago

Or rather, any pointers to debug the problem further? Our GCP instances have a pretty fast internet connection, and we haven't faced this problem on those instances. It's only on this specific local machine that we're seeing this truncated download.

I say truncated because we checked the model.onnx size on the container, and it was, for example, 110MB, whereas the original is around 160MB.
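
A quick way to confirm truncation is to compare checksums rather than just sizes (container name and paths here are illustrative, not our actual ones):

    # inside the serving container
    docker exec -it <container_id> md5sum /models/model.onnx
    # on the machine holding the original file
    md5sum model.onnx

If the hashes differ and the container copy is smaller, the download was cut off mid-transfer.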

  
  
Posted 7 months ago

It’s only on this specific local machine that we’re facing this truncated download.

Yes, that's what the log says; makes sense.

Seems like this still doesn’t solve the problem, how can we verify this setting has been applied correctly?

Hmm, exec into the container? What did you put in clearml.conf?
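
For example, something like this (the container name is a guess based on the image name used below, adjust to yours):

    docker exec -it clearml-serving-inference cat /root/clearml.conf | grep -A 3 timeout

which should print the timeout block if the file made it into the container.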

  
  
Posted 7 months ago

@<1523701205467926528:profile|AgitatedDove14> Okay, we got to the bottom of this. It was actually because of the load balancer timeout settings we had, which were also set to 30 seconds and were confusing us.

We didn’t end up needing the above configs after all.
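
For anyone else hitting this: if your load balancer is a GCP HTTP(S) one, the backend timeout can be raised along these lines (the backend service name is a placeholder, and the exact command depends on your setup):

    gcloud compute backend-services update my-backend-service \
        --global --timeout=300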

  
  
Posted 7 months ago

@<1523701205467926528:profile|AgitatedDove14> this file is not getting mounted when using the docker-compose file for the clearml-serving pipeline, do we also have to mount it somehow?

The only place I can see this file being used is in the README, like so:

Spin the inference container:

docker run -v ~/clearml.conf:/root/clearml.conf -p 8080:8080 -e CLEARML_SERVING_TASK_ID=<service_id> -e CLEARML_SERVING_POLL_FREQ=5 clearml-serving-inference:latest
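
For context, a rough docker-compose equivalent of that run command would look like this (service and image names are taken from the command above, <service_id> stays a placeholder):

    services:
      clearml-serving-inference:
        image: clearml-serving-inference:latest
        ports:
          - "8080:8080"
        environment:
          CLEARML_SERVING_TASK_ID: <service_id>
          CLEARML_SERVING_POLL_FREQ: 5
        volumes:
          - ~/clearml.conf:/root/clearml.conf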

  
  
Posted 7 months ago

Thank you for your prompt response. As I installed ClearML using pip, I don't have direct access to the config file. Is there any other way to increase this timeout?

  
  
Posted 7 months ago

Okay, we got to the bottom of this. It was actually because of the load balancer timeout settings we had, which were also set to 30 seconds and were confusing us.

Nice!
btw:

in the clearml.conf we put this:

For future reference, you are missing the sdk section; it should be:

sdk.http.timeout.total: 300

The . notation works as well as the {} form.
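
i.e. these two spellings are equivalent:

    # dot notation
    sdk.http.timeout.total: 300

    # nested {} form
    sdk {
      http {
        timeout {
          total: 300
        }
      }
    }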

  
  
Posted 7 months ago

using the docker-compose file for the clearml-serving pipeline, do we also have to mount it somehow?

Oh yes, you are correct: the values are passed using environment variables (easier when using docker compose). You can, in addition, add a mount from the host machine to a conf file:

    volumes:
      - ${PWD}/clearml.conf:/root/clearml.conf

wdyt?

  
  
Posted 7 months ago

Yep, that makes sense. @<1671689437261598720:profile|FranticWhale40> plz give that a try

  
  
Posted 7 months ago

Oh...
Try adding this to your config file:

sdk.http.timeout.total = 300
  
  
Posted 7 months ago

Seems like this still doesn't solve the problem. How can we verify this setting has been applied correctly, other than checking the clearml.conf file on the container?

  
  
Posted 7 months ago

Hi @<1671689437261598720:profile|FranticWhale40>
You mean the download just fails on the remote serving node because it takes too long to download the model?
(Basically not a serving issue per se, but a download issue.)

  
  
Posted 7 months ago

in the clearml.conf we put this:

http {
  timeout {
     total: 300
  }
}

is that correct?

  
  
Posted 7 months ago

Yes exactly!

  
  
Posted 7 months ago