Maybe @<1523701087100473344:profile|SuccessfulKoala55> or @<1523701827080556544:profile|JuicyFox94> might have some insight into this 🙂
Is it possible that there is a bug in the fileserver that prevents us from uploading a large file (say, around 25MB)? Btw, if I switch the default output URI in the SDK to upload to an Azure blob storage instead of the fileserver, the upload works fine.
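For reference, this is roughly how I point the SDK at Azure instead of the fileserver (the storage account, container and file path below are just placeholders for my setup):
```python
from clearml import Task

# Route task outputs (artifacts, models) to Azure Blob Storage instead of the fileserver.
# The storage account / container names are placeholders - substitute your own.
task = Task.init(
    project_name="my-project",
    task_name="upload-test",
    output_uri="azure://mystorageaccount.blob.core.windows.net/my-container",
)

# The same ~25MB artifact uploads fine when it goes to Azure.
task.upload_artifact(name="large_file", artifact_object="/path/to/large_file.bin")
```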
It turned out that the issue was caused by my network environment. Somehow my network connection was being throttled, which caused the problem. Switching to a better network environment made it work.
However, when I tried to upload several even larger artifacts in a row (around 200MB each), it failed because the livenessProbe and readinessProbe of the fileserver pod failed. By default, the timeout of both probes is 1s. I increased the timeout to 100s and that fixed the issue. @<1523701827080556544:profile|JuicyFox94> @<1523701087100473344:profile|SuccessfulKoala55>
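In case it helps anyone else, this is roughly how I bumped the probe timeouts. The namespace, deployment name and container index are assumptions from my setup (check with `kubectl -n clearml get deploy` first), and a later `helm upgrade` may overwrite a manual patch like this:
```bash
# Increase the liveness/readiness probe timeouts on the fileserver container.
kubectl -n clearml patch deployment clearml-fileserver --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 100},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 100}
]'
```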
I don't think this is something we can configure in the fileserver...
Hi NervousRabbit2, what version of ClearML server are you running? Also, what clearml version are you using?
I know of deployments where people are uploading hundreds of MBs to the fileserver, so I don't think this is related
Thanks for letting us know, I took a note to do more tests on liveness, ty again!
Hi @<1523701827080556544:profile|JuicyFox94> , no, I expose the services using NodePort
Hi CostlyOstrich36, I deployed the ClearML server in a k8s cluster using the Helm chart version 5.5.0: https://github.com/allegroai/clearml-helm-charts/tree/clearml-5.5.0/charts/clearml , which deployed the v1.9.2 server, I think.
For the SDK, I am using v1.9.1.
Ok, so we can exclude a timeout caused by an ingress controller in the middle. It looks more like something related to connection management in the fileserver. @<1523701087100473344:profile|SuccessfulKoala55> Do we have a way to pass some env var to the fileserver as extraEnv to mitigate or fix this behavior?