Answered

Hi Team, I have a hosted ClearML server. When I upload a large artifact (around 25MB) to the fileserver, I get a ConnectionResetError:
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 1964, in upload_artifact
    raise exception_to_raise
File "/opt/conda/lib/python3.8/site-packages/clearml/task.py", line 1945, in upload_artifact
    if self._artifacts_manager.upload_artifact(
File "/opt/conda/lib/python3.8/site-packages/clearml/binding/artifacts.py", line 780, in upload_artifact
    uri = self._upload_local_file(local_filename, name,
File "/opt/conda/lib/python3.8/site-packages/clearml/binding/artifacts.py", line 962, in _upload_local_file
    StorageManager.upload_file(local_file.as_posix(), uri, wait_for_upload=True, retries=ev.retries)
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/manager.py", line 80, in upload_file
    return CacheManager.get_cache_manager().upload_file(
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/cache.py", line 97, in upload_file
    result = helper.upload(
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/helper.py", line 757, in upload
    res = self._do_upload(src_path, dest_path, extra, cb, verbose=False, retries=retries)
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/helper.py", line 1189, in _do_upload
    raise last_ex
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/helper.py", line 1173, in _do_upload
    if not self._upload_from_file(local_path=src_path, dest_path=dest_path, extra=extra):
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/helper.py", line 1146, in _upload_from_file
    res = self._driver.upload_object(
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/helper.py", line 1422, in upload_object
    return self.upload_object_via_stream(iterator=stream, container=container,
File "/opt/conda/lib/python3.8/site-packages/clearml/storage/helper.py", line 1338, in upload_object_via_stream
    res = container.session.post(
File "/opt/conda/lib/python3.8/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
File "/opt/conda/lib/python3.8/site-packages/clearml/backend_api/utils.py", line 85, in send
    return super(SessionWithTimeout, self).send(request, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/requests/adapters.py", line 501, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Uploading works fine when the artifact is small (e.g. around 300KB). Any idea about the issue? I'd appreciate any hint.
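For reference, a minimal repro sketch of the scenario described above. The payload generation is self-contained; the ClearML calls at the bottom are the ones visible in the traceback, but the project/task/artifact names and the file path here are placeholders, not taken from the original post:

```python
import os
import tempfile

def make_payload(path, size_mb=25):
    """Write size_mb MB of zero bytes to simulate a large artifact."""
    with open(path, "wb") as f:
        f.write(b"\0" * (size_mb * 1024 * 1024))
    return os.path.getsize(path)

path = os.path.join(tempfile.gettempdir(), "big_artifact.bin")
size_bytes = make_payload(path)
print(size_bytes)  # 26214400 bytes = 25 MB

# The failing call looked roughly like this (requires a reachable ClearML server):
# from clearml import Task
# task = Task.init(project_name="demo", task_name="artifact-upload-test")
# task.upload_artifact(name="big-file", artifact_object=path, wait_on_upload=True)
```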

  
  
Posted one year ago
Votes Newest

Answers 12


Hi CostlyOstrich36 , I deployed the ClearML server in a k8s cluster using the version 5.5.0 Helm chart: https://github.com/allegroai/clearml-helm-charts/tree/clearml-5.5.0/charts/clearml , which deployed server v1.9.2, I think.
For the SDK, I am using v1.9.1.

  
  
Posted one year ago

Hi @<1523701827080556544:profile|JuicyFox94> , no, I expose the services using NodePort

  
  
Posted one year ago

Actually, 25MB is not very large

  
  
Posted one year ago

I know of deployments where people are uploading hundreds of MBs to the fileserver, so I don't think this is related to the file size

  
  
Posted one year ago

Maybe @<1523701087100473344:profile|SuccessfulKoala55> or @<1523701827080556544:profile|JuicyFox94> might have some insight into this 🙂

  
  
Posted one year ago

Do you have Ingresses enabled?

  
  
Posted one year ago

I don't think this is something we can configure in the fileserver...

  
  
Posted one year ago

Hi NervousRabbit2 , what version of ClearML server are you running? Also what clearml version are you using?

  
  
Posted one year ago

Ok, so we can exclude a timeout due to an ingress controller in the middle. It looks more like something related to connection management in the fileserver. @<1523701087100473344:profile|SuccessfulKoala55> Do we have a way to pass some env var to the fileserver as extraEnv to mitigate or fix this behavior?

  
  
Posted one year ago

Is it possible that there is a bug in the fileserver that prevents us from uploading a large file (say around 25MB)? Btw, if I switch the default output URI in the SDK to upload to an Azure blob storage instead of the fileserver, everything works fine.
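Switching the SDK's default output URI can be done in clearml.conf. A sketch of the relevant section, assuming Azure Blob Storage (the account and container names below are placeholders; check the ClearML docs for the exact URI scheme your SDK version expects):

```
sdk {
    development {
        # Upload artifacts/models to Azure Blob Storage instead of the fileserver
        default_output_uri: "azure://<account>.blob.core.windows.net/<container>"
    }
}
```

The same URI can also be passed per task via the `output_uri` argument of `Task.init`.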

  
  
Posted one year ago

It turned out that the issue was caused by my network environment. Somehow my network connection was being throttled, which led to the issue. Switching to a better network environment made it work.
However, when I tried to upload even larger artifacts in a row (around 200MB each), it failed because the livenessProbe and readinessProbe of the fileserver pod failed. By default, the timeout of the two probes is 1s. I increased the timeout to 100s and that fixed the issue. @<1523701827080556544:profile|JuicyFox94> @<1523701087100473344:profile|SuccessfulKoala55>
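The probe-timeout change described above can be expressed as a Helm values override along these lines (a sketch only; the exact key names and nesting may differ between chart versions, so check the chart's values.yaml):

```yaml
fileserver:
  # Increase probe timeouts so the pod isn't restarted while it is
  # busy handling large back-to-back uploads
  livenessProbe:
    timeoutSeconds: 100
  readinessProbe:
    timeoutSeconds: 100
```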

  
  
Posted one year ago

thanks for letting us know, I took a note to run more tests on liveness, ty again!

  
  
Posted one year ago