Answered
Hi It Is Me Again, This Time Trying To Upload A Single File As Dataset But Met With The Following Error. The File Is 13.42Gb And Of Apache Arrow Format. Any Idea How To Solve This Error Please? Thank You.

Hi, it is me again. This time I am trying to upload a single file as a Dataset, but I am met with the following error. The file is 13.42 GB and in Apache Arrow format. Any idea how to solve this error, please? Thank you.

Generating SHA2 hash for 1 files
100%|███████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:38<00:00, 38.55s/it]
Hash generation completed
0%| | 0/1 [00:00<?, ?it/s]
Compressing local files, chunk 1 [remaining 1 files]
100%|██████████████████████████████████████████████████████████████████████████████████████████| 1/1 [15:37<00:00, 937.45s/it]
File compression completed: total size 5.34 GB, 1 chunked stored (average size 5.34 GB)
Uploading compressed dataset changes 1/1 (1 files 5.34 GB) to
2022-02-18 01:07:04,908 - clearml.storage - ERROR - Exception encountered while uploading string longer than 2147483647 bytes
Traceback (most recent call last):
  File "project-x/upload-dataset-from-local.py", line 65, in <module>
    dataset.upload()
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/datasets/dataset.py", line 445, in upload
    delete_after_upload=True, wait_on_upload=True)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/task.py", line 1685, in upload_artifact
    auto_pickle=auto_pickle, preview=preview, wait_on_upload=wait_on_upload)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/binding/artifacts.py", line 617, in upload_artifact
    wait_on_upload=wait_on_upload)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/binding/artifacts.py", line 795, in _upload_local_file
    StorageManager.upload_file(local_file.as_posix(), uri, wait_for_upload=True, retries=ev.retries)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/manager.py", line 80, in upload_file
    retries=retries,
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/cache.py", line 81, in upload_file
    local_file, remote_url, async_enable=not wait_for_upload, retries=retries,
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/helper.py", line 575, in upload
    res = self._do_upload(src_path, dest_path, extra, cb, verbose=False, retries=retries)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/helper.py", line 979, in _do_upload
    raise last_ex
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/helper.py", line 963, in _do_upload
    if not self._upload_from_file(local_path=src_path, dest_path=dest_path, extra=extra):
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/helper.py", line 941, in _upload_from_file
    extra=extra)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/helper.py", line 1174, in upload_object
    object_name=object_name, extra=extra, callback=callback, **kwargs)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/storage/helper.py", line 1094, in upload_object_via_stream
    headers=container.get_headers(full_url))
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/requests/sessions.py", line 577, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/requests/sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/clearml/backend_api/utils.py", line 85, in send
    return super(SessionWithTimeout, self).send(request, **kwargs)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/requests/sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/requests/adapters.py", line 450, in send
    timeout=timeout
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/derek/.pyenv/versions/py37/lib/python3.7/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 1075, in _send_output
    self.send(chunk)
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/http/client.py", line 997, in send
    self.sock.sendall(data)
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/ssl.py", line 1034, in sendall
    v = self.send(byte_view[count:])
  File "/Users/derek/.pyenv/versions/3.7.12/lib/python3.7/ssl.py", line 1003, in send
    return self._sslobj.write(data)
OverflowError: string longer than 2147483647 bytes
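
from clearml import Dataset

# Create a new dataset, add the local folder, then upload and finalize it
dataset = Dataset.create("C4_realnewslike_filtered", "project-x")
dataset.add_files("/Users/derek/Desktop/project-x-artifacts/filtered_dataset")
dataset.upload()
dataset.finalize()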

  
  
Posted 2 years ago

Answers 2


"total size 5.34 GB, 1 chunked stored (average size 5.34 GB)"
PanickyAnt52, the issue is that the Dataset will not split an individual file: it packages a large folder into multiple zip files, but it does not break up a single file.
The upload itself is limited by the HTTP interface (i.e. a 2 GB file size limit), and here the single compressed chunk is 5.34 GB.
I would just encode the data into multiple Arrow files.
Does that make sense?
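
One possible way to do the re-encoding is sketched below with pyarrow. This is only a sketch under assumptions: the source is taken to be in the Arrow IPC file format (if it was written as an IPC stream, `pa.ipc.open_stream` would be needed instead), and the names `plan_shards`, `split_arrow_file`, the `part-XXXXX.arrow` naming, and the paths in the usage note are hypothetical. The rows-per-shard value has to be tuned by hand so that each shard ends up comfortably below the 2 GB limit.

```python
import os
from typing import List, Tuple


def plan_shards(num_rows: int, rows_per_shard: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover num_rows rows in order."""
    if rows_per_shard <= 0:
        raise ValueError("rows_per_shard must be positive")
    return [
        (start, min(rows_per_shard, num_rows - start))
        for start in range(0, num_rows, rows_per_shard)
    ]


def split_arrow_file(src_path: str, out_dir: str, rows_per_shard: int) -> List[str]:
    """Split one Arrow IPC file into several smaller Arrow IPC files.

    Assumption: src_path is in the Arrow IPC *file* format.
    """
    # Imported here so that plan_shards can be used without pyarrow installed.
    import pyarrow as pa

    os.makedirs(out_dir, exist_ok=True)
    paths = []
    # Memory-map the source so the whole file is not loaded into RAM at once.
    with pa.memory_map(src_path, "r") as source:
        table = pa.ipc.open_file(source).read_all()
        shards = plan_shards(table.num_rows, rows_per_shard)
        for index, (offset, length) in enumerate(shards):
            shard = table.slice(offset, length)
            # Hypothetical naming scheme for the output shards.
            path = os.path.join(out_dir, f"part-{index:05d}.arrow")
            with pa.OSFile(path, "wb") as sink:
                with pa.ipc.new_file(sink, shard.schema) as writer:
                    writer.write_table(shard)
            paths.append(path)
    return paths
```

For example, `split_arrow_file("filtered_dataset/data.arrow", "filtered_dataset_shards", 1_000_000)` (hypothetical paths and row count) would write the shards into a new folder, which can then be passed to `dataset.add_files(...)` in place of the original folder.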

  
  
Posted 2 years ago

Thanks, let me encode them into multiple files and try again.

  
  
Posted 2 years ago