Answered
Hi. Question About Dataset Upload Errors: When Uploading A

Hi.
Question about Dataset upload errors:

When uploading a clearml.Dataset created with output_uri="gs://lavi_test/datasets"
after adding 20 files of 50 MB each (filled with random bits, so they are essentially incompressible),
I get multiple errors at dataset.upload():
Uploading dataset changes (10 files compressed to 488.31 MiB) to
Uploading dataset changes (10 files compressed to 488.31 MiB) to
2022-11-14 19:47:04,825 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2022-11-14 19:48:05,905 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:49:04,846 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:50:05,242 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:50:05,296 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:51:05,803 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:51:05,806 - clearml.storage - ERROR - Exception encountered while uploading Upload failed
File compression and upload completed: total size 976.62 MiB, 2 chunk(s) stored (average size 488.31 MiB)

After dataset.finalize(), in this instance I saw only one of the two chunks as a file in GCS (looking in my bucket through the GCP console).
When I performed dataset.get_local_copy(), it retrieved only half of the files.

When I repeated the process using small files, things worked correctly. When I tried with larger (100 MB) files, all chunks failed to upload and nothing was downloaded by get_local_copy().

Q1: How can I fix this? Is there a timeout parameter?
Q2: How do I know that my Dataset is in a bad state? Should it, perhaps, refuse to finalize? The Dataset.upload() function has no return value.

I have clearml 1.8.0
My clearml server is a self-hosted one (on gcp)
I am running the code (and uploading) from my laptop, and my internet connection is relatively good (20 Mbit/s upload speed).
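Regarding Q2, one possible stopgap is to watch the log rather than the return value. The sketch below is not part of ClearML; it assumes the errors are emitted through the standard Python logging module under the logger name clearml.storage (as the log lines above suggest) and that they are logged before upload() returns. The names ErrorCounter and run_and_check are made up for illustration.

```python
import logging


class ErrorCounter(logging.Handler):
    """Logging handler that remembers every ERROR (or worse) record it sees."""

    def __init__(self):
        super().__init__(level=logging.ERROR)
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def run_and_check(action, logger_name="clearml.storage"):
    """Run action() and raise if the named logger reported an error meanwhile.

    The logger name is an assumption taken from the log output in the
    question; adjust it if your installation logs elsewhere.
    """
    logger = logging.getLogger(logger_name)
    counter = ErrorCounter()
    logger.addHandler(counter)
    try:
        result = action()
    finally:
        logger.removeHandler(counter)
    if counter.messages:
        raise RuntimeError(
            "%d error(s) logged during the call, first one: %s"
            % (len(counter.messages), counter.messages[0])
        )
    return result
```

It could be used as run_and_check(lambda: dataset.upload()) before calling dataset.finalize(), so that a pipeline step fails loudly instead of finalizing a partial dataset.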

  
  
Posted 2 years ago

Answers 16


🙏

  
  
Posted 2 years ago

Unfortunately that is correct. It continues as if nothing happened!

Oh dear, let me make sure this is taken care of.
And thank you for the reproduction code!

  
  
Posted 2 years ago

Hi PanickyMoth78, an RC with a fix is out, let me know if it works (notice that you can now set max_workers from the CLI or from the Dataset functions): pip install clearml==1.8.1rc1

  
  
Posted 2 years ago

I have google-cloud-storage==2.6.0 installed

  
  
Posted 2 years ago

PanickyMoth78 quick update: the fix is already being tested, and I'm hoping for an RC tomorrow 🙂

  
  
Posted 2 years ago

If Dataset.upload() does not crash or return a success value that I can check and

Are you saying that with this error showing upload data does not crash?

Unfortunately that is correct. It continues as if nothing happened!

To replicate this on Linux (even with max_workers=1):
Install wondershaper to throttle your connection (see https://averagelinuxuser.com/limit-bandwidth-linux/): sudo apt-get install wondershaper
Throttle your connection to about 1 Mbit/s with something like sudo wondershaper wlo1 1024 1024
(where wlo1 is my network interface)
Run the attached script: python dataset_fail.py

(To stop throttling: sudo wondershaper clear wlo1)
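The attached dataset_fail.py is not reproduced in this thread. As a rough stand-in for its data preparation step only, the following sketch generates the kind of test files described in the question (random, hence incompressible). The function name, file names and defaults are illustrative assumptions, not the contents of the original script.

```python
import os
from pathlib import Path


def make_random_files(folder, count=20, size_bytes=50 * 1024 * 1024):
    """Create `count` files of `size_bytes` random bytes each and return their paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index in range(count):
        path = folder / f"random_{index:02d}.bin"
        with open(path, "wb") as handle:
            remaining = size_bytes
            # Write in 1 MiB pieces so large files do not sit in memory at once.
            while remaining > 0:
                piece = min(remaining, 1024 * 1024)
                handle.write(os.urandom(piece))
                remaining -= piece
        paths.append(path)
    return paths
```

The resulting folder can then be added to a dataset with add_files() and uploaded over the throttled connection.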

  
  
Posted 2 years ago

Hi PanickyMoth78 ,

Can you try with pip install clearml==1.8.1rc0? It should include a fix for this issue.

  
  
Posted 2 years ago

My apologies, you are correct: 1.8.1rc0 🙂

  
  
Posted 2 years ago

Thanks AgitatedDove14
Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).

My main concern now is that this may happen within a pipeline leading to unreliable data handling.

If Dataset.upload() does not crash or return a success value that I can check, and if Dataset.get_local_copy() also does not complain as it retrieves partial data, how will I ever know that I lost part of my dataset?
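Until the library reports the failure itself, one way to detect the loss is to compare the source folder with the folder returned by get_local_copy(). This is only a sketch under the assumption that the dataset was built from a single folder, so that relative paths match on both sides; relative_files and compare_folders are hypothetical helper names.

```python
from pathlib import Path


def relative_files(root):
    """Map each file's path relative to `root` to its size in bytes."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): path.stat().st_size
        for path in root.rglob("*")
        if path.is_file()
    }


def compare_folders(source_dir, copy_dir):
    """Return (missing, size_mismatch): files absent from the copy, and
    files present in both folders whose sizes differ."""
    source = relative_files(source_dir)
    copy = relative_files(copy_dir)
    missing = sorted(set(source) - set(copy))
    size_mismatch = sorted(
        name for name in source.keys() & copy.keys() if source[name] != copy[name]
    )
    return missing, size_mismatch
```

For example, missing, changed = compare_folders("my_data", dataset.get_local_copy()) followed by a check that both lists are empty. A size comparison is cheap but weaker than comparing hashes.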

  
  
Posted 2 years ago

I can't find version 1.8.1rc1, but I believe I see a relevant change in the code of Dataset.upload in 1.8.1rc0.

  
  
Posted 2 years ago

Hi PanickyMoth78
Yes, I think you are correct, this looks like GS throttling your connection. You can control the number of concurrent uploads with max_workers=1
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/datasets/dataset.py#L604
Let me know if it works
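A minimal sketch of the suggestion above: pass max_workers to Dataset.upload() so that chunks are sent one at a time. The wrapper name upload_with_limit is made up for illustration; the dataset argument is expected to be a clearml Dataset that already has its files added.

```python
def upload_with_limit(dataset, max_workers=1):
    """Upload a dataset with a cap on the number of concurrent upload workers.

    `dataset` is assumed to be a clearml Dataset, created for example with
    Dataset.create(dataset_name=..., dataset_project=..., output_uri=...)
    and populated through dataset.add_files(path=...).
    """
    dataset.upload(max_workers=max_workers)
```

With max_workers=1 the uploads run sequentially, which trades speed for fewer simultaneous connections to the storage backend.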

  
  
Posted 2 years ago

this is the printout I get:

  
  
Posted 2 years ago

Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).

This seems like a question for Google Storage; maybe we should open an issue there, since their backend does the rate limiting.

My main concern now is that this may happen within a pipeline leading to unreliable data handling.

I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?

If Dataset.upload() does not crash or return a success value that I can check and

Are you saying that with this error showing upload data does not crash?
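The idea of a shared configuration value for max_workers could be approximated in user code until something official exists. The sketch below reads the limit from an environment variable; the variable name DATASET_UPLOAD_MAX_WORKERS is purely hypothetical and is not a ClearML setting.

```python
import os


def max_workers_from_env(var_name="DATASET_UPLOAD_MAX_WORKERS", default=None):
    """Read an upload worker limit from the environment.

    `var_name` is a made-up variable for this sketch, not something ClearML
    reads itself. Returns `default` when the variable is unset or empty.
    """
    raw = os.environ.get(var_name, "").strip()
    if not raw:
        return default
    value = int(raw)
    if value < 1:
        raise ValueError(f"{var_name} must be a positive integer, got {value}")
    return value
```

Each pipeline step could then call dataset.upload(max_workers=max_workers_from_env()), so that the limit is set once per worker machine rather than in every script.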

  
  
Posted 2 years ago
Tags
gcp