Hi. Question About Dataset Upload Errors: When Uploading A

Answered

Hi.
Question about Dataset upload errors:

When uploading a clearml.Dataset created with output_uri="gs://lavi_test/datasets",
after adding 20 files of 50 MB each (filled with random bits, so they are essentially incompressible),
I get multiple errors at dataset.upload():

Uploading dataset changes (10 files compressed to 488.31 MiB) to
Uploading dataset changes (10 files compressed to 488.31 MiB) to
2022-11-14 19:47:04,825 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
2022-11-14 19:48:05,905 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:49:04,846 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:50:05,242 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:50:05,296 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:51:05,803 - clearml.storage - ERROR - Failed uploading: ('Connection aborted.', timeout('The write operation timed out'))
2022-11-14 19:51:05,806 - clearml.storage - ERROR - Exception encountered while uploading Upload failed
File compression and upload completed: total size 976.62 MiB, 2 chunk(s) stored (average size 488.31 MiB)

After dataset.finalize(), in this instance I saw only one of the two chunks as a file in GCS (looking at my bucket through the GCP console).
When I performed dataset.get_local_copy(), it retrieved only half the files.

When I repeated the process using small files, things worked correctly. When I tried larger (100 MB) files, all chunks failed to upload and nothing was downloaded by get_local_copy().

Q1: How can I fix this? Is there a timeout parameter?
Q2: How do I know that my Dataset is in a bad state? Should it, perhaps, refuse to finalize? The Dataset.upload() function has no return value.

I have clearml 1.8.0
My clearml server is a self-hosted one (on gcp)
I am running the code (and uploading) from my laptop, and my internet connection is relatively good (20 Mbit/s upload speed).
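A rough back-of-the-envelope check may help put the timeouts in context. The sketch below estimates how long one 488.31 MiB chunk would take at 20 Mbit/s, assuming the whole uplink is available and ignoring protocol overhead. The function name and the even split of bandwidth between parallel uploads are assumptions made for illustration.

```python
def estimated_upload_seconds(size_mib: float,
                             bandwidth_mbit_per_s: float,
                             parallel_uploads: int = 1) -> float:
    """Estimate the seconds needed to upload size_mib MiB.

    Assumes the bandwidth is shared evenly between parallel uploads
    and that there is no protocol overhead.
    """
    size_bits = size_mib * 1024 * 1024 * 8
    bits_per_second = bandwidth_mbit_per_s * 1_000_000 / parallel_uploads
    return size_bits / bits_per_second


# Chunk size and uplink speed as reported in the question.
alone = estimated_upload_seconds(488.31, 20)
shared = estimated_upload_seconds(488.31, 20, parallel_uploads=2)
print(f"one chunk alone: {alone:.0f} s; two chunks in parallel: {shared:.0f} s each")
```

Under these assumptions a single chunk needs well over three minutes, and about twice that when two chunks share the link. The errors in the log are spaced roughly one minute apart, which would be consistent with a per-request write timeout of about 60 seconds, although the thread does not confirm the actual value.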

  				
Posted 2 years ago by PanickyMoth78


Answers 16

🙏

  				
Posted 2 years ago by PanickyMoth78

Unfortunately that is correct. It continues as if nothing happened!

oh dear, let me make sure this is taken care of
And thank you for the reproduce code!!!

  				
Posted 2 years ago by AgitatedDove14

Hi PanickyMoth78 an RC with a fix is out, let me know if it works (notice you can now set the max_workers from CLI or Dataset functions) pip install clearml==1.8.1rc1

  				
Posted 2 years ago by AgitatedDove14

I have google-cloud-storage==2.6.0 installed

  				
Posted 2 years ago by PanickyMoth78

PanickyMoth78 quick update the fix is already being tested, I'm hoping an RC tomorrow 🙂

  				
Posted 2 years ago by AgitatedDove14

If Dataset.upload() does not crash or return a success value that I can check and

Are you saying that with this error showing upload data does not crash? (edited)

Unfortunately that is correct. It continues as if nothing happened!

To replicate this on Linux (even with max_workers=1):
Install wondershaper to throttle your connection (see https://averagelinuxuser.com/limit-bandwidth-linux/): sudo apt-get install wondershaper
Throttle your connection to about 1 Mbit/s with something like sudo wondershaper wlo1 1024 1024
(where wlo1 is my network interface)
Run the attached script: python dataset_fail.py

(To stop throttling: sudo wondershaper clear wlo1)
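The attached dataset_fail.py is not included in this page. As a stand-in, here is a minimal sketch of how the test data described in the question (files of random, effectively incompressible bytes) could be generated. The function name, file naming scheme and demo sizes are assumptions for illustration, not the contents of the original script.

```python
import os
import tempfile
from pathlib import Path


def write_random_files(folder, count, size_bytes):
    """Write `count` files of `size_bytes` random bytes each into `folder`."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index in range(count):
        path = folder / f"random_{index:03d}.bin"
        # os.urandom output is effectively incompressible, so the
        # compressed chunks stay about as large as the raw files.
        path.write_bytes(os.urandom(size_bytes))
        paths.append(path)
    return paths


# Tiny demo in a temporary folder; the question used 20 files of 50 MB.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_random_files(tmp, count=3, size_bytes=1024)
    print([p.name for p in demo_paths])
```

To approximate the setup in the question, something like write_random_files("data", count=20, size_bytes=50 * 1024 * 1024) could be used before adding the folder to the dataset.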

  				
Posted 2 years ago by PanickyMoth78

maybe this line should take a timeout argument?
https://github.com/allegroai/clearml/blob/d45ec5d3e2caf1af477b37fcb36a81595fb9759f/clearml/storage/helper.py#L1834
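For context, google-cloud-storage's Blob.upload_from_filename accepts a timeout keyword. The sketch below shows what passing a longer timeout could look like when calling that library directly. It assumes, without having verified it, that the linked ClearML line ends up in a call of this kind; the wrapper name and the 600-second default are made up for illustration.

```python
def upload_with_timeout(blob, filename, timeout_seconds=600.0):
    """Upload `filename` through a google-cloud-storage Blob-like object.

    `blob` is expected to expose upload_from_filename(filename, timeout=...),
    as google.cloud.storage.Blob does. The timeout is in seconds.
    """
    blob.upload_from_filename(filename, timeout=timeout_seconds)


# Hypothetical usage against the bucket from the question (needs
# google-cloud-storage installed and credentials configured):
#
#   from google.cloud import storage
#   blob = storage.Client().bucket("lavi_test").blob("datasets/chunk.zip")
#   upload_with_timeout(blob, "chunk.zip", timeout_seconds=600.0)
```

This only illustrates the library-level parameter; whether and how ClearML exposes it is not established in this thread.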

  				
Posted 2 years ago by PanickyMoth78

https://github.com/allegroai/clearml/issues/819

  				
Posted 2 years ago by PanickyMoth78

Hi PanickyMoth78 ,

Can you try with pip install clearml==1.8.1rc0 ? it should include a fix for this issue

  				
Posted 2 years ago by TimelyPenguin76 (Administrator)

My apologies you are correct 1.8.1rc0 🙂

  				
Posted 2 years ago by AgitatedDove14

Thanks AgitatedDove14
Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).

My main concern now is that this may happen within a pipeline, leading to unreliable data handling.

If Dataset.upload() does not crash or return a success value that I can check, and if Dataset.get_local_copy() also does not complain as it retrieves partial data, how will I ever know that I lost part of my dataset?
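Until the library itself reports the failure, one possible safeguard is to compare the files the dataset claims to contain with what actually arrives on disk. The sketch below is a plain-Python check; the helper name is hypothetical, and the commented usage assumes Dataset.list_files() returns paths relative to the folder returned by get_local_copy().

```python
from pathlib import Path


def find_missing_files(expected_relative_paths, local_root):
    """Return the expected relative paths that do not exist under local_root."""
    root = Path(local_root)
    return sorted(
        str(rel) for rel in expected_relative_paths
        if not (root / rel).is_file()
    )


# Hypothetical usage with ClearML (needs a configured server):
#
#   from clearml import Dataset
#   ds = Dataset.get(dataset_id="<dataset id>")
#   missing = find_missing_files(ds.list_files(), ds.get_local_copy())
#   if missing:
#       raise RuntimeError(f"{len(missing)} file(s) missing, e.g. {missing[:3]}")
```

A check like this only detects missing files; it does not verify file contents.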

  				
Posted 2 years ago by PanickyMoth78

I can't find version 1.8.1rc1 but I believe I see a relevant change in code of Dataset.upload in 1.8.1rc0

  				
Posted 2 years ago by PanickyMoth78

Hi PanickyMoth78
Yes, I think you are correct, this looks like GCS throttling your connection. You can control the number of concurrent uploads with max_workers=1
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/datasets/dataset.py#L604
Let me know if it works
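As a minimal sketch of the suggestion above, assuming the installed ClearML version accepts max_workers on Dataset.upload() (as the linked source line suggests), the call could be wrapped like this. The wrapper name is made up for illustration, and the dataset and project names in the commented usage are placeholders.

```python
def upload_one_chunk_at_a_time(dataset):
    """Upload dataset chunks sequentially instead of in parallel.

    `dataset` is expected to be a clearml.Dataset (or any object with a
    compatible upload() method).
    """
    dataset.upload(show_progress=True, max_workers=1)


# Hypothetical usage (needs clearml installed and a configured server):
#
#   from clearml import Dataset
#   ds = Dataset.create(dataset_name="<name>", dataset_project="<project>",
#                       output_uri="gs://lavi_test/datasets")
#   ds.add_files("data")
#   upload_one_chunk_at_a_time(ds)
#   ds.finalize()
```

Sequential uploads give each chunk the full uplink, at the likely cost of a longer total upload time on fast connections.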

  				
Posted 2 years ago by AgitatedDove14

this is the printout I get:

  				
Posted 2 years ago by PanickyMoth78

Setting max_workers to 1 prevents the error (but, I assume, it may come at the cost of slower sequential uploads).

This seems like a question to GS storage, maybe we should open an issue there, their backend does the rate limit

My main concern now is that this may happen within a pipeline leading to unreliable data handling.

I'm assuming the pipeline code will have max_workers, but maybe we could have a configuration value so that we can set it across all workers, wdyt?

If Dataset.upload() does not crash or return a success value that I can check and

Are you saying that with this error showing upload data does not crash?

  				
Posted 2 years ago by AgitatedDove14

This seems relevant:
https://stackoverflow.com/questions/61001454/why-does-upload-from-file-google-cloud-storage-function-throws-timeout-error

  				
Posted 2 years ago by PanickyMoth78
