Hello Everyone Again! So, I Have A Bit Of An Issue This Time Where Sometimes ClearML Won't Be Able To Find A File On S3, Occasionally It Logs A 503 Error Too Where It Has Exceeded Its 4 Max Retries. So, Essentially, It's A Server Problem In A Way. Howeve

Answered

Hello everyone again!

So, I have a bit of an issue this time: sometimes ClearML can't find a file on S3, and occasionally it also logs a 503 error saying it has exceeded its maximum of 4 retries. So, essentially, it's a server problem in a way. However, ClearML doesn't appear to retry adding external files after a failed attempt. If I manually check the paths it says it couldn't list, they do exist, so at the time ClearML ran its check during the add_external_files call, the server presumably refused the connection (or similar) and ClearML simply "couldn't find the file".

It would be great to know how to solve this with the current version. Two things would help: first, retrying the add of an external file that was reported as non-existent the first time around, and second, specifying a rate limit for requests sent to the S3 host.

Attached is a diagram to help illustrate the issue at hand. Now, I would very much rather avoid writing a patch for this locally, but if that's the only way to solve this, I can do that if absolutely unavoidable of course. Nonetheless, some features to think about adding!

Sincerely,
Matiiss

  				
Posted 9 months ago by LonelyFly9


Answers 9

So, I monkey-patched this fix into my code, but that still did not help. Frankly, I have now just patched the _add_external_files method to check and list the files again if it fails. I think that is something you could add as well: retries inside the _add_external_files method itself, so that it retries calling StorageManager.exists_file, because that appears to be the main point of failure in this case. It's not a failure caused by ClearML per se, but that is where the rest of the process breaks: the file is not added, despite existing, simply because the server decides to refuse the request. So some way to retry that bit (a configurable number of times and with a configurable delay, of course) would be great. SmugDolphin23
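
A minimal sketch of the kind of retry described above, assuming the existence check is passed in as a plain callable. The helper name exists_with_retries and its parameters are hypothetical and not part of ClearML; treating every exception as a negative result is also an assumption made for illustration.

```python
import time
from typing import Callable


def exists_with_retries(
    check: Callable[[str], bool],
    url: str,
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check(url) up to `attempts` times; return True on the first success."""
    for attempt in range(attempts):
        try:
            if check(url):
                return True
        except Exception:
            # Assumption: an error (e.g. a refused connection) is treated
            # like a negative result and simply retried.
            pass
        if attempt < attempts - 1:
            # Exponential backoff: delay, 2 * delay, 4 * delay, ...
            sleep(delay * (2 ** attempt))
    return False
```

With ClearML installed, this could presumably be called as exists_with_retries(StorageManager.exists_file, "s3://my-bucket/some/key"), assuming exists_file accepts the remote URL as its only required argument; the bucket path here is a placeholder.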

  				
Posted 9 months ago by LonelyFly9

Hi LonelyFly9! I think that just adding some retries to exists_file is a good idea, so maybe we will do just that 👍

  				
Posted 8 months ago by SmugDolphin23

Maybe there's a way to pass some additional config to the boto3 client? Perhaps change the retry mode to the adaptive one?

  				
Posted 9 months ago by LonelyFly9

Also, this was not happening when adding fewer files. When I kept running into this issue, I was trying to add 1.7M files in a single call to add_external_files. Then I tried batches of 100k, but it still failed to list some of the files (that were actually there). Now I'm running batches of 10k, which seems to work fine (at least for now). However, it is rather slow: it takes about 20 minutes to upload those 10k, and I have about 170 batches.

  				
Posted 9 months ago by LonelyFly9

I have a rate limit of 600 requests per minute, and I was running into it even with a single worker. The 503 is only part of the issue. The bigger issue (which could share a cause with the 503) is that the host evidently rejects the request when ClearML checks whether the file exists, so ClearML throws that log message about not being able to list/find a file. However, the file is actually there; the server just seems to refuse to respond (likely due to the aforementioned rate limiting). The 503 also suggests that this has to do with rate limits.
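
To make the 600 requests per minute budget concrete, here is a rough sketch of a client-side limiter that spaces calls evenly across threads. The RateLimiter class is hypothetical and not something ClearML or boto3 provides; each worker would have to call acquire() before every request it sends to the host.

```python
import threading
import time


class RateLimiter:
    """Allow at most `rate` calls per `per` seconds by spacing them evenly."""

    def __init__(self, rate: int, per: float = 60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._interval = per / rate      # e.g. 60 / 600 = 0.1 s between calls
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def acquire(self) -> float:
        """Block until the next free slot; return the time waited in seconds."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            # Reserve the slot so other threads queue up behind it.
            self._next_slot = slot + self._interval
        wait = slot - now
        if wait > 0:
            # Sleep outside the lock so other threads can reserve their slots.
            self._sleep(wait)
        return wait
```

The clock and sleep parameters exist only so the spacing can be checked without real waiting; in normal use the defaults apply, e.g. limiter = RateLimiter(600) followed by limiter.acquire() before each request.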

  				
Posted 9 months ago by LonelyFly9

Hi LonelyFly9, what is the reason you're getting a 503 from the service?

  				
Posted 9 months ago by CostlyOstrich36

Also, even using AWS_MAX_ATTEMPTS and AWS_RETRY_MODE did not help. I had set AWS_MAX_ATTEMPTS to 1024 and it still failed, so I would assume that this boto3 configuration unfortunately doesn't really help at all. Maybe the adaptive mode I was using is still technically experimental and so wasn't really doing anything. I don't know; I just know that it fails.

  				
Posted 9 months ago by LonelyFly9

Hi LonelyFly9! ClearML does not allow those to be configured, but you might consider setting the AWS_RETRY_MODE and AWS_MAX_ATTEMPTS env vars, which are described in the boto3 docs.
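
As a small illustration, the two variables could be set from Python before any S3 client is created. The value 32 is only an example, and the assumption here is that botocore reads both variables when it builds the client, so they need to be in place before ClearML touches S3.

```python
import os

# Set these before anything creates a boto3/botocore S3 client
# (assumption: the values are picked up at client creation time).
os.environ["AWS_RETRY_MODE"] = "adaptive"  # other modes: "legacy", "standard"
os.environ["AWS_MAX_ATTEMPTS"] = "32"      # total attempts, including the first call
```

Exporting the same variables in the shell that launches the script should have the same effect.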

  				
Posted 9 months ago by SmugDolphin23

SmugDolphin23 Thanks for the response! Configuring those env vars seems to help, but even with the adaptive mode and 32 or 64 max attempts it still fails at some point. Granted, I was using 8 workers and uploading all 1.7M files in a single call to add_external_files, but I would have expected the adaptive mode to, well, adapt to that, especially with that many attempts. Currently I'm back to sending them in batches, this time 50k files per batch (so about 34 batches in total) with a minute of sleep between batches, and the adaptive mode, or at least the max attempts configuration, does seem to help a lot here.
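
The batching workaround described here can be sketched roughly as follows. The helper names batched and add_in_batches are hypothetical, and add_batch stands in for whatever call actually registers the files.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def add_in_batches(
    urls: Sequence[str],
    add_batch: Callable[[Sequence[str]], object],
    batch_size: int = 50_000,
    pause: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call add_batch once per batch, pausing between batches; return the batch count."""
    batches = 0
    for batch in batched(urls, batch_size):
        if batches:
            # Give the server a break before the next burst of requests.
            sleep(pause)
        add_batch(batch)
        batches += 1
    return batches
```

With a ClearML dataset object, add_batch could presumably be something like lambda b: dataset.add_external_files(source_url=list(b)), assuming source_url accepts a list of URLs. For 1.7M files and the default batch size this gives the 34 batches mentioned above.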

However, it would be nice if there was a way to specify rate limits in ClearML, because it seems to me that the main issue is that it sends a bunch of .exists requests, followed by .get_metadata requests for files that are found. When the server refuses or rejects the existence check, ClearML instead tries to list the path as if it were a directory, which it isn't, and so the whole thing falls apart.

I would propose a rather naive (and seemingly simple) solution in the form of a feature that lets one specify the rate limit in the add_external_files method. Suppose it's expressed in requests per second:

def add_external_files(self, ..., requests_per_second: int | None = None):
    if requests_per_second is not None:
        if max_workers is None:
            # Mirror ThreadPoolExecutor's default (changed in Python 3.8):
            # max_workers = min(32, os.cpu_count() + 4).
            # os.cpu_count() can return None, hence the fallback to 1.
            estimated_max_workers = min(32, (os.cpu_count() or 1) + 4)
        else:
            estimated_max_workers = max_workers

        sleep_time = estimated_max_workers / requests_per_second
    else:
        sleep_time = None

    ...

    with ThreadPoolExecutor(...) as tp:
        for ...:
            ...(
                tp.submit(
                    self._add_external_files,
                    ...,
                    sleep_time=sleep_time,
                )
            )

def _add_external_files(self, ..., sleep_time: float | None = None):
    start_time = time.perf_counter()

    ...

    if sleep_time is not None:
        total_time = time.perf_counter() - start_time
        remaining_sleep_time = sleep_time - total_time
        if remaining_sleep_time > 0:
            time.sleep(remaining_sleep_time)
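
As a quick check of the arithmetic in the proposal above, here is the per-task sleep time worked out for the numbers mentioned in this thread (8 workers, and 600 requests per minute, i.e. 10 per second). The function per_task_sleep is just an illustrative name for the sleep_time formula, and it assumes each task sends exactly one request.

```python
def per_task_sleep(max_workers: int, requests_per_second: float) -> float:
    """Minimum duration of each task so that all workers together stay within the limit."""
    return max_workers / requests_per_second


# 8 workers sharing 10 requests per second: each task must take at least 0.8 s.
print(per_task_sleep(8, 10))  # 0.8
```

If a task issues more than one request (for example an existence check followed by a metadata call), the requests_per_second value passed in would need to be divided by the number of requests per task.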

I feel as though an addition like this one would be rather beneficial to some other people as well.
Let me know what you think or if you have any other suggestions on how to handle this!
Thanks!

  				
Posted 9 months ago by LonelyFly9
