Hello everyone again! So, I have a bit of an issue this time where ClearML sometimes won't be able to find a file on S3; occasionally it also logs a 503 error saying it has exceeded its maximum of 4 retries. So, essentially, it's a server problem in a way. Howeve

@<1523701435869433856:profile|SmugDolphin23> Thanks for the response! Configuring those env vars seems to help, but even with adaptive mode and 32 or 64 max attempts, it still fails at some point. Granted, I was using 8 workers and passing all 1.7M files in a single call to add_external_files, but I would have expected adaptive mode to, well, adapt to that, especially with that many attempts. For now I'm back to sending the files in batches, this time 50k files per batch (about 34 batches in total) with a minute of sleep between batches. With that setup, adaptive mode, or at least the max attempts configuration, does seem to help a lot.
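For anyone wanting to reproduce the batching workaround described above, here is a minimal sketch. The helper names (iter_batches, add_in_batches) and the add_batch callback are hypothetical and not part of ClearML; the callback is whatever function actually registers one slice of URLs.

```python
import time
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(
    urls: Sequence[str],
    add_batch: Callable[[Sequence[str]], None],  # hypothetical callback, supplied by the caller
    batch_size: int = 50_000,
    pause_seconds: float = 60.0,
) -> int:
    """Call add_batch once per slice, sleeping between slices. Returns the number of batches."""
    count = 0
    for batch in iter_batches(urls, batch_size):
        if count > 0:
            # Pause before every batch except the first so the server can recover.
            time.sleep(pause_seconds)
        add_batch(batch)
        count += 1
    return count
```

In my case the callback would wrap the add_external_files call for one slice (assuming the installed ClearML version accepts a list of URLs there). With 1.7M URLs and the defaults this gives 34 batches and roughly 33 minutes of sleep in total.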

However, it would be nice if there was a way to specify rate limits in ClearML. It seems to me that the main issue is that it sends a burst of .exists requests, followed by .get_metadata requests for every file that is found. When the server rejects the existence check, ClearML falls back to listing the path as if it were a directory, which it isn't, so in the end the whole thing falls apart.

I would propose a rather naive (and seemingly simple) solution in the form of a feature that lets one specify the rate limit in the add_external_files method. Suppose it's expressed in requests per second:

def add_external_files(self, ..., requests_per_second: int | None = None):
    if requests_per_second is not None:
        if max_workers is None:
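            # No explicit worker count given: mirror ThreadPoolExecutor's default.
            # Changed in version 3.8: Default value of max_workers is changed to min(32, os.cpu_count() + 4).
            # os.cpu_count() can return None, so fall back to 1 in that case.
            estimated_max_workers = min(32, (os.cpu_count() or 1) + 4)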
        else:
            estimated_max_workers = max_workers

        sleep_time = estimated_max_workers / requests_per_second
    else:
        sleep_time = None

    ...

    with ThreadPoolExecutor(...) as tp:
        for ...:
            ...(
                tp.submit(
                    self._add_external_files,
                    ...,
                    sleep_time=sleep_time,
                )
            )

def _add_external_files(self, ..., sleep_time: float | None = None):
    start_time = time.perf_counter()

    ...

    if sleep_time is not None:
        total_time = time.perf_counter() - start_time
        remaining_sleep_time = sleep_time - total_time
        if remaining_sleep_time > 0:
            time.sleep(remaining_sleep_time)
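To make the idea above testable outside ClearML, here is a self-contained sketch of the same pacing logic. The names compute_sleep_time, paced and map_rate_limited are hypothetical helpers for illustration only; nothing here is ClearML's actual implementation.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def compute_sleep_time(
    requests_per_second: Optional[float], max_workers: Optional[int] = None
) -> Optional[float]:
    """Minimum duration of one task so that all workers together stay under the limit."""
    if requests_per_second is None:
        return None
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    if max_workers is None:
        # Same default ThreadPoolExecutor uses since Python 3.8.
        max_workers = min(32, (os.cpu_count() or 1) + 4)
    return max_workers / requests_per_second


def paced(fn: Callable, item, sleep_time: Optional[float]):
    """Run fn(item), then sleep for whatever is left of the task's time slot."""
    start = time.perf_counter()
    result = fn(item)
    if sleep_time is not None:
        remaining = sleep_time - (time.perf_counter() - start)
        if remaining > 0:
            time.sleep(remaining)
    return result


def map_rate_limited(
    fn: Callable,
    items: Iterable,
    requests_per_second: Optional[float] = None,
    max_workers: Optional[int] = None,
) -> List:
    """Apply fn to every item on a thread pool, returning results in input order."""
    sleep_time = compute_sleep_time(requests_per_second, max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(paced, fn, item, sleep_time) for item in items]
        return [future.result() for future in futures]
```

Two caveats with this approach: the limit is per task rather than per HTTP request, so if one file triggers both an .exists and a .get_metadata call the real request rate can be higher than the configured value, and all workers still fire at once at the very start, so the rate is only bounded on average.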

I feel as though an addition like this one would be rather beneficial to some other people as well.
Let me know what you think or if you have any other suggestions on how to handle this!
Thanks!

  				
Posted one year ago by LonelyFly9
