Hi @<1724235687256920064:profile|LonelyFly9> , what is the reason you're getting 503 from the service ?
I have a rate limit of 600 requests per minute, and I was running into it even with a single worker. The 503 is only part of the issue, though. The bigger problem (possibly with the same cause) is that the host evidently rejects the request when ClearML checks whether the file exists, so it throws that log message about not being able to list/find a file. The file is actually there; the server just refuses to respond, likely because of the rate limiting. The 503 also suggests it has to do with rate limits.
Also, this was not happening when adding fewer files. When I kept running into this issue, I was trying to add 1.7M files in a single call to add_external_files. Then I tried batches of 100k, but it still failed to list some of the files (which were actually there). Now I'm running batches of 10k, which seems to work fine (at least for now), but it is rather slow: each batch of 10k takes about 20 minutes and I have about 170 batches.
Maybe there's a way to pass some additional config to the boto3 client? Perhaps change the retry mode to the adaptive one?
Hi @<1724235687256920064:profile|LonelyFly9> ! ClearML does not allow those to be configured, but you might consider setting the AWS_RETRY_MODE and AWS_MAX_ATTEMPTS env vars; both are covered in the boto3 docs.
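A minimal sketch of the env-var route, assuming the variables are set before any boto3 client is created (for instance at the very top of the script, before ClearML touches S3). The value 10 is an arbitrary example, not a recommendation.

```python
import os

# Set these before the first boto3/botocore client is created, since the
# retry settings are resolved when the client is built.
os.environ["AWS_RETRY_MODE"] = "adaptive"  # other modes: "legacy", "standard"
os.environ["AWS_MAX_ATTEMPTS"] = "10"      # total attempts incl. the first call; example value
```

Exporting the same two variables in the shell before launching the script should have the same effect.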
@<1523701435869433856:profile|SmugDolphin23> Thanks for the response! Configuring those env vars seems to help, but even with the adaptive mode and 32 or 64 max attempts it still fails at some point. Granted, I was using 8 workers and uploading all 1.7M in a single call to add_external_files, but I would have expected the adaptive mode to, well, adapt to that, especially with that many attempts. For now I'm back to sending them in batches, this time 50k files each (about 34 batches in total) with a minute of sleep in between, and the adaptive mode, or at least the max-attempts setting, does seem to help a lot here.
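The batching workaround described above could look roughly like this. It is a sketch only: `chunked`, `add_in_batches`, and the `add_batch` callback are hypothetical names introduced for illustration, and the defaults simply mirror the 50k-per-batch, one-minute-pause numbers from this message.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def add_in_batches(
    urls: Sequence[str],
    add_batch: Callable[[List[str]], object],  # hypothetical callback, e.g. wraps add_external_files
    batch_size: int = 50_000,
    pause_seconds: float = 60.0,
) -> int:
    """Call `add_batch` once per chunk, sleeping between chunks. Returns the batch count."""
    batches = 0
    for batch in chunked(urls, batch_size):
        if batches:  # no pause before the first batch
            time.sleep(pause_seconds)
        add_batch(batch)
        batches += 1
    return batches
```

With a ClearML dataset this might be called as `add_in_batches(urls, lambda b: dataset.add_external_files(source_url=b))`, assuming `add_external_files` accepts a list of URLs for `source_url`.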
However, it would be nice if there was a way to specify rate limits in clearml. It seems to me the main issue is that it sends a bunch of .exists requests, followed by .get_metadata requests if the file is found. When the server rejects the existence check, it instead tries to list the path as if it were a directory, which it isn't, and so in the end the whole thing falls apart.
I would propose a rather naive (and seemingly simple) solution in the form of a feature that lets one specify the rate limit in the add_external_files method. Suppose it's requests per second:
def add_external_files(self, ..., requests_per_second: int | None = None):
    if requests_per_second is not None:
        if max_workers is None:
            # Changed in version 3.8: Default value of max_workers is changed to min(32, os.cpu_count() + 4).
            estimated_max_workers = min(32, os.cpu_count() + 4)
        else:
            estimated_max_workers = max_workers
        sleep_time = estimated_max_workers / requests_per_second
    else:
        sleep_time = None
    ...
    with ThreadPoolExecutor(...) as tp:
        for ...:
            ...(
                tp.submit(
                    self._add_external_files,
                    ...,
                    sleep_time=sleep_time,
                )
            )

def _add_external_files(self, ..., sleep_time: float | None = None):
    start_time = time.perf_counter()
    ...
    if sleep_time is not None:
        total_time = time.perf_counter() - start_time
        remaining_sleep_time = sleep_time - total_time
        if remaining_sleep_time > 0:
            time.sleep(remaining_sleep_time)
I feel as though an addition like this one would be rather beneficial to some other people as well.
Let me know what you think or if you have any other suggestions on how to handle this!
Thanks!
So, I monkey-patched this fix into my code, but it still did not help. Frankly, I have now just made the _add_external_files method that I'm patching check and list the files again if it fails. I think that is something you could add as well: retries inside the _add_external_files method itself, so that it retries calling StorageManager.exists_file, because that appears to be the main point of failure in this case. It's not a failure caused by ClearML per se, but that is where the rest of the process breaks: the file is not added despite existing, just because the server refuses the request. So if there could be some way to retry that bit (a configurable number of times and with a configurable delay, of course) or similar, that would also be great. @<1523701435869433856:profile|SmugDolphin23>
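A rough sketch of that retry idea, written as a standalone helper rather than as a patch to ClearML internals. The name `exists_with_retries` and its parameters are hypothetical; `check` stands in for whatever existence check is being wrapped (for example StorageManager.exists_file).

```python
import time
from typing import Callable


def exists_with_retries(
    check: Callable[[str], bool],  # e.g. StorageManager.exists_file
    url: str,
    attempts: int = 5,
    delay_seconds: float = 2.0,
) -> bool:
    """Re-run `check(url)` until it returns True or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            if check(url):
                return True
        except Exception:
            # Treat a raised error like a negative answer and retry.
            pass
        if attempt < attempts:  # no sleep after the final attempt
            time.sleep(delay_seconds)
    return False
```

One trade-off worth noting: since a throttled request and a genuinely missing file both look like "not found" here, a file that really is missing costs the full number of attempts and delays before the helper gives up.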
Also, even using AWS_MAX_ATTEMPTS and AWS_RETRY_MODE did not help. I had set AWS_MAX_ATTEMPTS to 1024 and it still failed, so I assume this boto3 configuration unfortunately doesn't really help at all. Maybe it's because the adaptive mode I was using is still technically experimental and wasn't really doing anything. I don't know, I just know that it fails.
Hi @<1724235687256920064:profile|LonelyFly9> ! I think that just adding some retries to exists_file is a good idea, so maybe we will do just that 👍