@SmugDolphin23 Thanks for the response! Configuring those env vars seems to help, but even with adaptive mode and 32 or 64 max attempts it still fails at some point. Granted, I was using 8 workers and uploading all 1.7M files in a single call to add_external_files, but I would have expected adaptive mode to, well, adapt to that, especially with that many attempts. For now I'm back to sending the files in batches, this time 50k files per batch (about 34 batches in total) with a minute of sleep between batches, and adaptive mode, or at least the max attempts setting, does seem to help a lot here.
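
For reference, the batching workaround looks roughly like the sketch below. iter_batches and add_in_batches are my own helper names, not part of clearml, and add_fn stands in for whatever actually registers a batch (in my case a call to add_external_files on the dataset).

```python
import time
from typing import Callable, Iterator, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def add_in_batches(
    urls: Sequence[str],
    add_fn: Callable[[list], object],
    batch_size: int = 50_000,
    pause_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call add_fn once per batch, pausing between batches (not after the last one)."""
    n_batches = 0
    for batch in iter_batches(urls, batch_size):
        if n_batches > 0:
            # Give the storage server some time to recover before the next burst.
            sleep(pause_seconds)
        add_fn(list(batch))
        n_batches += 1
    return n_batches
```

With a clearml dataset, add_fn would be something like `lambda batch: dataset.add_external_files(source_url=batch, max_workers=8)`, assuming source_url accepts a list of URLs in the installed version. The sleep parameter is only there so the pause can be swapped out when testing.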
However, it would be nice if there were a way to specify rate limits in clearml, because it seems to me that the main issue is that it sends a bunch of .exists requests and then .get_metadata requests if the file is found. When the server rejects the existence check, though, it just tries to list the path as if it were a directory, which it isn't, and so in the end the whole thing falls apart.
I would propose a rather naive (and seemingly simple) solution: a feature that lets one specify the rate limit in the add_external_files method. Suppose it's expressed in requests per second:
def add_external_files(self, ..., requests_per_second: int | None = None):
    if requests_per_second is not None:
        if max_workers is None:
            # ThreadPoolExecutor docs, changed in version 3.8:
            # default value of max_workers is min(32, os.cpu_count() + 4).
            # os.cpu_count() can return None, hence the "or 1".
            estimated_max_workers = min(32, (os.cpu_count() or 1) + 4)
        else:
            estimated_max_workers = max_workers
        # Each task must last at least this long, so that all workers
        # together start at most requests_per_second tasks per second.
        sleep_time = estimated_max_workers / requests_per_second
    else:
        sleep_time = None
    ...
    with ThreadPoolExecutor(...) as tp:
        for ...:
            ...(
                tp.submit(
                    self._add_external_files,
                    ...,
                    sleep_time=sleep_time,
                )
            )

def _add_external_files(self, ..., sleep_time: float | None = None):
    start_time = time.perf_counter()
    ...
    if sleep_time is not None:
        # Sleep off whatever is left of this task's time slot.
        total_time = time.perf_counter() - start_time
        remaining_sleep_time = sleep_time - total_time
        if remaining_sleep_time > 0:
            time.sleep(remaining_sleep_time)
I feel as though an addition like this one would be rather beneficial to some other people as well.
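
To try the pacing idea outside of clearml, here is a minimal standalone sketch of the same scheme. The names per_task_sleep_time, paced and run_paced are made up for illustration and are not clearml APIs; the tasks are plain callables standing in for the per-file work.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def per_task_sleep_time(requests_per_second: float, max_workers: Optional[int] = None) -> float:
    """Minimum duration of one task so all workers together stay at or below the limit."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    if max_workers is None:
        # Mirrors the documented ThreadPoolExecutor default (Python 3.8+);
        # newer Python versions may count CPUs slightly differently.
        max_workers = min(32, (os.cpu_count() or 1) + 4)
    return max_workers / requests_per_second


def paced(fn: Callable[[], object], sleep_time: Optional[float]) -> object:
    """Run fn, then sleep so that the whole call lasts at least sleep_time seconds."""
    start = time.perf_counter()
    result = fn()
    if sleep_time is not None:
        remaining = sleep_time - (time.perf_counter() - start)
        if remaining > 0:
            time.sleep(remaining)
    return result


def run_paced(
    tasks: Iterable[Callable[[], object]],
    requests_per_second: float,
    max_workers: int,
) -> List[object]:
    """Run all tasks on a thread pool, each one padded to its time slot."""
    sleep_time = per_task_sleep_time(requests_per_second, max_workers)
    with ThreadPoolExecutor(max_workers=max_workers) as tp:
        futures = [tp.submit(paced, task, sleep_time) for task in tasks]
        return [f.result() for f in futures]
```

Note that this only bounds how many tasks start per second. If a single task makes more than one request (say an .exists followed by a .get_metadata), the actual request rate can be higher than requests_per_second, and all workers still fire at once at the very start.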
Let me know what you think or if you have any other suggestions on how to handle this!
Thanks!