Hi PanickyMoth78 ! This will likely not make it into 1.9.0 (this will be the next version we release, most likely before Christmas). We will try to get the fix out in 1.9.1
Hi PanickyMoth78 ! I ran the script and yes, it does take a lot more memory than it should. There is likely a memory leak somewhere in our code. We will keep you updated
That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a 500 MB upload from a node running in GCP...)
I ran another version of the above code where output_uri="./random_dataset_local_target"
(i.e. db target on local disk instead of gcp).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folder takes less than 1 minute (so disk io is not the bottleneck).
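For anyone who wants to reproduce the "hashing vs. plain copy" comparison on their own disk, here is a minimal baseline sketch. It is not ClearML code: the function name `time_hash_and_copy` is made up, and sha256 is an arbitrary choice for illustration, not a claim about which hash ClearML computes.

```python
import hashlib
import os
import shutil
import tempfile
import time
from pathlib import Path


def time_hash_and_copy(src_dir, dst_dir):
    """Return (hash_seconds, copy_seconds, digests) for all files under src_dir.

    dst_dir must not exist yet, because shutil.copytree creates it.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    files = sorted(p for p in src_dir.rglob("*") if p.is_file())

    # Time hashing every file (sha256 picked only as a baseline).
    start = time.perf_counter()
    digests = {}
    for p in files:
        key = p.relative_to(src_dir).as_posix()
        digests[key] = hashlib.sha256(p.read_bytes()).hexdigest()
    hash_seconds = time.perf_counter() - start

    # Time a plain recursive copy of the same tree.
    start = time.perf_counter()
    shutil.copytree(src_dir, dst_dir)
    copy_seconds = time.perf_counter() - start

    return hash_seconds, copy_seconds, digests


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "random_dataset"
        src.mkdir()
        for i in range(100):
            (src / f"f_{i:03}.bin").write_bytes(os.urandom(3 * 1024))
        h, c, _ = time_hash_and_copy(src, Path(tmp) / "copy")
        print(f"hash: {h:.3f}s, copy: {c:.3f}s")
```

If these two numbers are both small compared with what `dataset.add_files()` and `dataset.upload()` take on the same folder, that would support the point that disk I/O is not the bottleneck.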
PanickyMoth78 there is no env var for sdk.google.storage.pool_connections/pool_maxsize. We will likely add these env vars in a future release.
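Until such env vars exist, one possible workaround for jobs that only get their config through the environment is to generate a small config file at startup and point `CLEARML_CONFIG_FILE` at it. The sketch below is an assumption-heavy illustration: the variable names `MY_GCS_POOL_CONNECTIONS` and `MY_GCS_POOL_MAXSIZE` are hypothetical (ClearML does not read them), and whether the rest of your settings keep coming from your existing env vars alongside this file is something to verify on your setup.

```python
import os
import tempfile
from pathlib import Path

# Hypothetical variable names, invented for this sketch; ClearML does not read them.
POOL_CONNECTIONS_VAR = "MY_GCS_POOL_CONNECTIONS"
POOL_MAXSIZE_VAR = "MY_GCS_POOL_MAXSIZE"


def write_pool_config(conf_path, pool_connections, pool_maxsize):
    """Write a minimal clearml.conf-style file holding only the two pool settings."""
    conf_path = Path(conf_path)
    conf_path.write_text(
        "sdk {\n"
        "    google.storage {\n"
        f"        pool_connections: {int(pool_connections)}\n"
        f"        pool_maxsize: {int(pool_maxsize)}\n"
        "    }\n"
        "}\n"
    )
    return conf_path


def configure_from_env(conf_dir, environ=os.environ):
    """Read the two hypothetical variables and set CLEARML_CONFIG_FILE accordingly.

    Returns the generated file path, or None if either variable is missing.
    """
    connections = environ.get(POOL_CONNECTIONS_VAR)
    maxsize = environ.get(POOL_MAXSIZE_VAR)
    if connections is None or maxsize is None:
        return None
    path = write_pool_config(Path(conf_dir) / "clearml.conf", connections, maxsize)
    environ["CLEARML_CONFIG_FILE"] = str(path)
    return path


if __name__ == "__main__":
    # Call this before importing clearml so the file is in place when it loads.
    print(configure_from_env(tempfile.mkdtemp()))
```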
Yes, setting max_workers to 1 would not make a difference. The docs look a bit off, but they do specify that the default is 1 if the upload destination is a cloud provider ('s3', 'gs', 'azure').
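To make the documented default concrete, here is a small sketch of the rule as described in this thread: 1 worker for cloud destinations, otherwise the number of cores. It is not ClearML's implementation; `default_max_workers` and `CLOUD_SCHEMES` are names made up for the example.

```python
import os
from urllib.parse import urlparse

# Cloud providers named in the docs snippet quoted in this thread.
CLOUD_SCHEMES = {"s3", "gs", "azure"}


def default_max_workers(output_url):
    """Mirror the documented default: 1 for cloud targets, else the core count."""
    scheme = urlparse(str(output_url)).scheme.lower()
    if scheme in CLOUD_SCHEMES:
        return 1
    # os.cpu_count() can return None, so fall back to a single worker.
    return os.cpu_count() or 1


if __name__ == "__main__":
    for target in ("gs://my-bucket", "./random_dataset_local_target"):
        print(target, "->", default_max_workers(target))
```

Under this rule the gs:// job above was already running with a single worker, which is why explicitly passing 1 would change nothing.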
I'm now thinking that the memory issue might also be caused by the fact that we prepare the zips in the background. Maybe a higher max_workers would consume the zips faster. It might be counterintuitive, but I would try setting max_workers to a higher number.
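The hypothesis (zips prepared in the background pile up when uploads lag behind) can be illustrated with a toy model. This is a deterministic tick-based simulation, not ClearML code, and every name and number in it is made up: one zip is produced every `produce_every` ticks and each worker needs `upload_takes` ticks per zip.

```python
def peak_backlog(num_zips, produce_every, upload_takes, workers):
    """Return the peak number of zips held at once (waiting plus in flight)."""
    waiting = 0    # zips prepared but not yet picked up by a worker
    busy = []      # remaining ticks for each upload currently in flight
    produced = 0
    peak = 0
    tick = 0
    while produced < num_zips or waiting or busy:
        # Uploads with one tick left finish now; the rest count down.
        busy = [remaining - 1 for remaining in busy if remaining > 1]
        # The background producer emits a new zip on its schedule.
        if produced < num_zips and tick % produce_every == 0:
            waiting += 1
            produced += 1
        # Idle workers pick up waiting zips.
        while waiting and len(busy) < workers:
            waiting -= 1
            busy.append(upload_takes)
        peak = max(peak, waiting + len(busy))
        tick += 1
    return peak


if __name__ == "__main__":
    for workers in (1, 2, 4):
        print(workers, "worker(s): peak backlog =", peak_backlog(20, 1, 4, workers))
```

In this model the backlog keeps growing while a single worker is slower than the producer, and stays bounded once there are enough workers to keep up. Whether the real upload path behaves this way is exactly what trying a higher max_workers would test.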
I tried playing with those parameters on my laptop to no great effect.
Here is code you can use to reproduce the issue:
```
import os
from pathlib import Path

from tqdm import tqdm

from clearml import Dataset, Task


def dataset_upload_test(project_id: str, bucket_name: str):
    def _random_file(fpath, sizekb):
        fileSizeInBytes = 1024 * sizekb
        with open(fpath, "wb") as fout:
            fout.write(os.urandom(fileSizeInBytes))

    def random_dataset(dataset_path, num_files, file_size_kb, num_per_part):
        dataset_path = Path(dataset_path)
        for i_file in tqdm(range(num_files)):
            fpath = (
                dataset_path / f"{i_file // num_per_part:05}" / f"f_{i_file:03}.bin"
            )
            fpath.parent.mkdir(exist_ok=True, parents=True)
            _random_file(fpath, file_size_kb)

    project_name = "lavi_upload_test"
    task_name = "test_upload_01"
    task = Task.init(project_name=project_name, task_name=task_name)

    dataset_path = Path("random_dataset")
    # the next line will generate (2 million non-compressible files with total size ~7.7GB)
    random_dataset(dataset_path, 2_000_000, 3, num_per_part=1000)

    dataset = Dataset.create(
        dataset_name=task_name,
        dataset_project=project_name,
        dataset_version="0.2",
        output_uri="gs://" + bucket_name,
        description="test dataset upload",
        use_current_task=True,
    )
    dataset.add_files(dataset_path)
    dataset.upload()
    dataset.finalize()
    task.close()


dataset_upload_test("<your-gcp-project>", "<your-gcs-bucket-name>")
```
would setting the max_workers to 1 be a (slower) workaround?
Hi PanickyMoth78 , upload, as far as I know, is handled directly by the google cloud python package, let me see what we can find out about it
PanickyMoth78 You might also want to set some lower values for sdk.google.storage.pool_connections/pool_maxsize in your clearml.conf. Newer clearml versions set max_workers to 1 by default, and the number of connections should be tweaked using these values. If it doesn't help, please let us know
PanickyMoth78 Something is definitely wrong here. The fix doesn't seem to be trivial either... we will prioritize this for the next version
It seems we can perhaps set a chunk size for large uploads ( https://github.com/googleapis/google-cloud-python/issues/5088 )
Hi. Just a reminder that I'd love to know if / when this issue is looked at
Q: is there an equivalent env var for sdk.google.storage.pool_connections/pool_maxsize? My jobs are running remotely and not within a clearml agent at the moment, so they get clearml config through env vars.