Hi PanickyMoth78 ! This will likely not make it into 1.9.0 (this will be the next version we release, most likely before Christmas). We will try to get the fix out in 1.9.1
I ran another version of the above code whereoutput_uri="./random_dataset_local_target"
(i.e. db target on local disk instead of gcp).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folder takes less than 1 minute (so disk io is not the bottleneck).
Hi PanickyMoth78 , upload, as far as I know, is handled directly by the google cloud python package, let me see what we can find out about it
It seems we can perhaps set a chunk size for large uploads ( https://github.com/googleapis/google-cloud-python/issues/5088 )
would setting the max_workers to 1 be a (slower) workaround?
Hi. Just a reminder that I'd love to know if / when this issue is looked at
Q: is there an equivalent env var for sdk.google.storage.pool_connections/pool_maxsize
? My jobs are running remotely and not within a clearml agent at the moment so they get clearml config through env vars.
PanickyMoth78 You might also want to set some lower values for sdk.google.storage.pool_connections/pool_maxsize
in your clearml.conf
. Newer clearml version set max_workers
to 1 by default, and the number of connections should be tweaked using these values. If it doesn't help, please let us know
That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores but looking at the log it does seem like it's doing one chunk every 5 minutes (long time for 500mb upload for a node running in gcp...)
PanickyMoth78 there is no env var for sdk.google.storage.pool_connections/pool_maxsize
. We will likely add these env vars in a future release.
Yes, setting max_workers to 1 would not make a difference. The docs look a bit off, but it is specified that 1: if the upload destination is a cloud provider ('s3', 'gs', 'azure')
.
I'm thinking now that the memory issue might also be cause because of the fact that we prepare the zips in the background. Maybe a higher max_workers
would consume the zips faster. Might be counter intuitive, but I would try setting max_workers
to a higher number.
Hi PanickyMoth78 ! I ran the script and yes, it does take a lot more memory than it should. There is likely a memory leak somewhere in our code. We will keep you updated
I tried playing with those parameters on my laptop to no great effect.
Here is code you can use to reproduce the issue:
` import os
from pathlib import Path
from tqdm import tqdm
from clearml import Dataset, Task
def dataset_upload_test(project_id:str, bucket_name:str
):
def _random_file(fpath, sizekb):
fileSizeInBytes = 1024 * sizekb
with open(fpath, "wb") as fout:
fout.write(os.urandom(fileSizeInBytes))
def random_dataset(dataset_path, num_files, file_size_kb, num_per_part):
dataset_path = Path(dataset_path)
for i_file in tqdm(range(num_files)):
fpath = (
dataset_path / f"{int(i_file/num_per_part):05}" / f"f_{i_file:03}.bin"
)
fpath.parent.mkdir(exist_ok=True, parents=True)
_random_file(fpath, file_size_kb)
project_name = "lavi_upload_test"
task_name = "test_upload_01"
task = Task.init(project_name=project_name, task_name=task_name)
dataset_path = Path("random_dataset")
# the next line will generate (2 million non-compressible files with total size ~7.7GB)
random_dataset(dataset_path, 2_000_000, 3, num_per_part=1000)
dataset = Dataset.create(
dataset_name=task_name,
dataset_project=project_name,
dataset_version="0.2",
output_uri="gs://" + bucket_name,
description="test dataset upload",
use_current_task=True,
)
dataset.add_files(dataset_path)
dataset.upload()
dataset.finalize()
task.close()
dataset_upload_test("<your-gcp-project>", "<your-gcs-bucket-name>") `
PanickyMoth78 Something is definitely wrong here. The fix doesn't seem to be trivial as well... we will prioritize this for the next version