Answered
Hi. I Have A Job That Processes Images And Creates ~5 Gb Of Processed Image Files (Lots Of Small Ones). At The End - It Creates A

Hi.
I have a job that processes images and creates ~5 GB of processed image files (lots of small ones).
At the end - it creates a clearml.Dataset and performs add + upload + finalize.
It looks like upload is using up a lot of memory. Does that make sense?
These are the last messages before the node ran out of memory:
2022-12-07 23:19:45 Hash generation completed
2022-12-07 23:22:32 Uploading dataset files: {'show_progress': True, 'verbose': False, 'output_url': None, 'compression': None}
2022-12-07 23:26:46 Uploading dataset changes (180910 files compressed to 546.3 MiB) to gs://*****
2022-12-07 23:31:44 Uploading dataset changes (228446 files compressed to 555.26 MiB) to gs://*****
2022-12-07 23:36:41 Uploading dataset changes (264574 files compressed to 562.09 MiB) to gs://*****
2022-12-07 23:42:30 Uploading dataset changes (302730 files compressed to 569.27 MiB) to gs://*****
2022-12-07 23:49:04 Uploading dataset changes (348083 files compressed to 577.84 MiB) to gs://*****
2022-12-07 23:56:09 Uploading dataset changes (410295 files compressed to 589.6 MiB) to gs://*****

  
  
Posted one year ago

Answers 14


Hi PanickyMoth78! This will likely not make it into 1.9.0, which is the next version we will release (most likely before Christmas). We will try to get the fix out in 1.9.1.

  
  
Posted one year ago

I ran another version of the above code where
output_uri="./random_dataset_local_target"
(i.e. the dataset target is on local disk instead of GCP).
I still see large memory usage.
I also find it worrisome that, while generating the random dataset and writing it to disk took under 3 minutes, generating the hashes took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folder takes less than 1 minute, so disk I/O is not the bottleneck.
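
For anyone who wants to collect comparable numbers, here is a minimal sketch of how each step could be timed together with the process's peak memory. It uses only the Python standard library; the helper name run_and_measure is hypothetical and is not part of clearml. It assumes a Unix-like system, since the resource module is not available on Windows.

```python
import resource
import time


def run_and_measure(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds, peak_rss)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # ru_maxrss is the peak resident set size of the whole process so far,
    # reported in kilobytes on Linux and in bytes on macOS.
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, elapsed, peak_rss
```

Wrapping each step in turn, e.g. run_and_measure(dataset.add_files, dataset_path) and then run_and_measure(dataset.upload), should show which step pushes the peak up. Note that the peak never decreases, so only increases between steps are meaningful.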

  
  
Posted one year ago

Hi PanickyMoth78, as far as I know the upload is handled directly by the Google Cloud Python package. Let me see what we can find out about it.

  
  
Posted one year ago

It seems we can perhaps set a chunk size for large uploads ( https://github.com/googleapis/google-cloud-python/issues/5088 )
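
As a rough illustration of what a chunk size means at the google-cloud-storage level, here is a minimal sketch. It is not how clearml performs the upload internally, and clearml may not expose this setting; the helper names round_chunk_size and upload_in_chunks and the 8 MiB default are assumptions made for the example. The library requires chunk_size to be a multiple of 256 KiB.

```python
CHUNK_MULTIPLE = 256 * 1024  # chunk_size must be a multiple of 256 KiB


def round_chunk_size(requested_bytes: int) -> int:
    """Round down to a multiple of 256 KiB, with a floor of one multiple."""
    return max(CHUNK_MULTIPLE, (requested_bytes // CHUNK_MULTIPLE) * CHUNK_MULTIPLE)


def upload_in_chunks(bucket_name, source_path, blob_name, chunk_bytes=8 * 1024 * 1024):
    """Upload one local file to GCS using a fixed chunk size (hypothetical helper)."""
    # Imported lazily so the rounding helper works without google-cloud-storage.
    from google.cloud import storage

    client = storage.Client()  # uses the default credentials of the environment
    bucket = client.bucket(bucket_name)
    # Setting chunk_size makes the client send the file in pieces of that size.
    blob = bucket.blob(blob_name, chunk_size=round_chunk_size(chunk_bytes))
    blob.upload_from_filename(source_path)
```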

  
  
Posted one year ago

Would setting max_workers to 1 be a (slower) workaround?

  
  
Posted one year ago

Hi. Just a reminder that I'd love to know if / when this issue is looked at

  
  
Posted one year ago

Q: is there an equivalent env var for sdk.google.storage.pool_connections/pool_maxsize ? My jobs are running remotely and not within a clearml agent at the moment so they get clearml config through env vars.

  
  
Posted one year ago

PanickyMoth78 You might also want to set lower values for sdk.google.storage.pool_connections/pool_maxsize in your clearml.conf. Newer clearml versions set max_workers to 1 by default, and the number of connections should be tuned using these values. If it doesn't help, please let us know.
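
For reference, lowering those settings in clearml.conf might look like the fragment below. The numbers are illustrative assumptions only, not recommended values; pick them based on how many parallel connections your node can afford.

```
sdk {
    google.storage {
        # Illustrative values (assumption): lower than the defaults
        pool_connections: 16
        pool_maxsize: 32
    }
}
```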

  
  
Posted one year ago

That job was using clearml 1.8.3 so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a ~500 MB upload from a node running in GCP...)

  
  
Posted one year ago

PanickyMoth78 there is no env var for sdk.google.storage.pool_connections/pool_maxsize. We will likely add these env vars in a future release.
Correct, setting max_workers to 1 would not make a difference. The docs look a bit off, but they do specify that the default is 1 if the upload destination is a cloud provider ('s3', 'gs', 'azure').
I'm thinking now that the memory issue might also be caused by the fact that we prepare the zips in the background. Maybe a higher max_workers would consume the zips faster. It might be counterintuitive, but I would try setting max_workers to a higher number.
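
A minimal sketch of that experiment, assuming dataset is a clearml.Dataset that already has its files added. The helper name upload_and_finalize and the value 8 are hypothetical; max_workers and show_progress are the upload arguments mentioned in this thread.

```python
def upload_and_finalize(dataset, max_workers=8):
    """Upload with an explicit worker count, then finalize the dataset."""
    # A higher max_workers may drain the prepared zips faster (untested guess).
    dataset.upload(show_progress=True, max_workers=max_workers)
    dataset.finalize()
```

Trying a few values (for example 2, 4, 8) while watching the node's memory should show whether the worker count has any effect.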

  
  
Posted one year ago

That sounds like a possible approach

  
  
Posted one year ago

Hi PanickyMoth78! I ran the script and yes, it does take a lot more memory than it should. There is likely a memory leak somewhere in our code. We will keep you updated.

  
  
Posted one year ago

I tried playing with those parameters on my laptop to no great effect.

Here is code you can use to reproduce the issue:

```
import os
from pathlib import Path

from tqdm import tqdm
from clearml import Dataset, Task


def dataset_upload_test(project_id: str, bucket_name: str):
    # Note: project_id is accepted but not used below.
    def _random_file(fpath, sizekb):
        # Write sizekb KiB of random (non-compressible) bytes to fpath.
        file_size_in_bytes = 1024 * sizekb
        with open(fpath, "wb") as fout:
            fout.write(os.urandom(file_size_in_bytes))

    def random_dataset(dataset_path, num_files, file_size_kb, num_per_part):
        # Spread the files over sub-folders holding num_per_part files each.
        dataset_path = Path(dataset_path)
        for i_file in tqdm(range(num_files)):
            fpath = (
                dataset_path / f"{i_file // num_per_part:05}" / f"f_{i_file:03}.bin"
            )
            fpath.parent.mkdir(exist_ok=True, parents=True)
            _random_file(fpath, file_size_kb)

    project_name = "lavi_upload_test"
    task_name = "test_upload_01"
    task = Task.init(project_name=project_name, task_name=task_name)

    dataset_path = Path("random_dataset")
    # the next line will generate (2 million non-compressible files with total size ~7.7GB)
    random_dataset(dataset_path, 2_000_000, 3, num_per_part=1000)
    dataset = Dataset.create(
        dataset_name=task_name,
        dataset_project=project_name,
        dataset_version="0.2",
        output_uri="gs://" + bucket_name,
        description="test dataset upload",
        use_current_task=True,
    )
    dataset.add_files(dataset_path)
    dataset.upload()
    dataset.finalize()
    task.close()


dataset_upload_test("<your-gcp-project>", "<your-gcs-bucket-name>")
```

  
  
Posted one year ago

PanickyMoth78 Something is definitely wrong here. The fix doesn't seem to be trivial either... we will prioritize this for the next version.

  
  
Posted one year ago