ClearML FAQ | Hi. I Have A Job That Processes Images And Creates ~5 Gb Of Processed Image Files (Lots Of Small Ones). At The End

Answered


Hi.
I have a job that processes images and creates ~5 GB of processed image files (lots of small ones).
At the end - it creates a clearml.Dataset and performs add + upload + finalize.
It looks like upload is using up a lot of memory. Does that make sense?
These are the last messages before the node ran out of memory:
2022-12-07 23:19:45 Hash generation completed
2022-12-07 23:22:32 Uploading dataset files: {'show_progress': True, 'verbose': False, 'output_url': None, 'compression': None}
2022-12-07 23:26:46 Uploading dataset changes (180910 files compressed to 546.3 MiB) to gs://*****
2022-12-07 23:31:44 Uploading dataset changes (228446 files compressed to 555.26 MiB) to gs://*****
2022-12-07 23:36:41 Uploading dataset changes (264574 files compressed to 562.09 MiB) to gs://*****
2022-12-07 23:42:30 Uploading dataset changes (302730 files compressed to 569.27 MiB) to gs://*****
2022-12-07 23:49:04 Uploading dataset changes (348083 files compressed to 577.84 MiB) to gs://*****
2022-12-07 23:56:09 Uploading dataset changes (410295 files compressed to 589.6 MiB) to gs://*****

  				
Posted 2 years ago by PanickyMoth78


Answers 14

Q: Is there an equivalent env var for sdk.google.storage.pool_connections/pool_maxsize? My jobs are running remotely and not within a clearml agent at the moment, so they get their clearml config through env vars.

  				
Posted 2 years ago by PanickyMoth78

Hi PanickyMoth78, as far as I know, the upload is handled directly by the Google Cloud Python package. Let me see what we can find out about it.

  				
Posted 2 years ago by SuccessfulKoala55

PanickyMoth78 Something is definitely wrong here. The fix doesn't seem to be trivial either... We will prioritize this for the next version.

  				
Posted 2 years ago by SmugDolphin23

That sounds like a possible approach

  				
Posted 2 years ago by SuccessfulKoala55

I tried playing with those parameters on my laptop to no great effect.

Here is code you can use to reproduce the issue:

```python
import os
from pathlib import Path

from tqdm import tqdm
from clearml import Dataset, Task


def dataset_upload_test(project_id: str, bucket_name: str):
    # Note: project_id is accepted but not used anywhere below.

    def _random_file(fpath, size_kb):
        # Write size_kb KiB of random (non-compressible) bytes to fpath.
        with open(fpath, "wb") as fout:
            fout.write(os.urandom(1024 * size_kb))

    def random_dataset(dataset_path, num_files, file_size_kb, num_per_part):
        # Spread the files over sub-folders holding num_per_part files each.
        dataset_path = Path(dataset_path)
        for i_file in tqdm(range(num_files)):
            fpath = (
                dataset_path / f"{i_file // num_per_part:05}" / f"f_{i_file:03}.bin"
            )
            fpath.parent.mkdir(exist_ok=True, parents=True)
            _random_file(fpath, file_size_kb)

    project_name = "lavi_upload_test"
    task_name = "test_upload_01"
    task = Task.init(project_name=project_name, task_name=task_name)

    dataset_path = Path("random_dataset")
    # the next line will generate (2 million non-compressible files with total size ~7.7GB)
    random_dataset(dataset_path, 2_000_000, 3, num_per_part=1000)
    dataset = Dataset.create(
        dataset_name=task_name,
        dataset_project=project_name,
        dataset_version="0.2",
        output_uri="gs://" + bucket_name,
        description="test dataset upload",
        use_current_task=True,
    )
    dataset.add_files(dataset_path)
    dataset.upload()
    dataset.finalize()
    task.close()


dataset_upload_test("<your-gcp-project>", "<your-gcs-bucket-name>")
```

  				
Posted 2 years ago by PanickyMoth78

It seems we can perhaps set a chunk size for large uploads (https://github.com/googleapis/google-cloud-python/issues/5088).

  				
Posted 2 years ago by SuccessfulKoala55

PanickyMoth78 there is no env var for sdk.google.storage.pool_connections/pool_maxsize. We will likely add these env vars in a future release.
Yes, setting max_workers to 1 would not make a difference. The docs look a bit off, but they do specify that max_workers defaults to 1 if the upload destination is a cloud provider ('s3', 'gs', 'azure').
I'm now thinking that the memory issue might also be caused by the fact that we prepare the zips in the background. Maybe a higher max_workers would consume the zips faster. It might be counterintuitive, but I would try setting max_workers to a higher number.
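
For reference, a minimal sketch of what trying a higher max_workers could look like. It assumes `dataset` is a `clearml.Dataset` that already had `add_files()` called on it; the helper name `upload_with_workers` and the value 8 are only illustrative, not a recommended setting.

```python
def upload_with_workers(dataset, max_workers=8):
    """Call dataset.upload() with an explicit number of worker threads.

    `dataset` is expected to be a clearml.Dataset (or any object exposing a
    compatible upload() method). The default of 8 is an arbitrary example.
    """
    kwargs = {"show_progress": True, "max_workers": max_workers}
    dataset.upload(**kwargs)
    # Return the arguments that were used, which is handy for logging.
    return kwargs


# Hypothetical usage inside the reproduction script, replacing dataset.upload():
#   dataset.add_files(dataset_path)
#   upload_with_workers(dataset, max_workers=8)
#   dataset.finalize()
```

Whether this actually lowers memory usage is untested here; it only makes the experiment easy to repeat with different values.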

  				
Posted 2 years ago by SmugDolphin23

Would setting max_workers to 1 be a (slower) workaround?

  				
Posted 2 years ago by PanickyMoth78

Hi. Just a reminder that I'd love to know if / when this issue is looked at

  				
Posted 2 years ago by PanickyMoth78

PanickyMoth78 You might also want to set some lower values for sdk.google.storage.pool_connections/pool_maxsize in your clearml.conf. Newer clearml versions set max_workers to 1 by default, and the number of connections should be tweaked using these values. If it doesn't help, please let us know.

  				
Posted 2 years ago by SmugDolphin23

That job was using clearml 1.8.3, so I take it that setting max_workers to 1 would not make a difference?
Looking at the docs:
https://clear.ml/docs/latest/docs/references/sdk/dataset/#upload
they say that max_workers = number of cores, but looking at the log it does seem like it's doing one chunk every 5 minutes (a long time for a 500 MB upload from a node running in GCP...)
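
To put numbers on the "one chunk every 5 minutes" impression, here is a small sketch that computes the gap between consecutive "Uploading dataset changes" lines. The timestamps are copied from the log in the question; the function name is just illustrative.

```python
from datetime import datetime

# Timestamps of the "Uploading dataset changes" lines from the log in the question.
chunk_times = [
    "2022-12-07 23:26:46",
    "2022-12-07 23:31:44",
    "2022-12-07 23:36:41",
    "2022-12-07 23:42:30",
    "2022-12-07 23:49:04",
    "2022-12-07 23:56:09",
]


def chunk_intervals_seconds(timestamps):
    """Return the number of seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


intervals = chunk_intervals_seconds(chunk_times)
print(intervals)  # [298, 297, 349, 394, 425]
```

So the gap starts at roughly 5 minutes and grows to about 7 minutes by the last logged chunk.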

  				
Posted 2 years ago by PanickyMoth78

I ran another version of the above code where
output_uri="./random_dataset_local_target"
(i.e. the dataset target is on local disk instead of GCP).
I still see large memory usage.
I also find it worrisome that while generating the random dataset and writing it to disk took under 3 minutes, generating the hash took 9 minutes and saving the files to a dataset target in an adjacent folder took 30 minutes (10 times longer than writing the original files)! Simply copying the files to an adjacent folder takes less than 1 minute (so disk I/O is not the bottleneck).
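
A scaled-down way to compare a plain copy against zipping many small files, using only the standard library. This is a rough sketch and not ClearML's code path: it assumes the upload step zips files with deflate compression, and the file count (2000) and size (3 KiB) are arbitrary stand-ins for the real dataset.

```python
import os
import shutil
import tempfile
import time
import zipfile
from pathlib import Path


def make_files(root, num_files, size_bytes):
    # Create num_files files of random (non-compressible) bytes under root.
    root.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        (root / f"f_{i:05}.bin").write_bytes(os.urandom(size_bytes))


def time_copy(src, dst):
    # Time a plain recursive copy of src into a new folder dst.
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


def time_zip(src, zip_path):
    # Time writing every file in src into one deflate-compressed zip archive.
    start = time.perf_counter()
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for fpath in sorted(src.iterdir()):
            zf.write(fpath, arcname=fpath.name)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src = tmp / "src"
    make_files(src, num_files=2000, size_bytes=3 * 1024)
    copy_s = time_copy(src, tmp / "copy")
    zip_s = time_zip(src, tmp / "files.zip")
    copied = len(list((tmp / "copy").iterdir()))
    with zipfile.ZipFile(tmp / "files.zip") as zf:
        zipped = len(zf.namelist())

print(f"copy: {copy_s:.2f}s, zip: {zip_s:.2f}s, files: {copied}/{zipped}")
```

The absolute timings depend on the machine and file system, so only the ratio between the two numbers is meaningful, and only as a baseline to compare the dataset upload against.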

  				
Posted 2 years ago by PanickyMoth78

Hi PanickyMoth78! I ran the script and yes, it does take a lot more memory than it should. There is likely a memory leak somewhere in our code. We will keep you updated.
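
For anyone who wants to measure this themselves, a minimal sketch using the standard library's tracemalloc module. It only tracks memory allocated through Python's allocator in the current process, so native allocations and child processes are not counted; the helper name and the stand-in workload are illustrative.

```python
import tracemalloc


def peak_python_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak MiB allocated by Python during the call)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / (1024 * 1024)


def allocate(n_mib):
    # Stand-in workload: hold n_mib MiB in memory, then report the buffer length.
    buf = bytearray(n_mib * 1024 * 1024)
    return len(buf)


result, peak = peak_python_memory_mib(allocate, 8)
print(f"peak: {peak:.1f} MiB")
```

In the reproduction script, the same helper could wrap the upload step, e.g. `peak_python_memory_mib(dataset.upload)`. Note that tracemalloc slows execution noticeably, so it is better suited to a reduced file count than to the full 2 million files.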

  				
Posted 2 years ago by SmugDolphin23

Hi PanickyMoth78! This will likely not make it into 1.9.0 (this will be the next version we release, most likely before Christmas). We will try to get the fix out in 1.9.1.

  				
Posted 2 years ago by SmugDolphin23
