Hello, Im Having Huge Performance Issues On Large Clearml Datasets How Can I Link To Parent Dataset Without Parent Dataset Files. I Want To Create A Smaller Subset Of Parent Dataset, Like 5% Of It. To Achieve This, I Have To Call Remove_Files() To 60K It

Answered

Hello, I'm having huge performance issues with large ClearML Datasets.

How can I link to a parent dataset without inheriting the parent dataset's files? I want to create a smaller subset of the parent dataset, around 5% of it. To achieve this, I have to call remove_files() on 60K items, which is slow, and it can't even take a list, just a SINGLE file.
It would make more sense to just add these 5% to a new dataset rather than remove 95% from the parent.

remove_files() takes 2 minutes to complete and I have 60K items to remove.

  				
Posted one year ago by AmiableSeaturtle81


Answers 14

@HomelyBluewhale47 We have the same problem: millions of files, stored on Ceph. I would not recommend doing it this way. Everything gets insanely slow (dataset.list_files, downloading the dataset, removing files).

The way I now use ClearML Datasets for a large number of samples is to save a JSON that stores the paths to all samples in the Dataset metadata:
clearml_dataset.set_metadata(metadata, metadata_name=metadata_key)

However, this means you need wrappers to download the dataset.
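
A minimal sketch of the manifest approach described above, assuming the samples already live in object storage and only their paths are recorded in the dataset. The helper names (`build_manifest`, `attach_manifest`, `load_manifest`) and the `"sample_manifest"` key are made up for illustration. `set_metadata` and `get_metadata` are the ClearML `Dataset` methods, `dataset` is expected to be a `Dataset` instance, and `get_metadata` is assumed to hand back the stored dict.

```python
import json


def build_manifest(sample_paths):
    """Return a JSON-serializable manifest listing every sample path once, sorted."""
    unique_paths = sorted(set(sample_paths))
    return {"num_samples": len(unique_paths), "samples": unique_paths}


def attach_manifest(dataset, manifest, metadata_key="sample_manifest"):
    """Store the manifest as dataset metadata instead of registering each file."""
    json.dumps(manifest)  # fail early if something is not JSON-serializable
    dataset.set_metadata(manifest, metadata_name=metadata_key)


def load_manifest(dataset, metadata_key="sample_manifest"):
    """Read the manifest back; fetching the listed files is left to your own wrapper."""
    return dataset.get_metadata(metadata_name=metadata_key)
```

The download wrapper then only has to iterate over `manifest["samples"]` with whatever storage client is fastest in your setup.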

  				
Posted one year ago by AmiableSeaturtle81

And my impression is that pulling data with a dedicated client, e.g. minio-client for MinIO, is faster than going through ClearML.

  				
Posted one year ago by HomelyBluewhale47

Thanks a lot @SmugDolphin23
Will go through this one.
Your quick reply really means a lot. Thanks again!!

  				
Posted one year ago by HomelyBluewhale47

I understand your problem well, @AmiableSeaturtle81.
Please let me know if you have an answer to my problem. 😄

  				
Posted one year ago by HomelyBluewhale47

Hi @AmiableSeaturtle81! Looks like remove_files indeed doesn't support lists. It does support paths with wildcards though, if that helps.
As a workaround for now, I would remove all the files from the dataset and add back only the ones you need, or just create a new dataset.
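
The "just create a new dataset" option could look roughly like the sketch below. It assumes the parent's files are available in a local folder (for example from `Dataset.get(...).get_local_copy()`), and the helper names `pick_subset`, `stage_subset` and `create_subset_dataset` are hypothetical. The chosen files are copied into a staging folder so that `add_files` is called once for the whole folder rather than once per file. Note that a dataset created this way has no parent, so the lineage link to the original dataset is lost.

```python
import random
import shutil
from pathlib import Path


def pick_subset(relative_paths, fraction=0.05, seed=0):
    """Pick a reproducible random sample of the given relative paths."""
    paths = sorted(relative_paths)
    count = round(len(paths) * fraction)
    return sorted(random.Random(seed).sample(paths, count))


def stage_subset(source_root, relative_paths, staging_dir):
    """Copy the chosen files into staging_dir, keeping their relative layout."""
    for rel in relative_paths:
        target = Path(staging_dir) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(source_root) / rel, target)
    return Path(staging_dir)


def create_subset_dataset(staging_dir, dataset_name, dataset_project):
    """Create a standalone dataset (no parent) from the staged files."""
    from clearml import Dataset  # imported lazily; the helpers above do not need it

    subset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    subset.add_files(path=str(staging_dir))  # one call for the whole folder
    subset.upload()
    subset.finalize()
    return subset
```

Copying costs extra disk space for the staged files, which is small for a 5% subset compared with issuing tens of thousands of removals.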

  				
Posted one year ago by SmugDolphin23

Every time I start training, I first download the dataset using the MinIO client and then do further operations.

Can we integrate MinIO easily with ClearML?

  				
Posted one year ago by HomelyBluewhale47

@HomelyBluewhale47 you should be able to upload the images and download them without a problem. You could also store your files with a cloud provider, such as S3, if you believe it would speed things up.

  				
Posted one year ago by SmugDolphin23

Yes, this is what I'm doing currently. But I'm struggling to manage the versions and all.

  				
Posted one year ago by HomelyBluewhale47

And just FYI: a single dataset can exceed 80 GB.

  				
Posted one year ago by HomelyBluewhale47

Otherwise, you could run this as a hack:

        # files_to_remove: the dataset paths to drop (you need to define this)
        dataset._dataset_file_entries = {
            k: v
            for k, v in dataset._dataset_file_entries.items()
            if k not in files_to_remove
        }

Then call dataset.remove_files with a path that doesn't exist in the dataset.
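
The same hack wrapped in a function, as a sketch only. It relies on the private `_dataset_file_entries` attribute from the snippet above, which is not a public API and may change between ClearML versions. The function name `drop_entries` and the placeholder path are hypothetical. Converting the list to a set keeps the membership check cheap when tens of thousands of paths are removed.

```python
def drop_entries(dataset, files_to_remove, missing_path="__no_such_file__"):
    """Drop many files at once by rewriting the dataset's private file table.

    Returns the number of entries that were actually removed.
    """
    to_remove = set(files_to_remove)  # O(1) lookups instead of scanning a list
    before = len(dataset._dataset_file_entries)
    dataset._dataset_file_entries = {
        key: entry
        for key, entry in dataset._dataset_file_entries.items()
        if key not in to_remove
    }
    # As suggested above, follow up with a remove_files call on a path
    # that does not exist in the dataset.
    dataset.remove_files(dataset_path=missing_path)
    return before - len(dataset._dataset_file_entries)
```

Comparing the returned count with `len(files_to_remove)` is a quick way to spot paths that did not match any entry.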

  				
Posted one year ago by SmugDolphin23

@AmiableSeaturtle81
Yes, I can feel that slowness too.

But what if we have millions of PDFs and need to generate an image dataset along with their bounding boxes? We cannot generate it on the fly while training the model, as that would be very, very slow.
So I do all the preprocessing, then upload the preprocessed data back to storage.
And while training, I pull the preprocessed dataset from storage again. 😄

  				
Posted one year ago by HomelyBluewhale47

Hi @SmugDolphin23
This is a follow-up question from my side.

Is it efficient to use ClearML Data for image datasets with hundreds of thousands of images? I'm concerned about performance.

Basically, I'm training a transformers classifier, which needs at least 300-400K images.

  				
Posted one year ago by HomelyBluewhale47

Yes, see minio instructions under this: None

  				
Posted one year ago by SmugDolphin23

You can check out the boto3 Python client (this is what we use to download/upload all S3 stuff), but minio-client probably already uses it under the hood.
We also use the AWS CLI for some downloads; it is way faster than Python.

Regarding the PDFs, yes, you have no choice but to preprocess them.
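
For the boto3 route, a rough sketch of downloading everything under a prefix might look like this. The function names, bucket, prefix and endpoint are placeholders, credentials are assumed to come from the usual boto3 environment or config files, and `prefix` is assumed to be a folder-style prefix (not a partial file name). Passing `endpoint_url` is how boto3 is pointed at an S3-compatible server such as MinIO.

```python
from pathlib import Path, PurePosixPath


def local_path_for(key, prefix, dest_dir):
    """Map an object key to a local path under dest_dir, dropping the prefix."""
    relative = PurePosixPath(key).relative_to(PurePosixPath(prefix))
    return Path(dest_dir).joinpath(*relative.parts)


def download_prefix(bucket, prefix, dest_dir, endpoint_url=None):
    """Download every object under prefix into dest_dir, one file at a time."""
    import boto3  # imported lazily; local_path_for works without it

    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith("/"):
                continue  # skip folder placeholder objects
            target = local_path_for(key, prefix, dest_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            s3.download_file(bucket, key, str(target))
```

This version is sequential, so it will not match the AWS CLI on throughput; downloading in parallel (for example with a thread pool) would be the next step.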

  				
Posted one year ago by AmiableSeaturtle81
