Answered
Hello, Im Having Huge Performance Issues On Large Clearml Datasets How Can I Link To Parent Dataset Without Parent Dataset Files. I Want To Create A Smaller Subset Of Parent Dataset, Like 5% Of It. To Achieve This, I Have To Call Remove_Files() To 60K It

Hello, I'm having huge performance issues on large ClearML Datasets.

How can I link to a parent dataset without the parent dataset's files? I want to create a smaller subset of the parent dataset, about 5% of it. To achieve this, I have to call remove_files() on 60K items, which is slow; it can't even take a list, just a SINGLE file???
It would make more sense to just add these 5% to a new dataset rather than remove 95% from the parent.

remove_files() takes 2 minutes to complete and I have 60K items to remove.

  
  
Posted 5 months ago

Answers 14


You can check out boto3 python client (This is what we use to download / upload all S3 stuff), but minio-client probably already uses it under the hood.
We also use aws cli to do some downloading, it is way faster than python.

Regarding PDFs, yes, you have no choice but to preprocess them.
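
For the boto3 route mentioned above, here is a minimal sketch of a parallel download helper. It assumes you pass in a client created with `boto3.client("s3", ...)` and only relies on its `download_file(Bucket, Key, Filename)` call; the helper names `local_path_for` and `download_keys`, the worker count, and the folder layout are made up for illustration and are not part of ClearML or boto3.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def local_path_for(key, dest_dir):
    """Map an object key to a path under dest_dir, keeping the key's folder layout."""
    return os.path.join(dest_dir, *key.split("/"))


def download_keys(client, bucket, keys, dest_dir, workers=16):
    """Download the given keys in parallel threads (hypothetical helper).

    `client` is assumed to behave like a boto3 S3 client, i.e. to expose
    download_file(bucket, key, filename).
    """
    def fetch(key):
        target = local_path_for(key, dest_dir)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        client.download_file(bucket, key, target)
        return target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as `keys`
        return list(pool.map(fetch, keys))
```

Usage would look roughly like `download_keys(boto3.client("s3", endpoint_url=...), "my-bucket", keys, "data/")`, where the endpoint and bucket name are placeholders for your own setup.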

  
  
Posted 5 months ago

I understand your problem well. @<1590514584836378624:profile|AmiableSeaturtle81>
Please let me know if you have answers for my problem. 😄

  
  
Posted 5 months ago

And my feeling is that pulling data with a dedicated client (e.g. minio-client for MinIO) is faster than ClearML.

  
  
Posted 5 months ago

@<1590514584836378624:profile|AmiableSeaturtle81>
Yes, I can also feel that slowness.

But what if we have millions of PDFs and need to generate an image dataset along with their bounding boxes? We cannot generate it on the fly while training the model, as that would be very, very slow.
So I'm doing all the preprocessing and then uploading the preprocessed data back to storage.
And while training, I pull the preprocessed dataset from storage again. 😄

  
  
Posted 5 months ago

@<1709740168430227456:profile|HomelyBluewhale47> We have the same problem: millions of files, stored on Ceph. I would not recommend doing it this way. Everything gets insanely slow (dataset.list_files, downloading the dataset, removing files).

The way I now use ClearML Datasets for a large number of samples is to save a JSON that stores all paths to the samples in the Dataset metadata:
clearml_dataset.set_metadata(metadata, metadata_name=metadata_key)

However, this means that you need wrappers to download the dataset.
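
A minimal sketch of what such wrappers might look like. The `set_metadata(metadata, metadata_name=...)` call is the one quoted above; reading it back with `get_metadata(metadata_name=...)` and getting a dict in return is an assumption on my side, and the helper names and the `"sample_paths"` key are hypothetical.

```python
def build_manifest(paths):
    """Build a plain dict listing every sample path (hypothetical layout)."""
    unique = sorted(set(paths))
    return {"num_samples": len(unique), "paths": unique}


def save_manifest(dataset, paths, metadata_key="sample_paths"):
    """Store the manifest in the dataset metadata instead of registering files."""
    manifest = build_manifest(paths)
    dataset.set_metadata(manifest, metadata_name=metadata_key)
    return manifest


def load_paths(dataset, metadata_key="sample_paths"):
    """Read the manifest back; assumes get_metadata returns the stored dict."""
    manifest = dataset.get_metadata(metadata_name=metadata_key)
    return list(manifest["paths"])
```

The paths returned by `load_paths` still have to be fetched from storage by your own code, which is the wrapper part.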

  
  
Posted 5 months ago

Thanks a lot @<1523701435869433856:profile|SmugDolphin23>
will go through this one.
your quick reply really means a lot. Thanks again!!

  
  
Posted 5 months ago

Yes, see minio instructions under this: None

  
  
Posted 5 months ago

Every time I start training, I first download the dataset using the MinIO client and then do further operations.

Can we easily integrate MinIO with ClearML?

  
  
Posted 5 months ago

Yes, this is what I'm doing currently, but I'm struggling to manage the versions and everything else.

  
  
Posted 5 months ago

@<1709740168430227456:profile|HomelyBluewhale47> you should be able to upload the images and download them without a problem. You could also use a cloud provider to store your files such as s3 if you believe it would speed things up

  
  
Posted 5 months ago

And just FYI: a single dataset can exceed 80 GB.

  
  
Posted 5 months ago

Hi @<1523701435869433856:profile|SmugDolphin23>
This is a follow-up question from my side.

Is it efficient to use ClearML Data for an image dataset with hundreds of thousands of images? I'm concerned about performance.

Basically, I'm training a transformer classifier, which needs at least 300-400K images.

  
  
Posted 5 months ago

Otherwise, you could run this as a hack:

        # filter the internal file table in one pass (private attribute, may change between versions)
        dataset._dataset_file_entries = {
            k: v
            for k, v in dataset._dataset_file_entries.items()
            if k not in files_to_remove  # you need to define this
        }

Then call dataset.remove_files with a path that doesn't exist in the dataset.

  
  
Posted 5 months ago

Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Looks like remove_files indeed doesn't support lists. It does support paths with wildcards though, if that helps.
As a workaround for now, I would remove all the files from the dataset and add back only the ones you need, or just create a new dataset.
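
A rough sketch of the "just create a new dataset" route, assuming the files are available locally. `sample_subset` is a hypothetical helper that picks a reproducible fraction of the paths; `create_subset_dataset` uses `Dataset.create`, `add_files`, `upload` and `finalize` from the ClearML SDK. Note that this creates a standalone dataset with no parent, so the selected files are uploaded again.

```python
import random


def sample_subset(paths, fraction=0.05, seed=0):
    """Pick a reproducible random fraction of the given paths (hypothetical helper)."""
    ordered = sorted(paths)
    if not ordered:
        return []
    count = max(1, round(len(ordered) * fraction))
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    return sorted(rng.sample(ordered, count))


def create_subset_dataset(paths, name, project):
    """Create a new dataset (no parent) containing only the given local files."""
    from clearml import Dataset  # imported lazily; requires a configured ClearML setup

    subset = Dataset.create(dataset_name=name, dataset_project=project)
    for path in paths:
        subset.add_files(path=path)
    subset.upload()
    subset.finalize()
    return subset
```

For example, `create_subset_dataset(sample_subset(all_paths), "my-subset", "my-project")`, where the dataset and project names are placeholders.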

  
  
Posted 5 months ago