@<1709740168430227456:profile|HomelyBluewhale47> We have the same problem: millions of files, stored on CEPH. I would not recommend doing it this way. Everything gets insanely slow (dataset.list_files, downloading the dataset, removing files).
The way I now use ClearML Datasets for a large number of samples is to save a JSON that stores all the sample paths in the Dataset metadata:
clearml_dataset.set_metadata(metadata, metadata_name=metadata_key)
However, this means you need wrappers to download the dataset.
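A minimal sketch of the manifest approach described above, assuming the dataset object exposes `set_metadata` and `get_metadata` the way `clearml.Dataset` does, and that `get_metadata` returns the dict as it was stored. The helper names, the `"sample_manifest"` key, and the `fetch` callback are hypothetical and only there for illustration:

```python
import json
from typing import Callable, Iterable, List

MANIFEST_KEY = "sample_manifest"  # hypothetical metadata name, pick your own


def build_manifest(paths: Iterable[str]) -> dict:
    """Collect sample paths (e.g. S3/CEPH URIs) into a JSON-serializable dict."""
    unique = sorted(set(paths))  # de-duplicate and keep a stable order
    return {"num_samples": len(unique), "paths": unique}


def save_manifest(dataset, paths: Iterable[str], key: str = MANIFEST_KEY) -> dict:
    """Attach the manifest to the dataset as metadata instead of adding files."""
    manifest = build_manifest(paths)
    json.dumps(manifest)  # fail early if something is not serializable
    dataset.set_metadata(manifest, metadata_name=key)
    return manifest


def download_samples(dataset, fetch: Callable[[str], str], key: str = MANIFEST_KEY) -> List[str]:
    """Wrapper that replaces a plain dataset download.

    `fetch` is an assumed user-supplied function that downloads one remote
    path and returns the local path (e.g. built on boto3 or the minio client).
    """
    manifest = dataset.get_metadata(metadata_name=key)
    return [fetch(path) for path in manifest["paths"]]
```

The dataset itself then holds only the manifest, so versioning stays in ClearML while the heavy transfer goes through whatever storage client is fastest for you.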
Hi @<1590514584836378624:profile|AmiableSeaturtle81> ! Looks like remove_files doesn't support lists indeed. It does support paths with wildcards though, if that helps.
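As a rough sketch of the wildcard route: since a list is not accepted, you can loop over a handful of wildcard patterns and call `remove_files` once per pattern. This assumes `remove_files` takes a `dataset_path` argument and returns the number of files removed, as `clearml.Dataset.remove_files` is documented to do; the helper name `remove_by_patterns` and the example patterns are hypothetical:

```python
from typing import Iterable


def remove_by_patterns(dataset, patterns: Iterable[str]) -> int:
    """Call dataset.remove_files once per wildcard pattern.

    Returns the total number of files the dataset reported as removed.
    """
    removed = 0
    for pattern in patterns:
        # One call per pattern, because remove_files does not accept a list.
        count = dataset.remove_files(dataset_path=pattern, recursive=True, verbose=False)
        removed += count or 0  # tolerate a None return, just in case
    return removed


# Example usage (patterns are made up):
# remove_by_patterns(dataset, ["images/broken_*.png", "tmp/*"])
```

This only pays off when the files to drop share a few common patterns; with millions of unrelated paths, one call per file would be too slow.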
I would remove all the files from the dataset and add back only the ones you need as a workaround for now, or just create a new dataset.
Every time I start training, I first download the dataset using the minio client and then do further operations.
Can we integrate minio easily with ClearML?
@<1590514584836378624:profile|AmiableSeaturtle81>
Yes, I can also feel that slowness.
But what if we have millions of PDFs and need to generate an image dataset along with their bounding boxes? We cannot generate it on the fly while training the model, as it would be very slow.
So I’m doing all the preprocessing, then uploading the preprocessed data back to storage.
And while training, I pull the preprocessed dataset from storage again. 😄
You can check out the boto3 Python client (this is what we use to download/upload all S3 stuff), but minio-client probably already uses it under the hood.
We also use the AWS CLI for some downloads; it is way faster than Python.
Regarding PDFs, yes, you have no choice but to preprocess them.
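For the boto3 route, here is a hedged sketch of a parallel download helper. It takes any client that offers `download_file(bucket, key, filename)`, which is the signature of the boto3 S3 client, and runs the downloads in a thread pool. The helper name `download_keys`, the worker count, and the endpoint in the usage comment are assumptions, not something taken from this thread:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def download_keys(client, bucket: str, keys: Iterable[str], dest_dir: str, workers: int = 16) -> List[str]:
    """Download many objects concurrently; returns local paths in input order."""

    def _one(key: str) -> str:
        # Mirror the object key under dest_dir (leading slashes stripped).
        local_path = os.path.join(dest_dir, key.lstrip("/"))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        client.download_file(bucket, key, local_path)
        return local_path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_one, keys))


# Example usage against an S3-compatible store such as MinIO
# (endpoint and credentials are placeholders):
# import boto3
# client = boto3.client("s3", endpoint_url="http://minio.example:9000",
#                       aws_access_key_id="...", aws_secret_access_key="...")
# download_keys(client, "my-bucket", ["train/0001.png"], "/data/cache")
```

Threads help here because each download is I/O bound; whether this gets close to the AWS CLI's speed depends on file sizes and the storage backend.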
Well, I can understand your problem well. @<1590514584836378624:profile|AmiableSeaturtle81>
Please let me know if you have answers for my problem. 😄
Also, my feeling is that pulling data with a dedicated client (e.g. minio-client for minio) is faster than ClearML.
Hi @<1523701435869433856:profile|SmugDolphin23>
This is a follow-up question from my side.
Is it efficient to use ClearML Data for a dataset of hundreds of thousands of images? I have a concern about performance.
Basically, I’m training a transformer classifier, which needs at least 300-400K images.
@<1709740168430227456:profile|HomelyBluewhale47> you should be able to upload the images and download them without a problem. You could also use a cloud provider such as S3 to store your files if you believe it would speed things up.
Thanks a lot @<1523701435869433856:profile|SmugDolphin23>
Will go through this one.
Your quick reply really means a lot. Thanks again!!
And just FYI: a single dataset can exceed 80GB.
Yes, this is what I’m doing currently, but I’m struggling to manage the versions and all.
Otherwise, you could run this as a hack:
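# _dataset_file_entries is a private attribute, so this may break between versions
dataset._dataset_file_entries = {
    k: v
    for k, v in dataset._dataset_file_entries.items()
    if k not in files_to_remove  # you need to define this (ideally a set of paths)
}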
then call dataset.remove_files with a path that doesn't exist in the dataset.