Hello, I'm having huge performance issues on large ClearML datasets. How can I link to a parent dataset without the parent dataset's files? I want to create a smaller subset of the parent dataset, like 5% of it. To achieve this, I have to call remove_files() to 60K it

@HomelyBluewhale47 We have the same problem: millions of files, stored on Ceph. I would not recommend doing it this way. Everything gets insanely slow (dataset.list_files, downloading the dataset, removing files).

The way I now use ClearML Datasets for a large number of samples is to save a JSON that stores the paths to all samples in the Dataset metadata:
clearml_dataset.set_metadata(metadata, metadata_name=metadata_key)
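
To make the idea concrete, here is a minimal sketch of how such a manifest could be built and attached. It is not the exact code I run: the helper names (build_manifest, store_manifest), the "sample_manifest" key, and the {"samples": [...]} layout are assumptions made for illustration. The only ClearML call used is Dataset.set_metadata, as in the line above.

```python
import random
from typing import Dict, List


def build_manifest(sample_urls: List[str], fraction: float = 0.05,
                   seed: int = 0) -> Dict[str, List[str]]:
    """Pick a reproducible random subset of sample URLs.

    Returns a JSON-serializable dict; the {"samples": [...]} layout is an
    assumption of this sketch, not a ClearML convention.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one sample when the input is non-empty.
    count = max(1, round(len(sample_urls) * fraction)) if sample_urls else 0
    rng = random.Random(seed)  # fixed seed so the subset can be rebuilt
    return {"samples": sorted(rng.sample(sample_urls, count))}


def store_manifest(dataset, manifest: Dict[str, List[str]],
                   metadata_key: str = "sample_manifest") -> None:
    """Attach the manifest to a ClearML dataset as metadata.

    `dataset` is expected to be a clearml.Dataset instance; the default
    metadata key is a hypothetical name chosen for this sketch.
    """
    dataset.set_metadata(manifest, metadata_name=metadata_key)


# Hypothetical usage (project and dataset names are placeholders):
# from clearml import Dataset
# ds = Dataset.create(dataset_name="subset-5pct", dataset_project="my_project")
# store_manifest(ds, build_manifest(all_sample_urls, fraction=0.05))
```

The subset dataset then carries only a list of paths rather than 60K file entries, so nothing has to be removed from a parent dataset.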

However, this then means that you need wrappers to download the dataset, because the samples are no longer tracked as dataset files.
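
A wrapper of that kind could look roughly like the sketch below. It assumes the metadata is a dict of the form {"samples": [url, ...]} stored under a key named "sample_manifest"; both are hypothetical choices for this example, as is the function name. It reads the metadata with Dataset.get_metadata and, unless another fetch function is passed in, downloads each URL with StorageManager.get_local_copy.

```python
from typing import Callable, List, Optional


def download_manifest_samples(
    dataset,
    metadata_key: str = "sample_manifest",
    fetch: Optional[Callable[[str], Optional[str]]] = None,
) -> List[Optional[str]]:
    """Download every sample listed in the dataset's metadata manifest.

    `dataset` is expected to be a clearml.Dataset instance. `fetch` maps a
    remote URL to a local path; it is injectable so the wrapper can be
    exercised without a storage backend.
    """
    if fetch is None:
        # Imported lazily so the module loads even without clearml installed.
        from clearml import StorageManager

        def fetch(url: str) -> Optional[str]:
            return StorageManager.get_local_copy(remote_url=url)

    manifest = dataset.get_metadata(metadata_name=metadata_key)
    # Assumed layout: {"samples": ["s3://...", ...]}
    return [fetch(url) for url in manifest["samples"]]


# Hypothetical usage (the dataset ID is a placeholder):
# from clearml import Dataset
# ds = Dataset.get(dataset_id="<dataset id>")
# local_paths = download_manifest_samples(ds)
```

Fetching files one by one like this is sequential; with millions of samples you would probably want to batch or parallelize the calls.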

Posted 6 months ago
