Hi, can I ask how to use ClearML Datasets compared with PyTorch datasets/dataloaders? In particular, PyTorch dataloaders can pull data in batches, preprocess it on multiple CPUs, and feed it into the training loop, keeping CPU and GPU utilisation high at the same time. When using ClearML Datasets, what is the best practice for achieving this? Either with PyTorch or with anything that's built into ClearML?

Posted 2 years ago
Hi SubstantialElk6 ,

That's an interesting idea. If you want to preprocess a lot of data, I think the best approach would be to use multiple datasets (one per process) or different versions of a dataset. I think you can also pull specific chunks of a dataset, in which case a single dataset would be enough, but I'm not sure about that last point.

What do you think?

Posted 2 years ago

SubstantialElk6 , I think this is what you're looking for:
Dataset.get_local_copy(..., part=X)
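
A minimal sketch of how that call could be used so each worker pulls only its own slice. As far as I understand the docs, a dataset is stored as one or more compressed chunk archives (their size is set at upload time), `part` is a zero-based index, and `num_parts` says how many parts to split the chunks into; if `num_parts` is omitted it defaults to the total number of chunks. So a part is a group of archives, not a group of images or documents. The helper names `shard_kwargs` and `fetch_shard` and the rank/world-size arguments are made up for illustration.

```python
def shard_kwargs(rank: int, world_size: int) -> dict:
    """Build the part/num_parts arguments for one worker (hypothetical helper)."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    # part is zero-based; the last valid part is num_parts - 1
    return {"part": rank, "num_parts": world_size}


def fetch_shard(dataset_id: str, rank: int, world_size: int) -> str:
    """Download only this worker's part and return the local folder path."""
    # Imported here so shard_kwargs can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    return dataset.get_local_copy(**shard_kwargs(rank, world_size))
```

Each process would then call something like `fetch_shard("<your dataset id>", rank, world_size)` and point a regular PyTorch `Dataset`/`DataLoader` at the returned folder, so batching and multi-CPU preprocessing stay on the PyTorch side.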

Posted 2 years ago

Thanks CostlyOstrich36 , how do I know how the parts are indexed in the first place? Or rather, how are chunks and parts defined? Say, in the context of images, videos, text documents, etc.

Posted 2 years ago

Although I think you can also pull specific chunks of dataset

How do you do that with clearml-data?

Posted 2 years ago

I think this might also be helpful. Skim through the functions available in the documentation, I think you might find what you're looking for 🙂

Posted 2 years ago
