My current approach is, watch a folder, when there are sufficient data points, just move N of them into another folder and create a raw dataset and call the pipeline with this dataset.
It gets downloaded, preprocessed, and then uploaded again.
In the final step, the preprocessed dataset is downloaded and is used to train the model.
You should be able to get a local mutable copy using
Dataset.get_mutable_local_copy and then creating a new dataset.
But personally I prefer this workflow:
dataset = Dataset.get(dataset_project=CLEARML_PROJECT, dataset_name=CLEARML_DATASET_NAME, auto_create=True, writable_copy=True) dataset.add_files(path=save_path, dataset_path=save_path) dataset.finalize(auto_upload=True)
writable_copy argument gets a dataset and creates a child of it (a new dataset with your selected one as parent). In this way you can just add some files and upload the whole thing. It will now contain everything your previous dataset did + the files you added AND keep track of the previous dataset. In this way clearml knows not to upload the data that was already there, it will only upload your newly added files.
auto_create will create a dataset is none exist yet
auto_upload=True is basically the same as first uploading and then finalizing
These 3 lines use functionality that's only just available, so make sure to have the latest clearml version :)
Sorry for the late response. Agreed, that can work, although I would prefer a way to access the data by M number of batches added instead of a certain range, since these cases aren't interchangeable. Also a simple thing that can be done is that you can create an empty Dataset in the start, and then make it the parent of every dataset you add.
That makes sense! Maybe something like dataset querying as is used in the clearml hyperdatasets might be useful here? Basically you'd query your dataset to only include sample you want and have the query itself be a hyperparameter in your experiment?
Cool! 😄 Yeah, that makes sense.
So (just brainstorming here) imagine you have your dataset with all samples inside. Every time N new samples arrive they're just added to the larger dataset in an incremental way (with the 3 lines I sent earlier).
So imagine if we could query/filter that large dataset to only include a certain datetime range. That range filter is then stored as hyperparameter too, so in that case, you could easily rerun the same training task multiple times, on different amounts of data, by just changing the daterange parameter in the interface. It could help to find out the best interval to take maybe?
I'm just asking you if that would make sense, because I've been thinking about this functionality for my own usecases too 🙂 Would be cool to contribute it
Well I'm still researching how it'll work. I'm expecting it to not be very good and will make the model learning very stochastic in nature.
I plan to instead at the training stage, instead of just getting this model, use Dataset.squash, to get previous M datasets merged together.
This should introduce stability in the dataset.
Also this way, our model is trained on a batch of data multiple times but only for a few times before that batch is discarded. We keep the training data fresh for continuous training hopefully reducing data drift caused by time.
I already have the dataset id as a hyperparameter. I get said dataset. I'm only handling one dataset right now but merging multiple ones is a simple task as well.
Also I'm not very experienced and am unsure what proposed querying is and how and if it works in ClearML here.