Creating a new dataset object for each batch lets me publish each batch on its own, which makes the batches immutable.
It's part of the design, I think. It makes sense that if we want to keep track of changes, we always build on top of what we already have 🙂 I think of it like a commit: I'm adding files in a NEW commit, not in the old one.
Sorry for the late response. Agreed, that can work, although I would prefer a way to access the data by M batches added rather than by a certain range, since these cases aren't interchangeable. Another simple option is to create an empty Dataset at the start and then make it the parent of every dataset you add.
Also, since I plan to not train on the whole dataset and instead only on a subset of the data, I was thinking of making each batch of data a new dataset and then just merging the subset of data I want to train on.
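A rough sketch of that per-batch idea, assuming every batch is already a finalized ClearML dataset and that passing several IDs as parent_datasets to Dataset.create is how the subset gets merged. The pick_batches helper and the name/project arguments are made up for illustration.

```python
from typing import List, Sequence


def pick_batches(batch_ids: Sequence[str], wanted: Sequence[int]) -> List[str]:
    """Return the dataset IDs at the given positions (hypothetical helper)."""
    return [batch_ids[i] for i in wanted]


def merge_batches(parent_ids: Sequence[str], name: str, project: str) -> str:
    """Create one dataset that inherits the files of all given batch datasets."""
    # Imported lazily so pick_batches can be used without clearml installed.
    from clearml import Dataset

    merged = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=list(parent_ids),
    )
    # No new files are added here; the merged dataset only builds on its parents.
    merged.finalize(auto_upload=True)
    return merged.id
```

The merged dataset ID could then be handed to the training task like any other dataset ID.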
Wait, is it possible to do what I'm doing but with just one big Dataset object or something?
Don't know if that's possible yet, but maybe something like the proposed querying could help here?
I already have the dataset id as a hyperparameter. I get said dataset. I'm only handling one dataset right now but merging multiple ones is a simple task as well.
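For reference, a minimal sketch of passing dataset IDs as a hyperparameter, assuming the IDs live in a comma-separated string under a made-up key dataset_ids. The project and task names are placeholders too.

```python
from typing import Dict, List


def parse_dataset_ids(params: Dict[str, str]) -> List[str]:
    """Split a comma-separated 'dataset_ids' hyperparameter into a clean list."""
    raw = str(params.get("dataset_ids", ""))
    return [part.strip() for part in raw.split(",") if part.strip()]


def fetch_local_copies(params: Dict[str, str]) -> List[str]:
    """Register the parameters on the task and download each dataset read-only."""
    # Imported lazily so parse_dataset_ids can be used without clearml installed.
    from clearml import Dataset, Task

    task = Task.init(project_name="examples", task_name="train on dataset ids")
    # After connect(), 'dataset_ids' shows up as an editable hyperparameter.
    params = task.connect(params)
    return [
        Dataset.get(dataset_id=dataset_id).get_local_copy()
        for dataset_id in parse_dataset_ids(params)
    ]
```

With a single ID this reduces to the one-dataset case; several IDs give one local folder per dataset.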
Also, I'm not very experienced, and I'm unsure what the proposed querying is, and whether and how it works in ClearML.
That makes sense! Maybe something like the dataset querying used in ClearML Hyper-Datasets might be useful here? Basically, you'd query your dataset to include only the samples you want and have the query itself be a hyperparameter in your experiment.
Hi Fawad!
You should be able to get a local mutable copy using Dataset.get_mutable_local_copy and then create a new dataset.
But personally I prefer this workflow:
dataset = Dataset.get(dataset_project=CLEARML_PROJECT, dataset_name=CLEARML_DATASET_NAME, auto_create=True, writable_copy=True)
dataset.add_files(path=save_path, dataset_path=save_path)
dataset.finalize(auto_upload=True)
The writable_copy argument gets a dataset and creates a child of it (a new dataset with your selected one as parent). This way you can just add some files and upload the whole thing. The new dataset will contain everything your previous dataset did plus the files you added, AND it keeps track of the previous dataset. As a result, clearml knows not to upload the data that was already there; it will only upload your newly added files.
auto_create will create a dataset if none exists yet.
auto_upload=True is basically the same as first uploading and then finalizing.
These 3 lines use functionality that was only added recently, so make sure you have the latest clearml version :)
I was looking to see if I could get away with using get_local_copy instead of the mutable one, but I guess the mutable copy is unavoidable.
So you train the model only on those N preprocessed data points then? Never combined with the previous datapoints before N?
Well, I'm still researching how it'll work. I expect it won't be very good and will make the model's learning very stochastic.
Instead, at the training stage, rather than just getting this one dataset, I plan to use Dataset.squash to merge the previous M datasets together.
This should introduce stability in the dataset.
Also, this way our model is trained on each batch of data more than once, but only a few times before that batch is discarded. That keeps the training data fresh for continuous training, hopefully reducing data drift over time.
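A minimal sketch of that sliding window, assuming the batch dataset IDs are kept in a list ordered oldest to newest. The last_m helper and the dataset name argument are illustrative; only Dataset.squash is the ClearML call.

```python
from typing import List, Sequence


def last_m(batch_ids: Sequence[str], m: int) -> List[str]:
    """Return the IDs of the M most recent batches, oldest first."""
    if m <= 0:
        return []
    return list(batch_ids[-m:])


def squash_recent(batch_ids: Sequence[str], m: int, name: str):
    """Squash the M most recent batch datasets into a single dataset."""
    # Imported lazily so last_m can be used without clearml installed.
    from clearml import Dataset

    return Dataset.squash(dataset_name=name, dataset_ids=last_m(batch_ids, m))
```

Each time a new batch ID is appended to the list, the oldest batch falls out of the window, so every batch is seen in at most M training runs.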
My current approach is to watch a folder; when there are enough data points, I move N of them into another folder, create a raw dataset from it, and call the pipeline with this dataset.
It gets downloaded, preprocessed, and then uploaded again.
In the final step, the preprocessed dataset is downloaded and is used to train the model.
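The folder-watching step could look roughly like this. It is a sketch using only the standard library; picking the oldest files first (by modification time) is an assumption, as are the function and folder names.

```python
import shutil
from pathlib import Path
from typing import List


def move_batch(watch_dir: Path, batch_dir: Path, n: int) -> List[Path]:
    """Move the N oldest files from watch_dir into batch_dir.

    Returns the new paths, or an empty list if fewer than N files are waiting.
    """
    files = sorted(
        (p for p in watch_dir.iterdir() if p.is_file()),
        key=lambda p: (p.stat().st_mtime, p.name),
    )
    if n <= 0 or len(files) < n:
        return []
    batch_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in files[:n]:
        dst = batch_dir / src.name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

A non-empty return value would be the trigger for creating the raw dataset from batch_dir and starting the pipeline.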
Cool! 😄 Yeah, that makes sense.
So (just brainstorming here) imagine you have your dataset with all samples inside. Every time N new samples arrive they're just added to the larger dataset in an incremental way (with the 3 lines I sent earlier).
So imagine if we could query/filter that large dataset to only include a certain datetime range. That range filter is then stored as a hyperparameter too, so you could easily rerun the same training task multiple times, on different amounts of data, just by changing the date range parameter in the interface. Maybe it could help find the best interval to use?
I'm just asking you if that would make sense, because I've been thinking about this functionality for my own use cases too 🙂 Would be cool to contribute it.
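To make the brainstorm concrete, here is a plain-Python sketch of such a range filter. This is not an existing ClearML feature. It assumes you already have a mapping from sample path to arrival time (how that is recorded is left open), and that start and end are ISO date strings that would be the hyperparameters.

```python
from datetime import datetime
from typing import Dict, List


def filter_by_date_range(
    sample_times: Dict[str, datetime], start: str, end: str
) -> List[str]:
    """Return sample paths whose timestamp falls in [start, end), sorted by path."""
    lo = datetime.fromisoformat(start)
    hi = datetime.fromisoformat(end)
    return sorted(path for path, ts in sample_times.items() if lo <= ts < hi)
```

Rerunning the task with a different start or end would then select a different slice of the same large dataset.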