When using Dataset.get_local_copy(), once I get the location, can I add another folder inside the location, add some files to it, create a new Dataset object, and then do Dataset.upload(location)? Should this work? Or since it's get_local_copy, I won't be able...

Answered

When using Dataset.get_local_copy(), once I get the location, can I add another folder inside that location, add some files to it, create a new Dataset object, and then do Dataset.upload(location)? Should this work? Or, since it's get_local_copy, will I be unable to mutate it because it is described as immutable?

  				
Posted 3 years ago by VexedCat68


Answers 14

Also, since I plan to not train on the whole dataset and instead only on a subset of the data, I was thinking of making each batch of data a new dataset and then just merging the subset of data I want to train on.

  				
Posted 3 years ago by VexedCat68

My current approach is to watch a folder and, once there are enough data points, move N of them into another folder, create a raw dataset from that folder, and call the pipeline with this dataset.

It gets downloaded, preprocessed, and then uploaded again.

In the final step, the preprocessed dataset is downloaded and is used to train the model.
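The folder-watching step described above could be sketched roughly like this. It is only a sketch under assumptions: the function names (move_batch, create_raw_dataset) and the "oldest files first" rule are made up for illustration and are not from the thread. The ClearML import is done lazily so the file-moving part runs without a server.

```python
import shutil
from pathlib import Path
from typing import List


def move_batch(watch_dir: str, batch_dir: str, n: int) -> List[Path]:
    """Move the n oldest files from watch_dir into batch_dir once at least n exist.

    Returns the moved paths, or an empty list if fewer than n files are waiting.
    """
    files = sorted(
        (p for p in Path(watch_dir).iterdir() if p.is_file()),
        key=lambda p: (p.stat().st_mtime, p.name),  # oldest first, name as tiebreaker
    )
    if len(files) < n:
        return []
    target = Path(batch_dir)
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in files[:n]:
        dst = target / src.name
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved


def create_raw_dataset(batch_dir: str, project: str, name: str) -> str:
    """Register batch_dir as a new ClearML dataset and return its id."""
    from clearml import Dataset  # lazy import: only needed when actually uploading

    dataset = Dataset.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=batch_dir)
    dataset.finalize(auto_upload=True)  # upload, then finalize
    return dataset.id
```

A scheduler or a simple loop could call move_batch periodically and only call create_raw_dataset when it returns a non-empty list.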

  				
Posted 3 years ago by VexedCat68

So you train the model only on those N preprocessed data points then? Never combined with the previous datapoints before N?

  				
Posted 3 years ago by ExasperatedCrab78

Cool! 😄 Yeah, that makes sense.

So (just brainstorming here) imagine you have your dataset with all samples inside. Every time N new samples arrive they're just added to the larger dataset in an incremental way (with the 3 lines I sent earlier).
So imagine if we could query/filter that large dataset to only include a certain datetime range. That range filter is then stored as hyperparameter too, so in that case, you could easily rerun the same training task multiple times, on different amounts of data, by just changing the daterange parameter in the interface. It could help to find out the best interval to take maybe?

I'm just asking you if that would make sense, because I've been thinking about this functionality for my own usecases too 🙂 Would be cool to contribute it
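To make the idea concrete, here is a rough sketch of what such a date-range filter might look like on a local copy. This is not an existing ClearML feature. It assumes a hypothetical naming convention where each file name starts with a YYYYMMDD prefix, and the helper names (files_in_range, connect_date_range) are invented for illustration. The only ClearML call used is task.connect, which registers a dictionary as editable hyperparameters.

```python
from datetime import date, datetime
from pathlib import Path
from typing import List


def parse_day(text: str) -> date:
    """Parse a YYYYMMDD string into a date (raises ValueError if malformed)."""
    return datetime.strptime(text, "%Y%m%d").date()


def files_in_range(folder: str, date_from: str, date_to: str) -> List[Path]:
    """Return files whose YYYYMMDD name prefix lies in [date_from, date_to]."""
    start, end = parse_day(date_from), parse_day(date_to)
    selected = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        try:
            day = parse_day(path.name[:8])
        except ValueError:
            continue  # skip files that do not follow the assumed naming convention
        if start <= day <= end:
            selected.append(path)
    return selected


def connect_date_range(task, date_from: str, date_to: str):
    """Expose the range as hyperparameters on an already initialised clearml Task."""
    return task.connect({"date_from": date_from, "date_to": date_to})
```

With the range connected to the task, rerunning the same training on a different window would only need the two values changed in the UI, and the training code would pass them to files_in_range on the dataset's local copy.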

  				
Posted 3 years ago by ExasperatedCrab78

It's part of the design, I think. It makes sense that if we want to keep track of changes, we always build on top of what we already have 🙂 I think of it like a commit: I'm adding files in a NEW commit, not in the old one.

  				
Posted 3 years ago by ExasperatedCrab78

That makes sense! Maybe something like the dataset querying used in ClearML hyperdatasets might be useful here? Basically you'd query your dataset to only include the samples you want and have the query itself be a hyperparameter in your experiment?

  				
Posted 3 years ago by ExasperatedCrab78

I already have the dataset id as a hyperparameter. I get said dataset. I'm only handling one dataset right now but merging multiple ones is a simple task as well.

Also, I'm not very experienced and am unsure what the proposed querying is, and whether and how it works in ClearML here.

  				
Posted 3 years ago by VexedCat68

Hi Fawad!
You should be able to get a local mutable copy using Dataset.get_mutable_local_copy and then create a new dataset from it.
But personally I prefer this workflow:

dataset = Dataset.get(dataset_project=CLEARML_PROJECT, dataset_name=CLEARML_DATASET_NAME, auto_create=True, writable_copy=True)
dataset.add_files(path=save_path, dataset_path=save_path)
dataset.finalize(auto_upload=True)

The writable_copy argument gets a dataset and creates a child of it (a new dataset with your selected one as parent). In this way you can just add some files and upload the whole thing. It will now contain everything your previous dataset did + the files you added AND keep track of the previous dataset. In this way clearml knows not to upload the data that was already there, it will only upload your newly added files.

auto_create will create a dataset if none exists yet
auto_upload=True is basically the same as first uploading and then finalizing

These 3 lines use functionality that's only just available, so make sure to have the latest clearml version :)
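For comparison, the first option (mutable local copy, then a new dataset) might look roughly like the sketch below. The helper names, the "extra" subfolder and the function arguments are assumptions made up for illustration. The ClearML calls used are Dataset.get, get_mutable_local_copy, Dataset.create with parent_datasets, sync_folder and finalize, so please check them against the clearml version you have installed.

```python
import shutil
from pathlib import Path
from typing import List


def stage_new_files(local_copy: str, new_files_dir: str, subfolder: str) -> List[str]:
    """Copy files from new_files_dir into local_copy/subfolder.

    Returns the paths that were added, relative to local_copy.
    """
    target = Path(local_copy) / subfolder
    target.mkdir(parents=True, exist_ok=True)
    added = []
    for src in sorted(Path(new_files_dir).iterdir()):
        if src.is_file():
            shutil.copy2(src, target / src.name)
            added.append(f"{subfolder}/{src.name}")
    return added


def upload_as_child(parent_id: str, target_folder: str, new_files_dir: str,
                    project: str, name: str) -> str:
    """Create a child dataset containing the parent's files plus the new ones."""
    from clearml import Dataset  # lazy import: only needed when talking to the server

    parent = Dataset.get(dataset_id=parent_id)
    # Unlike get_local_copy, this returns a folder you are allowed to modify.
    local_copy = parent.get_mutable_local_copy(target_folder=target_folder)
    stage_new_files(local_copy, new_files_dir, subfolder="extra")

    child = Dataset.create(dataset_name=name, dataset_project=project,
                           parent_datasets=[parent_id])
    child.sync_folder(local_path=local_copy)  # registers only what changed vs. the parent
    child.finalize(auto_upload=True)
    return child.id
```

Because the child lists the original dataset as its parent, only the files under the new subfolder should need uploading, which matches the behaviour described for writable_copy.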

  				
Posted 3 years ago by ExasperatedCrab78

Wait, is it possible to do what I'm doing but with just one big Dataset object or something?

  				
Posted 3 years ago by VexedCat68

"Wait, is it possible to do what I'm doing but with just one big Dataset object or something?"

Don't know if that's possible yet, but maybe something like the proposed querying could help here?

  				
Posted 3 years ago by ExasperatedCrab78

I was looking to see if I can just get away with using get_local_copy instead of the mutable one but I guess that is unavoidable.

  				
Posted 3 years ago by VexedCat68

Sorry for the late response. Agreed, that can work, although I would prefer a way to access the data by M number of batches added instead of a certain range, since these cases aren't interchangeable. Also a simple thing that can be done is that you can create an empty Dataset in the start, and then make it the parent of every dataset you add.

  				
Posted 3 years ago by VexedCat68

Creating a new dataset object for each batch allows me to just publish said batches, which makes them immutable.

  				
Posted 3 years ago by VexedCat68

Well, I'm still researching how it'll work. I'm expecting it not to be very good, and it will make the model's learning very stochastic in nature.

Instead, at the training stage, rather than just getting this model, I plan to use Dataset.squash to merge the previous M datasets together.

This should introduce stability in the dataset.

Also this way, our model is trained on a batch of data multiple times but only for a few times before that batch is discarded. We keep the training data fresh for continuous training hopefully reducing data drift caused by time.
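A minimal sketch of that "last M batches" idea, assuming the batch dataset ids are kept in a list ordered from oldest to newest (how that list is stored is not covered in this thread, and the helper names are made up). Dataset.squash is called with dataset_name and dataset_ids.

```python
from typing import List


def last_m_batches(batch_ids: List[str], m: int) -> List[str]:
    """Return the m most recent ids, assuming batch_ids is ordered oldest to newest."""
    if m <= 0:
        return []
    return batch_ids[-m:]


def squash_recent(batch_ids: List[str], m: int, name: str) -> str:
    """Merge the m most recent batch datasets into one new dataset; return its id."""
    from clearml import Dataset  # lazy import: only needed when talking to the server

    merged = Dataset.squash(dataset_name=name,
                            dataset_ids=last_m_batches(batch_ids, m))
    return merged.id
```

Each new batch then shifts the window by one, so a given batch is used in at most M training runs before it drops out.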

  				
Posted 3 years ago by VexedCat68
