Hi Everyone, Now I Am Evaluating Clearml. I Have A Question About How To Handle Datasets. Does Clearml Provide Any Function To Manage Datasets? Or Do We Need To Manage Them By Ourselves? In Our Usecase, We Update Datasets Little By Little Over Days Or W

Answered

Hi everyone, I am currently evaluating ClearML.

I have a question about how to handle datasets.
Does ClearML provide any functionality for managing datasets?
Or do we need to manage them ourselves?

In our use case, we update datasets little by little over days or weeks, and run experiments against the updated datasets accordingly.
As a result, the combinations of samples, datasets, and experiments quickly become numerous.
We would like to track which samples belong to which dataset and which dataset is consumed by which experiment.

If ClearML has anything that makes this easier, that would be nice.
Currently, our data is stored in S3.

  				
Posted 4 years ago by SoggyFrog26


Answers 13

Is there an example of how to use clearml-data?

  				
Posted 4 years ago by SoggyFrog26

Hi JitteryCoyote63,
Oh, you already have something, nice!
I will look into that document, thanks!

  				
Posted 4 years ago by SoggyFrog26

Does it handle data simply as regular files?

  				
Posted 4 years ago by SoggyFrog26

Yeah, now that I know that, the CLI looks much more familiar to me.

  				
Posted 4 years ago by SoggyFrog26

(I am not part of the awesome ClearML team, just a happy user 🙂 )

  				
Posted 4 years ago by JitteryCoyote63

Oh, thanks 🙂

  				
Posted 4 years ago by SoggyFrog26

I will let the team answer you on that one 🙂

  				
Posted 4 years ago by JitteryCoyote63

BattyLion34, the closest thing I can think of is the monitoring class, which can easily be extended.
Datasets are a type of Task, so we can monitor a project and trigger an action when we see a change in the number of completed Tasks/Datasets.
Monitoring class:
https://github.com/allegroai/clearml/blob/master/clearml/automation/monitor.py
Monitoring example:
https://github.com/allegroai/clearml/blob/master/examples/services/monitoring/slack_alerts.py

I think a dataset monitoring example will be quite cool.
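
As a rough illustration of that idea, here is a minimal sketch of the polling logic only. It is not the ClearML `Monitor` class: `DatasetChangeWatcher`, `fetch_completed_ids`, and `on_new_dataset` are hypothetical names, and the fetch function is a stand-in for whatever query returns the IDs of completed dataset tasks in a project.

```python
from typing import Callable, Iterable, List, Optional, Set


class DatasetChangeWatcher:
    """Fires a callback for every completed dataset ID not seen before.

    Hypothetical helper for illustration; the real fetch function would
    query the ClearML server for completed dataset tasks in a project.
    """

    def __init__(
        self,
        fetch_completed_ids: Callable[[], Iterable[str]],
        on_new_dataset: Callable[[str], None],
    ) -> None:
        self._fetch = fetch_completed_ids
        self._on_new = on_new_dataset
        self._seen: Optional[Set[str]] = None  # None until the first poll

    def poll_once(self) -> List[str]:
        """Check for new completed datasets and return the IDs that fired."""
        current = set(self._fetch())
        if self._seen is None:
            # First poll only records the baseline; nothing is triggered.
            self._seen = current
            return []
        new_ids = sorted(current - self._seen)
        for dataset_id in new_ids:
            self._on_new(dataset_id)  # e.g. start a training pipeline
        self._seen |= current
        return new_ids


if __name__ == "__main__":
    completed = ["ds-001"]   # pretend server state
    triggered: List[str] = []
    watcher = DatasetChangeWatcher(lambda: list(completed), triggered.append)
    watcher.poll_once()            # baseline, nothing fires
    completed.append("ds-002")     # a new dataset version is finalized
    watcher.poll_once()
    print(triggered)               # ['ds-002']
```

In a real service, `poll_once` would be called on a timer, and the callback would enqueue the training task or pipeline.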

  				
Posted 4 years ago by AgitatedDove14

Hi SoggyFrog26, see https://github.com/allegroai/clearml/blob/master/docs/datasets.md

  				
Posted 4 years ago by JitteryCoyote63

There is an example in the https://github.com/allegroai/clearml/blob/master/docs/datasets.md#workflow section of the link I shared above.
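
For readers who want a quick picture before opening the link, the Python side of such a workflow might look roughly like the sketch below. It assumes the `clearml` package is installed and configured against a server; the function names `register_dataset` and `fetch_dataset`, and the project, dataset, and folder values, are placeholders invented for illustration. The linked document remains the authoritative reference.

```python
from typing import Optional


def register_dataset(project: str, name: str, folder: str,
                     parent_id: Optional[str] = None) -> str:
    """Create a dataset version from a local folder and return its ID."""
    # Imported lazily so this module loads even without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(
        dataset_project=project,
        dataset_name=name,
        # Linking to a parent makes this a new version of an earlier dataset.
        parent_datasets=[parent_id] if parent_id else None,
    )
    dataset.add_files(path=folder)  # register the files under the folder
    dataset.upload()                # push the file contents to storage
    dataset.finalize()              # close the version so it can be consumed
    return dataset.id


def fetch_dataset(project: str, name: str) -> str:
    """Return a local (cached) folder path holding the dataset's files."""
    from clearml import Dataset

    dataset = Dataset.get(dataset_project=project, dataset_name=name)
    return dataset.get_local_copy()
```

An experiment script would call `fetch_dataset(...)` at the start of training and read its files from the returned path.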

  				
Posted 4 years ago by JitteryCoyote63

JitteryCoyote63, is there an example of how the training pipeline can be triggered (started) by changes in a dataset?

  				
Posted 4 years ago by BattyLion34

This is no coincidence: any data versioning tool you find (DVC, etc.) is fairly close to how git works, since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out, in my opinion, is the straightforward CLI combined with the Pythonic API, which lets you register and retrieve datasets very easily.
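
To make the "datasets are just files" point concrete, here is a small conceptual sketch of how a git-like tool can tell two dataset versions apart by hashing file contents. It is not how clearml-data is implemented internally; `snapshot` and `diff_versions` are hypothetical helpers, and the in-memory byte strings stand in for real files.

```python
import hashlib
from typing import Dict, List


def snapshot(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each relative path to the SHA-256 hex digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def diff_versions(parent: Dict[str, str], child: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots and list added, removed, and modified paths."""
    added = sorted(p for p in child if p not in parent)
    removed = sorted(p for p in parent if p not in child)
    modified = sorted(p for p in child if p in parent and child[p] != parent[p])
    return {"added": added, "removed": removed, "modified": modified}


if __name__ == "__main__":
    v1 = snapshot({"a.jpg": b"cat", "b.jpg": b"dog"})
    v2 = snapshot({"a.jpg": b"cat", "b.jpg": b"wolf", "c.jpg": b"bird"})
    print(diff_versions(v1, v2))
    # {'added': ['c.jpg'], 'removed': [], 'modified': ['b.jpg']}
```

A versioned store only needs to keep the added and modified files for each new version, which is why small incremental updates over days or weeks stay cheap.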

  				
Posted 4 years ago by JitteryCoyote63

Hm, clearml-data looks very much like git.

  				
Posted 4 years ago by SoggyFrog26
