Hi, I'm using ClearML Datasets. How do I tell from the ClearML UI which dataset version I am using?

Answered

Hi, I'm using ClearML Datasets. How do I tell from the ClearML UI which dataset version I am using?

  				
Posted 3 years ago by SubstantialElk6


Answers 17

Thanks this would be a good alternative before the enterprise version comes in. How is this different from argparser btw?

  				
Posted 3 years ago by SubstantialElk6

SubstantialElk6, can you view the dataset in the UI? Can you please provide a screenshot so I can mark it up for you?

  				
Posted 3 years ago by CostlyOstrich36

Sorry AgitatedDove14 can you bump me to that thread?

  				
Posted 3 years ago by SubstantialElk6

Sorry AgitatedDove14, I missed your reply. So this means that in the community version, when I have an experiment using ClearML and it uses the ClearML Datasets SDK, the dataset ID that was used will not be reflected on the ClearML experiment in any way, making it impossible to establish data lineage/provenance (e.g. linking the data used to the experiment). This feature is however available in the Enterprise Version as HyperDatasets. Am I correct?
Code example:
from clearml import Task, Logger
task = Task.init(project_name='DETECTRON2', task_name='Default Model Architecture', task_type='training', output_uri=' ` ')
task.set_base_docker("quay.io/detectron2:v4 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="1xV100-4ram", exit_process=True)

dataset_id = "83cfb45cfcbb4a8293ed9f14a2c562c0"
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

  				
Posted 3 years ago by SubstantialElk6

AgitatedDove14 , i'm Jax, not Manoj! lol. 😅 😅

  				
Posted 3 years ago by SubstantialElk6

How do I tell from the ClearML UI which datasets version am I using?

Hi SubstantialElk6, what exactly do you mean by "ClearML UI which datasets am I using"? Do you mean: is there some automagic that adds the dataset ID when you call Dataset.get() in your code? (Because if you do, I specifically remember discussing adding this feature a few days ago, and you just bumped its priority 😉)

  				
Posted 3 years ago by AgitatedDove14

can you bump me to that thread?

https://clearml.slack.com/archives/CTK20V944/p1630610430171200

I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet.

Okay this part I missed, why would you need to add additional "catalog" when you have the UI?

  				
Posted 3 years ago by AgitatedDove14

i'm Jax, not Manoj! lol.

I know 😄 I just mentioned that this issue is being actively discussed

  				
Posted 3 years ago by AgitatedDove14

SubstantialElk6 , do you mean the dataset task version?

  				
Posted 3 years ago by CostlyOstrich36

Okay this part I missed, why would you need to add additional "catalog" when you have the UI?

Yeah, this is the part I am trying to reconcile. I don't see any UI for datasets. Or is this a feature of HyperDatasets and I just mixed them up?

  				
Posted 3 years ago by SubstantialElk6

I meant the dataset id.

  				
Posted 3 years ago by SubstantialElk6

feature is however available in the Enterprise Version as HyperDatasets. Am i correct?

Correct
BTW you could do:
datasets_used = dict(dataset_id="83cfb45cfcbb4a8293ed9f14a2c562c0")
task.connect(datasets_used, name='datasets')
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=datasets_used['dataset_id']).get_local_copy()
This will ensure that not only do you have a new section called "datasets" in the Task's configuration, but you will also be able to replace the dataset_id from the UI and launch using the agent.

  				
Posted 3 years ago by AgitatedDove14

So the context of my question is: I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet. And for each experiment, I'll need to go into the code commit to see which ID is being used. On the other hand, I thought I had seen advertised use cases where the experiment can be directly linked to the dataset ID being used. My memory is a bit rusty on how it was done.

  				
Posted 3 years ago by SubstantialElk6

How is this different from argparser btw?

Not different, just a dedicated section 🙂 Maybe we should do that automatically, the only "downside" is you will have to name the Dataset when getting it (so it will have an entry name in the Dataset section), wdyt ?
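To make the comparison concrete, here is a minimal sketch of the argparse route. It assumes, as the ClearML documentation describes, that argparse arguments are captured automatically once `Task.init()` has been called, so they show up with the task's hyperparameters rather than in a dedicated section. The `--dataset-id` flag and the `parse_args` helper are hypothetical names chosen for illustration, and the ClearML calls are left as comments so the snippet runs without a server.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options; --dataset-id is a hypothetical flag name."""
    parser = argparse.ArgumentParser(description="Training entry point")
    parser.add_argument(
        "--dataset-id",
        default="83cfb45cfcbb4a8293ed9f14a2c562c0",  # dataset ID quoted in this thread
        help="ClearML dataset ID to train on",
    )
    return parser.parse_args(argv)


# With ClearML installed and configured, Task.init() would be called first so
# that the parsed arguments are recorded on the task (assumption, see above):
#   from clearml import Task, Dataset
#   task = Task.init(project_name="DETECTRON2", task_name="Default Model Architecture")

# Simulate `python train.py --dataset-id abc123`
args = parse_args(["--dataset-id", "abc123"])
print(args.dataset_id)

#   dataset_path = Dataset.get(dataset_id=args.dataset_id).get_local_copy()
```

Either way the ID ends up as an editable value on the task; the difference is only where it is displayed.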

  				
Posted 3 years ago by AgitatedDove14

So we basically have two options: one is that when you call Dataset.get_local_copy(), we register it on the Task automatically; the other is more explicit, with something like:
ds = Dataset.get(...)
folder = ds.get_local_copy()
task.connect(ds, name='train')
...
ds_val = Dataset.get(...)
folder = ds_val.get_local_copy()
task.connect(ds_val, name='validate')
wdyt?
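Until something like the explicit option exists, a similar effect can probably be approximated with the dictionary form of `task.connect()` suggested earlier in the thread. The sketch below is only an illustration under assumptions: `resolve_datasets` is a hypothetical helper, and `task` and `dataset_cls` are passed in so that a real `clearml.Task` and `clearml.Dataset` can be used in practice while the stand-in classes (also made up) let it run here without a server.

```python
def resolve_datasets(task, dataset_cls, default_ids):
    """Record named dataset IDs on the task, then fetch a local copy of each.

    task        -- object with connect(dict, name=...), e.g. a clearml Task
    dataset_cls -- object with get(dataset_id=...), e.g. clearml.Dataset
    default_ids -- mapping such as {"train": "<id>", "validate": "<id>"}
    """
    ids = dict(default_ids)
    # connect() may hand back a wrapped object; prefer it if one is returned.
    connected = task.connect(ids, name="datasets")
    if connected is not None:
        ids = connected
    return {
        name: dataset_cls.get(dataset_id=ds_id).get_local_copy()
        for name, ds_id in ids.items()
    }


class FakeTask:
    """Stand-in for clearml.Task, only here to make the sketch runnable."""

    def __init__(self):
        self.sections = {}

    def connect(self, mutable, name=None):
        self.sections[name] = mutable
        return mutable


class FakeDataset:
    """Stand-in for clearml.Dataset; returns a made-up local path."""

    def __init__(self, dataset_id):
        self.dataset_id = dataset_id

    @classmethod
    def get(cls, dataset_id):
        return cls(dataset_id)

    def get_local_copy(self):
        return f"/tmp/datasets/{self.dataset_id}"


task = FakeTask()
paths = resolve_datasets(task, FakeDataset, {"train": "id-train", "validate": "id-val"})
print(paths)
```

With ClearML installed, the call would be along the lines of `resolve_datasets(Task.init(...), Dataset, {...})`, and the named IDs would then be expected to appear in a "datasets" section of the task's configuration.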

  				
Posted 3 years ago by AgitatedDove14

Or is this a feature of hyperdatasets and i just mixed them up.

Ohh yes, this is it. Hyper Datasets are part of the UI (i.e. there is a tab with the HyperDataset query). Dataset usage is currently listed on the Task. Make sense?

  				
Posted 3 years ago by AgitatedDove14

Hi, it makes sense to automate this part just like you automate the rest of the MLOps flow. Since you already support data versioning/lineage, data provenance (how the data relates to the experiment and serves as a model source) should be in too. Although I agree that, technically, it's probably not possible to tell whether users actually used the indicated datasets after they call Dataset.get_local_copy().

  				
Posted 3 years ago by SubstantialElk6
