Answered
Hi I'm using ClearML Datasets. How do I tell from the ClearML UI which dataset version I'm using?

Hi I'm using clearml datasets. How do I tell from the ClearML UI which dataset version I'm using?

  
  
Posted 3 years ago

Answers 17


Sorry AgitatedDove14, I missed your reply. So this means that in the community version, when I have an experiment using ClearML and it uses the ClearML Datasets SDK, the dataset id that was used will not be reflected on the ClearML experiment in any way, making it impossible to establish data lineage/provenance (e.g. linking the data used to the experiment). This feature is however available in the Enterprise Version as HyperDatasets. Am I correct?
Code example:
from clearml import Task, Logger

task = Task.init(project_name='DETECTRON2', task_name='Default Model Architecture', task_type='training', output_uri=' ')
task.set_base_docker("quay.io/detectron2:v4 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="1xV100-4ram", exit_process=True)

dataset_id = "83cfb45cfcbb4a8293ed9f14a2c562c0"
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()

  
  
Posted 3 years ago

Sorry AgitatedDove14, can you bump me to that thread?

  
  
Posted 3 years ago

AgitatedDove14, I'm Jax, not Manoj! lol. 😅 😅

  
  
Posted 3 years ago

So we basically have two options: one is that when you call Dataset.get_local_copy(), we register it on the Task automatically; the other is more explicit, with something like:
ds = Dataset.get(...)
folder = ds.get_local_copy()
task.connect(ds, name='train')
...
ds_val = Dataset.get(...)
folder = ds_val.get_local_copy()
task.connect(ds_val, name='validate')
wdyt?

  
  
Posted 3 years ago

SubstantialElk6, do you mean the dataset task version?

  
  
Posted 3 years ago

How is this different from argparser btw?

Not different, just a dedicated section 🙂 Maybe we should do that automatically; the only "downside" is you will have to name the Dataset when getting it (so it will have an entry name in the Dataset section), wdyt?
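For illustration, a minimal sketch of the two mechanisms side by side (the task name and the --epochs hyperparameter are hypothetical): argparse arguments are picked up automatically by Task.init, while the dataset id only gets its own "datasets" section because it is connected explicitly.
import argparse
from clearml import Task, Dataset

# argparse arguments are captured automatically by Task.init (no explicit connect needed)
parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=10)  # hypothetical hyperparameter
args = parser.parse_args()

task = Task.init(project_name='DETECTRON2', task_name='lineage example')  # hypothetical task name

# the dataset id is connected explicitly, so it appears under its own "datasets" section
datasets_used = {'dataset_id': '83cfb45cfcbb4a8293ed9f14a2c562c0'}
task.connect(datasets_used, name='datasets')

dataset_path = Dataset.get(dataset_id=datasets_used['dataset_id']).get_local_copy()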

  
  
Posted 3 years ago

Or is this a feature of hyperdatasets and I just mixed them up.

Ohh yes, this is it. HyperDatasets are part of the UI (i.e. there is a tab with the HyperDataset query); Dataset usage is currently listed on the Task. Make sense?

  
  
Posted 3 years ago

feature is however available in the Enterprise Version as HyperDatasets. Am I correct?

Correct
BTW you could do:
datasets_used = dict(dataset_id="83cfb45cfcbb4a8293ed9f14a2c562c0")
task.connect(datasets_used, name='datasets')
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=datasets_used['dataset_id']).get_local_copy()
This will ensure that not only do you have a new section called "datasets" on the Task's configuration, but you will also be able to replace the dataset_id from the UI, and launch using the agent.
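As a follow-up sketch (the task id and dataset id below are placeholders): because the connected value is stored under the "datasets" section, a clone can also be repointed at a different dataset programmatically before it is enqueued for the agent, not only from the UI.
from clearml import Task

template = Task.get_task(task_id='<original_task_id>')  # placeholder task id
cloned = Task.clone(source_task=template, name='rerun with another dataset')
# the value connected via task.connect(..., name='datasets') lives at "datasets/dataset_id"
cloned.set_parameters({'datasets/dataset_id': '<another_dataset_id>'})  # placeholder dataset id
Task.enqueue(cloned, queue_name='1xV100-4ram')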

  
  
Posted 3 years ago

SubstantialElk6, can you view the dataset in the UI? Can you please provide a screenshot so I can mark it down for you?

  
  
Posted 3 years ago

So, for context: I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet. And for each experiment, I'll need to go into the code commit to see which id is being used. But on the other hand, I thought I've seen advertised use cases where the experiment can be directly linked to the dataset id being used. My brain's a bit rusty on how it was done.

  
  
Posted 3 years ago

Okay this part I missed, why would you need to add additional "catalog" when you have the UI?

Yeah, this is the part I am trying to reconcile. I don't see any UI for datasets. Or is this a feature of hyperdatasets and I just mixed them up.

  
  
Posted 3 years ago

Thanks, this would be a good alternative before the enterprise version comes in. How is this different from argparser btw?

  
  
Posted 3 years ago

I'm Jax, not Manoj! lol.

I know 😄 I just mentioned that this issue is being actively discussed

  
  
Posted 3 years ago

Hi, it makes sense to automate this part just like you automate the rest of the MLOps flow. Especially since you already support data versioning/lineage, data provenance (how it ties in with the experiment and as a model source) should be in too. Although I agree that technically it's probably not possible to tell if users actually used the indicated datasets after they do a datasets.get_copy().

  
  
Posted 3 years ago

can you bump me to that thread?

https://clearml.slack.com/archives/CTK20V944/p1630610430171200

I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet.

Okay this part I missed, why would you need to add additional "catalog" when you have the UI?

  
  
Posted 3 years ago

I meant the dataset id.

  
  
Posted 3 years ago

How do I tell from the ClearML UI which dataset version I'm using?

Hi SubstantialElk6, what exactly do you mean by "ClearML UI which datasets am I using"? Do you mean, is there auto magic adding the dataset ID when you call Dataset.get() in your code? (Because if so, I specifically remember discussing adding this feature a few days ago, which you just bumped the priority of 😉)

  
  
Posted 3 years ago