So we basically have two options. One is that when you call Dataset.get_local_copy(), we register it on the Task automatically. The other is more explicit, with something like:
ds = Dataset.get(...)
folder = ds.get_local_copy()
task.connect(ds, name='train')
...
ds_val = Dataset.get(...)
folder = ds_val.get_local_copy()
task.connect(ds_val, name='validate')
wdyt?
Hi, it makes sense to automate this part just like you automate the rest of the MLOps flow. Especially since you already support Data Versioning/Lineage, Data Provenance (how it works with the experiment and as a model source) should be in too. Although I agree that technically it's probably not possible to tell whether the users actually used the indicated datasets after they call Dataset.get_local_copy().
Sorry AgitatedDove14, I missed your reply. So this means that in the community version, when I have an experiment using clearml and it uses the clearml datasets SDK, the dataset id that was used will not be reflected on the clearml experiment in any way, making it impossible to establish Data Lineage/Provenance (e.g. linking the data used to the experiment). This feature is, however, available in the Enterprise Version as HyperDatasets. Am I correct?
Code example:
from clearml import Task, Logger
task = Task.init(project_name='DETECTRON2', task_name='Default Model Architecture', task_type='training', output_uri=' ')
task.set_base_docker("quay.io/detectron2:v4 --env GIT_SSL_NO_VERIFY=true --env TRAINS_AGENT_GIT_USER=testuser --env TRAINS_AGENT_GIT_PASS=testuser")
task.execute_remotely(queue_name="1xV100-4ram", exit_process=True)
dataset_id = "83cfb45cfcbb4a8293ed9f14a2c562c0"
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=dataset_id).get_local_copy()
This feature is however available in the Enterprise Version as HyperDatasets. Am i correct?
Correct
BTW you could do:
datasets_used = dict(dataset_id="83cfb45cfcbb4a8293ed9f14a2c562c0")
task.connect(datasets_used, name='datasets')
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=datasets_used['dataset_id']).get_local_copy()
This will ensure that not only do you have a new section called "datasets" in the Task's configuration, but you will also be able to replace the dataset_id from the UI and launch using the agent.
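A minimal sketch of the same workaround extended to several named datasets (e.g. train and validate). The helper name `connect_and_fetch`, the injected `get_dataset` callable, and the placeholder IDs are hypothetical, added only for illustration. It assumes `task.connect()` returns the connected dictionary, so values edited in the UI are the ones that get downloaded when an agent runs the Task:

```python
from typing import Callable, Dict


def connect_and_fetch(task, dataset_ids: Dict[str, str],
                      get_dataset: Callable[[str], object]) -> Dict[str, str]:
    """Record dataset IDs on the task, then download each dataset.

    `task` is anything with a ClearML-style connect(mutable, name=...) method.
    `get_dataset` maps a dataset ID to an object exposing get_local_copy().
    """
    # Use the returned dict: when an agent runs the task, its values may have
    # been overridden from the UI (assumption stated in the lead-in).
    connected = task.connect(dataset_ids, name="datasets")
    return {
        name: get_dataset(ds_id).get_local_copy()
        for name, ds_id in connected.items()
    }


def run_with_clearml() -> Dict[str, str]:
    """Wire the helper to ClearML. Needs clearml installed and a configured server."""
    from clearml import Dataset, Task

    # Hypothetical project/task names and placeholder dataset IDs.
    task = Task.init(project_name="examples", task_name="dataset lineage sketch")
    ids = {"train": "<train-dataset-id>", "validate": "<validation-dataset-id>"}
    return connect_and_fetch(task, ids, lambda ds_id: Dataset.get(dataset_id=ds_id))
```

The ClearML import is kept inside `run_with_clearml()` so the helper itself can be exercised with stand-in objects, without a server.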
Okay this part I missed, why would you need to add additional "catalog" when you have the UI?
Yeah, this is the part I am trying to reconcile. I don't see any UI for datasets. Or is this a feature of hyperdatasets and I just mixed them up?
Or is this a feature of hyperdatasets and i just mixed them up.
Ohh yes, this is it. Hyper Datasets are part of the UI (i.e. there is a tab with the HyperDataset query). Dataset usage is currently listed on the Task. Make sense?
Thanks this would be a good alternative before the enterprise version comes in. How is this different from argparser btw?
i'm Jax, not Manoj! lol.
I know 😄 I just mentioned that this issue is being actively discussed
How is this different from argparser btw?
Not different, just a dedicated section 🙂 Maybe we should do that automatically, the only "downside" is you will have to name the Dataset when getting it (so it will have an entry name in the Dataset section), wdyt ?
SubstantialElk6 , do you mean the dataset task version?
So the context I'm asking is I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet. And for each experiment, I'll need to go into the code commit to see which id is being used. But on the other hand, I thought I've seen advertised use cases where the experiment can be directly linked to the dataset id being used. The brain's a bit rusty to recall how it was done.
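As a rough alternative to the spreadsheet, here is a sketch that builds the catalogue from the Tasks themselves. It assumes the dataset IDs were connected under a "datasets" section (as with task.connect(..., name='datasets')), and that `task.get_parameters()` returns flattened keys of the form "section/name". The helper names `dataset_ids_by_task` and `catalogue_project` are hypothetical:

```python
from typing import Dict, Iterable


def dataset_ids_by_task(tasks: Iterable, section: str = "datasets") -> Dict[str, Dict[str, str]]:
    """Map each task ID to the dataset IDs stored under `section`.

    Each task needs an `id` attribute and a get_parameters() method that
    returns a flat dict keyed like "datasets/dataset_id" (assumed layout).
    """
    prefix = section + "/"
    catalogue: Dict[str, Dict[str, str]] = {}
    for task in tasks:
        params = task.get_parameters() or {}
        # Keep only the entries of the requested section, without the prefix.
        used = {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}
        if used:
            catalogue[task.id] = used
    return catalogue


def catalogue_project(project_name: str) -> Dict[str, Dict[str, str]]:
    """Query a ClearML project. Needs clearml installed and a configured server."""
    from clearml import Task

    return dataset_ids_by_task(Task.get_tasks(project_name=project_name))
```

Tasks that never connected a "datasets" section are simply left out of the result, so this only covers experiments that followed the convention.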
can you bump me to that thread?
https://clearml.slack.com/archives/CTK20V944/p1630610430171200
I realise I'll need to catalogue all the dataset ids created by ppl separately on a spreadsheet.
Okay this part I missed, why would you need to add additional "catalog" when you have the UI?
AgitatedDove14 , i'm Jax, not Manoj! lol. 😅 😅
Sorry AgitatedDove14 can you bump me to that thread?
How do I tell from the ClearML UI which dataset version I am using?
Hi SubstantialElk6, what exactly do you mean by "ClearML UI which datasets am I using"? Do you mean, is there an auto-magic way of adding the dataset ID to the Task when you call Dataset.get() in your code? (Because if you are, I specifically remember discussing adding this feature a few days ago, and you just bumped its priority 😉)
SubstantialElk6 , can you view the dataset in the UI? Can you please provide a screenshot so I can mark it down for you