Hi @<1545216070686609408:profile|EnthusiasticCow4> ! I have an idea.
The flow would be like this: you create a dataset whose parent is the previously created dataset. The version will auto-bump. Then you sync this dataset with the folder. Note that sync returns the number of added/modified/removed files. If all of these are 0, you call Dataset.delete on this dataset and break/continue; otherwise you upload and finalize the dataset.
Something like:
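from clearml import Dataset

def sync_new_version(folder):
    parent = Dataset.get(dataset_name="[LTV] Dataset")
    # the child inherits the parent's files; the project is inferred from the parent
    dataset = Dataset.create(dataset_name="[LTV] Dataset", parent_datasets=[parent.id])
    synced = dataset.sync_folder(local_path=folder)  # counts of changed files
    if not any(synced):
        Dataset.delete(dataset_id=dataset.id)  # nothing changed: drop the empty version
        return
    # add tags here if needed, then upload
    dataset.upload()
    dataset.finalize()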
Interesting approach. I'll give that a try. Thanks for the reply!
Hi again @<1523701435869433856:profile|SmugDolphin23> ,
The approach you suggested seems to be working, albeit with one issue. It correctly identifies the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset:
Code:
if self.task:
    # get the parent dataset from the project
    parent = self.clearml_dataset = Dataset.get(
        dataset_name="[LTV] Dataset",
        dataset_project="[LTV] Lifetime Value Model",
    )
    # generate the local dataset
    dataset = Dataset.create(
        dataset_name=f"[LTV] Dataset",
        parent_datasets=[parent],
        dataset_project="[LTV] Lifetime Value Model",
    )
    # check to see what local files are different from the remote
    synced = self.clearml_dataset.sync_folder(
        local_path=save_path, verbose=True
    )
    # if there aren't any differences, skip adding the data to the remote and link the parent
    if not any(synced):
        Dataset.delete(dataset.id)
        self.task.connect(parent)
        logger.info(f"Data already exists on ClearML remote. Skipping upload.")
    # if there are differences, upload the data to the remote and link the new dataset
    else:
        if self.tags is not None:
            self.clearml_dataset.add_tags(self.tags)
        self.task.connect(self.clearml_dataset)
        self.clearml_dataset.upload(
            show_progress=True,
            verbose=True,
        )
        self.clearml_dataset.finalize()
        logger.info(f"Saved the data to ClearML remote.")
Error generated:
self.clearml_dataset.finalize()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/natephysics/anaconda3/envs/LTV/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 780, in finalize
    raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid
It might be related to this output in the terminal:
Syncing folder data/processed : 0 files removed, 1 added / modified
Compressing /LTV/data/processed/2022-11.csv
Uploading dataset changes (1 files compressed to 4.2 MiB) to
2023-04-18 14:40:01,068 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)
2023-04-18 14:40:02,317 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}, {'key': 'data_001', 'type': 'custom', 'uri': '
', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}], force=True)
File compression and upload completed: total size 4.2 MiB, 1 chunk(s) stored (average size 4.2 MiB)
2023-04-18 14:40:02,980 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=f047337f5b84428e9c53c0bf67915c46, artifacts=[{'key': 'data_001', 'type': 'custom', 'uri': '
', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}, {'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34673, 'hash': '####', 'timestamp': 1681821602, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)
I'm still trying to wrap my head around the intuition for "finalize". I assume finalize would lock in the dataset for that task/application of the data. I'm not sure why in this case I would be unable to finalize the dataset.
On a related question, am I correctly "linking" the dataset to the task? If I look at the PROJECTS / task in question, under the info tab I do get a keyword for datasets with the ID, but there doesn't appear to be any link from the project to the dataset. I can, of course, go to the dataset and search for that ID to find it. I'm still learning the interface and I'm curious if there's a "link" between the task and the data? Is there any way to look at all the tasks that used that version of the dataset?
@<1545216070686609408:profile|EnthusiasticCow4>
This:
parent = self.clearml_dataset = Dataset.get(
    dataset_name="[LTV] Dataset",
    dataset_project="[LTV] Lifetime Value Model",
)
# generate the local dataset
dataset = Dataset.create(
    dataset_name=f"[LTV] Dataset",
    parent_datasets=[parent],
    dataset_project="[LTV] Lifetime Value Model",
)
should likely be this:
parent = Dataset.get(
    dataset_name="[LTV] Dataset",
    dataset_project="[LTV] Lifetime Value Model",
)
# generate the local dataset
dataset = self.clearml_dataset = Dataset.create(
    dataset_name=f"[LTV] Dataset",
    parent_datasets=[parent],
    dataset_project="[LTV] Lifetime Value Model",
)
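Your new dataset would then be the child, which you will be able to upload and finalize. In your version, self.clearml_dataset pointed at the parent, which is already completed, so sync_folder, upload and finalize were all called on a finalized dataset.
In the no-changes branch, you should likely point self.clearml_dataset back at the parent: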
if not any(synced):
    Dataset.delete(dataset.id)
    self.task.connect(parent)
    logger.info(f"Data already exists on ClearML remote. Skipping upload.")
    self.clearml_dataset = parent  # notice this line
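Putting the two fixes together, the whole flow might look roughly like the sketch below. It is only a sketch under some assumptions: publish_if_changed is a hypothetical helper name, and the dataset class is passed in as an argument (clearml's Dataset in real use) so the logic can be tried without a server. It also assumes sync_folder returns a tuple of file counts, as mentioned earlier in the thread.

```python
def publish_if_changed(dataset_cls, name, project, folder, tags=None):
    """Create a child version of the latest dataset; keep it only if `folder` changed.

    `dataset_cls` is assumed to behave like `clearml.Dataset`.
    Returns the dataset a task should use: the new child, or the unchanged parent.
    """
    # Latest existing version. It is already finalized, so never upload to it.
    parent = dataset_cls.get(dataset_name=name, dataset_project=project)

    # New, still-editable version that inherits the parent's files.
    child = dataset_cls.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent.id],
    )

    # Sync the local folder into the child (not the parent).
    counts = child.sync_folder(local_path=folder, verbose=True)

    if not any(counts):
        # Nothing added, modified, or removed: discard the empty version.
        dataset_cls.delete(dataset_id=child.id)
        return parent

    if tags:
        child.add_tags(tags)
    child.upload(show_progress=True, verbose=True)
    child.finalize()
    return child
```

In your class this would be called with the real Dataset class, the "[LTV] Dataset" name, the "[LTV] Lifetime Value Model" project, save_path and self.tags, and the return value assigned to self.clearml_dataset before connecting it to the task.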
"Is there any way to look at all the tasks that used that version of the dataset?"
Not easily. You could query the runtime properties of all tasks and check for datasets used.
But what I would do is tag each task that uses a certain dataset; then you should be able to query by tag.
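A rough sketch of the tagging idea follows. The "dataset:<id>" tag format and the three helper names are assumptions made for illustration, not a ClearML convention. The task object and task class are passed in and are assumed to behave like clearml's Task, using only its add_tags and get_tasks(project_name=..., tags=...) calls.

```python
def dataset_tag(dataset_id):
    """Build the tag that marks a task as a consumer of one dataset version."""
    # Hypothetical naming scheme; any stable, unique string would do.
    return f"dataset:{dataset_id}"


def tag_task_with_dataset(task, dataset_id):
    """Tag `task` (assumed to behave like a clearml Task) with the dataset it used."""
    tag = dataset_tag(dataset_id)
    task.add_tags([tag])
    return tag


def tasks_using_dataset(task_cls, dataset_id, project_name=None):
    """Return the tasks carrying the tag; `task_cls` is assumed to behave like clearml.Task."""
    return task_cls.get_tasks(
        project_name=project_name,
        tags=[dataset_tag(dataset_id)],
    )
```

You would call tag_task_with_dataset(self.task, self.clearml_dataset.id) right after connecting the dataset, and later pass the real Task class to tasks_using_dataset to list every task tagged with that dataset ID.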
Let me give that a try. Thanks for all the help.