Hi again @<1523701435869433856:profile|SmugDolphin23> ,
The approach you suggested seems to be working, albeit with one issue: it correctly identifies the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset.
Code:
if self.task:
    # get the parent dataset from the project
    parent = self.clearml_dataset = Dataset.get(
        dataset_name="[LTV] Dataset",
        dataset_project="[LTV] Lifetime Value Model",
    )
    # generate the local dataset
    dataset = Dataset.create(
        dataset_name="[LTV] Dataset",
        parent_datasets=[parent],
        dataset_project="[LTV] Lifetime Value Model",
    )
    # check which local files are different from the remote
    synced = self.clearml_dataset.sync_folder(
        local_path=save_path, verbose=True
    )
    # if there aren't any differences, skip uploading and link the parent
    if not any(synced):
        Dataset.delete(dataset.id)
        self.task.connect(parent)
        logger.info("Data already exists on ClearML remote. Skipping upload.")
    # if there are differences, upload the data and link the new dataset
    else:
        if self.tags is not None:
            self.clearml_dataset.add_tags(self.tags)
        self.task.connect(self.clearml_dataset)
        self.clearml_dataset.upload(
            show_progress=True,
            verbose=True,
        )
        self.clearml_dataset.finalize()
        logger.info("Saved the data to ClearML remote.")
Error generated:
self.clearml_dataset.finalize()
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/natephysics/anaconda3/envs/LTV/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 780, in finalize
raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid
It might be related to this output in the terminal:
Syncing folder data/processed : 0 files removed, 1 added / modified
Compressing /LTV/data/processed/2022-11.csv
Uploading dataset changes (1 files compressed to 4.2 MiB) to
2023-04-18 14:40:01,068 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)
2023-04-18 14:40:02,317 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}, {'key': 'data_001', 'type': 'custom', 'uri': '
', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}], force=True)
File compression and upload completed: total size 4.2 MiB, 1 chunk(s) stored (average size 4.2 MiB)
2023-04-18 14:40:02,980 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=f047337f5b84428e9c53c0bf67915c46, artifacts=[{'key': 'data_001', 'type': 'custom', 'uri': '
', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}, {'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34673, 'hash': '####', 'timestamp': 1681821602, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)
I'm still trying to wrap my head around the intuition behind "finalize". I assume finalizing locks in the dataset for that task/application of the data, but I'm not sure why I would be unable to finalize the dataset in this case.
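For what it's worth, my current guess is that everything after Dataset.create is operating on self.clearml_dataset, which still points at the already-finalized parent, rather than on the new draft dataset that Dataset.create returns, which would explain the expected=created, status=completed complaints. If that's right, I think the flow should look roughly like this (untested sketch, same names as above):

parent = Dataset.get(
    dataset_name="[LTV] Dataset",
    dataset_project="[LTV] Lifetime Value Model",
)
dataset = Dataset.create(
    dataset_name="[LTV] Dataset",
    parent_datasets=[parent],
    dataset_project="[LTV] Lifetime Value Model",
)
# sync the local folder against the new draft dataset, not the parent
synced = dataset.sync_folder(local_path=save_path, verbose=True)
if not any(synced):
    # nothing changed: drop the empty child and keep using the parent
    Dataset.delete(dataset.id)
    self.task.connect(parent)
else:
    # changes found: upload and finalize the new child version
    dataset.upload(show_progress=True, verbose=True)
    dataset.finalize()
    self.task.connect(dataset)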
On a related question: am I correctly "linking" the dataset to the task? If I look at the task in question under PROJECTS, the INFO tab does show a datasets keyword with the dataset ID, but there doesn't appear to be any link from the project to the dataset. I can, of course, go to the Datasets section and search for that ID to find it. I'm still learning the interface, so I'm curious: is there a "link" between the task and the data? Is there any way to see all the tasks that used a given version of the dataset?
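In case it's relevant, the only explicit link I'm making is the task.connect(...) call above. As a fallback I was considering recording the dataset ID on the task myself and reading it back later, along these lines (the parameter name General/dataset_id is just something I made up):

from clearml import Dataset, Task

# store the new dataset's ID on the current task as an explicit parameter
task = Task.current_task()
task.set_parameter("General/dataset_id", dataset.id)

# later, from any task, recover the exact dataset version that was used
other_task = Task.get_task(task_id="<task-id>")
ds = Dataset.get(dataset_id=other_task.get_parameter("General/dataset_id"))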