Hi again @<1523701435869433856:profile|SmugDolphin23> ,
The approach you suggested seems to be working, albeit with one issue: it correctly identifies the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset.
Code:
if self.task:
    # get the parent dataset from the project
    parent = self.clearml_dataset = Dataset.get(
        dataset_name="[LTV] Dataset",
        dataset_project="[LTV] Lifetime Value Model",
    )
    # generate the local dataset
    dataset = Dataset.create(
        dataset_name="[LTV] Dataset",
        parent_datasets=[parent],
        dataset_project="[LTV] Lifetime Value Model",
    )
    # check which local files are different from the remote
    synced = self.clearml_dataset.sync_folder(
        local_path=save_path, verbose=True
    )
    # if there aren't any differences, skip uploading and link the parent
    if not any(synced):
        Dataset.delete(dataset.id)
        self.task.connect(parent)
        logger.info("Data already exists on ClearML remote. Skipping upload.")
    # if there are differences, upload the data and link the new dataset
    else:
        if self.tags is not None:
            self.clearml_dataset.add_tags(self.tags)
        self.task.connect(self.clearml_dataset)
        self.clearml_dataset.upload(
            show_progress=True,
            verbose=True,
        )
        self.clearml_dataset.finalize()
        logger.info("Saved the data to ClearML remote.")
Error generated:
self.clearml_dataset.finalize()
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/natephysics/anaconda3/envs/LTV/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 780, in finalize
raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid
It might be related to this output in the terminal:
Syncing folder data/processed : 0 files removed, 1 added / modified
Compressing /LTV/data/processed/2022-11.csv
Uploading dataset changes (1 files compressed to 4.2 MiB) to
2023-04-18 14:40:01,068 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)
2023-04-18 14:40:02,317 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}, {'key': 'data_001', 'type': 'custom', 'uri': '
', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}], force=True)
File compression and upload completed: total size 4.2 MiB, 1 chunk(s) stored (average size 4.2 MiB)
2023-04-18 14:40:02,980 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=f047337f5b84428e9c53c0bf67915c46, artifacts=[{'key': 'data_001', 'type': 'custom', 'uri': '
', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}, {'key': 'state', 'type': 'dict', 'uri': '
', 'content_size': 34673, 'hash': '####', 'timestamp': 1681821602, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)
I'm still trying to wrap my head around the intuition behind "finalize". I assume finalizing locks in the dataset for that task/application of the data, but I'm not sure why I would be unable to finalize the dataset in this case.
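For what it's worth, my current guess is that everything after Dataset.create is operating on self.clearml_dataset, which still points at the already-finalized parent, rather than on the new draft dataset that Dataset.create returns, which would explain the expected=created, status=completed complaints. If that's right, I think the flow should look roughly like this (untested sketch, same names as above):

parent = Dataset.get(
    dataset_name="[LTV] Dataset",
    dataset_project="[LTV] Lifetime Value Model",
)
dataset = Dataset.create(
    dataset_name="[LTV] Dataset",
    parent_datasets=[parent],
    dataset_project="[LTV] Lifetime Value Model",
)
# sync the local folder against the new draft dataset, not the parent
synced = dataset.sync_folder(local_path=save_path, verbose=True)
if not any(synced):
    # nothing changed: drop the empty child and keep using the parent
    Dataset.delete(dataset.id)
    self.task.connect(parent)
else:
    # changes found: upload and finalize the new child version
    dataset.upload(show_progress=True, verbose=True)
    dataset.finalize()
    self.task.connect(dataset)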
On a related question: am I correctly "linking" the dataset to the task? If I look at the task in question under PROJECTS, the INFO tab does show a datasets keyword with the dataset ID, but there doesn't appear to be any link from the project to the dataset. I can, of course, go to the Datasets section and search for that ID to find it. I'm still learning the interface, so I'm curious: is there a "link" between the task and the data? Is there any way to see all the tasks that used a given version of the dataset?
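In case it's relevant, the only explicit link I'm making is the task.connect(...) call above. As a fallback I was considering recording the dataset ID on the task myself and reading it back later, along these lines (the parameter name General/dataset_id is just something I made up):

from clearml import Dataset, Task

# store the new dataset's ID on the current task as an explicit parameter
task = Task.current_task()
task.set_parameter("General/dataset_id", dataset.id)

# later, from any task, recover the exact dataset version that was used
other_task = Task.get_task(task_id="<task-id>")
ds = Dataset.get(dataset_id=other_task.get_parameter("General/dataset_id"))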