Hi all,

I've been experimenting with automating the data sync. This is related to this thread:

Here is my expectation, so correct me if I'm wrong:

I have a process for generating a dataset that's external to ClearML. For traceability, I want to register that dataset with ClearML. The process generates files, and many of them won't change. However, the dataset will be expanded over time, so newer files will be added and, on occasion, old files might be updated. I would expect ClearML to compare the hashes and file names and only record the differences between the prior data and the updated dataset (I'm thinking of how DVC tracks the changes to a dataset).
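
To make the expectation above concrete, here is a minimal sketch of a hash-and-file-name comparison in plain Python. It is not how ClearML or DVC work internally; `build_manifest` and `diff_manifests` are hypothetical helper names, and the file names and hash values in the example are made up for illustration.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_manifests(old, new):
    """Return sorted lists of added, modified and removed file names."""
    added = sorted(name for name in new if name not in old)
    modified = sorted(name for name in new if name in old and new[name] != old[name])
    removed = sorted(name for name in old if name not in new)
    return added, modified, removed


# Illustrative manifests: one file unchanged, one updated, one new.
old = {"2022-09.csv": "aaa", "2022-10.csv": "bbb"}
new = {"2022-09.csv": "aaa", "2022-10.csv": "ccc", "2022-11.csv": "ddd"}
print(diff_manifests(old, new))  # (['2022-11.csv'], ['2022-10.csv'], [])
```

Only the added and modified files would need to be stored for the new version; everything else can be referenced from the prior one.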

You can see in the link above that I tried to experiment with this using Dataset.create(), but it seems that whenever changes were made to the dataset it would create a new version (e.g. 1.0.1) and all of the data would be re-registered to the new dataset version, not just the differences between the datasets. This means that even if the dataset was completely unchanged, ClearML would produce a new version of the data.

Today I've been experimenting with Dataset.get(). You can see in the following block of code:

        if self.task:
            self.clearml_dataset = Dataset.get(
                dataset_name="[LTV] Dataset", auto_create=True
            )
            self.clearml_dataset.sync_folder(local_path=save_path, verbose=True)
            if self.tags is not None:
                self.clearml_dataset.add_tags(self.tags)
            self.task.connect(self.clearml_dataset)
            self.clearml_dataset.upload(
                show_progress=True,
                verbose=True,
            )
            # NOTE: is_final is a method on ClearML's Dataset; compared without
            # calling it, this condition is never true, so finalize() is skipped
            if self.clearml_dataset.is_final is False:
                self.clearml_dataset.finalize()
            logger.info("Saved the data to ClearML.")

I've had some successes and some failures. The success: if I run this code for the first time (no dataset exists yet), it creates the version as expected. If I run the exact same code again, it doesn't attempt to create a new version of the dataset; instead, it correctly verifies that all files exist on the remote ClearML server and doesn't attempt to upload them. However, when I add a file that wasn't in the original dataset, it correctly identifies that the file isn't in the dataset and uploads it, but it keeps the same version (i.e. 1.0.0). How would I get it to increment the version automatically when new data is discovered and synced?

  				
Posted 2 years ago by EnthusiasticCow4

"Is there any way to look at all the tasks that used that version of the dataset?"

Not easily. You could query the runtime properties of all tasks and check which datasets they used.
But what I would do is tag each task that uses a certain dataset; you should then be able to query by tags.
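
A minimal sketch of the tagging idea, assuming a `dataset:<id>` tag convention (the convention and the helper names are hypothetical, not something ClearML prescribes). It relies on `Task.add_tags` and `Task.get_tasks(tags=...)` from the ClearML SDK; the import is done lazily so the tag helper can be used without a configured server.

```python
def dataset_tag(dataset_id):
    """Build the tag that marks tasks which consumed a given dataset version."""
    return "dataset:{}".format(dataset_id)


def tag_task_with_dataset(task, dataset_id):
    """Attach the dataset tag to a task (any object exposing add_tags works)."""
    task.add_tags([dataset_tag(dataset_id)])


def find_tasks_using_dataset(dataset_id, project_name=None):
    """Query ClearML for all tasks carrying the tag of this dataset version."""
    # Imported here so the helpers above stay usable without clearml installed.
    from clearml import Task

    return Task.get_tasks(project_name=project_name, tags=[dataset_tag(dataset_id)])
```

In the code from this thread, that would mean calling `tag_task_with_dataset(self.task, dataset.id)` right after linking the dataset, and `find_tasks_using_dataset(dataset.id)` later to list the tasks.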

  				
Posted 2 years ago by SmugDolphin23

@EnthusiasticCow4
This:

            parent = self.clearml_dataset = Dataset.get(
                dataset_name="[LTV] Dataset",
                dataset_project="[LTV] Lifetime Value Model",
            )
            # generate the local dataset
            dataset = Dataset.create(
                dataset_name=f"[LTV] Dataset",
                parent_datasets=[parent],
                dataset_project="[LTV] Lifetime Value Model",
            )

should likely be this:

            parent = Dataset.get(
                dataset_name="[LTV] Dataset",
                dataset_project="[LTV] Lifetime Value Model",
            )
            # generate the local dataset
            dataset = self.clearml_dataset = Dataset.create(
                dataset_name=f"[LTV] Dataset",
                parent_datasets=[parent],
                dataset_project="[LTV] Lifetime Value Model",
            )

Your new dataset would be the child, which you can then upload and finalize.
Here, when nothing was synced, you should likely revert to the parent:

            if not any(synced):
                Dataset.delete(dataset.id)
                self.task.connect(parent)
                logger.info(f"Data already exists on ClearML remote. Skipping upload.")
                self.clearml_dataset = parent  # notice this line
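
Putting the two corrections together, the whole flow could look roughly like the sketch below. `sync_as_new_version` is a hypothetical helper name, and passing the dataset class in as `dataset_cls` is only an assumption made so the function can be exercised without a server; in real use you would pass `clearml.Dataset`. The calls themselves (`get`, `create`, `sync_folder`, `delete`, `upload`, `finalize`) are the ones already used in this thread.

```python
def sync_as_new_version(dataset_cls, name, project, folder):
    """Sync `folder` into a child version; return the dataset to link to the task."""
    # The latest existing version becomes the parent.
    parent = dataset_cls.get(dataset_name=name, dataset_project=project)
    # The child is a new version on top of the parent.
    child = dataset_cls.create(
        dataset_name=name, dataset_project=project, parent_datasets=[parent]
    )
    # Sync the CHILD, not the already finalized parent.
    changes = child.sync_folder(local_path=folder, verbose=True)
    if not any(changes):
        # Nothing was added, modified or removed: drop the empty child.
        dataset_cls.delete(child.id)
        return parent
    child.upload(show_progress=True, verbose=True)
    child.finalize()
    return child


# Usage (requires a configured ClearML server):
# from clearml import Dataset
# self.clearml_dataset = sync_as_new_version(
#     Dataset, "[LTV] Dataset", "[LTV] Lifetime Value Model", save_path
# )
```

The returned object is either the unchanged parent or the freshly finalized child, so the caller can connect it to the task in both cases.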

  				
Posted 2 years ago by SmugDolphin23

Interesting approach. I'll give that a try. Thanks for the reply!

  				
Posted 2 years ago by EnthusiasticCow4

Hi again @SmugDolphin23,

The approach you suggested seems to be working, albeit with one issue. It does correctly identify the different versions of the dataset when new data is added, but I get an error when I try to finalize the dataset:

Code:

        if self.task:
            # get the parent dataset from the project
            parent = self.clearml_dataset = Dataset.get(
                dataset_name="[LTV] Dataset",
                dataset_project="[LTV] Lifetime Value Model",
            )
            # generate the local dataset
            dataset = Dataset.create(
                dataset_name=f"[LTV] Dataset",
                parent_datasets=[parent],
                dataset_project="[LTV] Lifetime Value Model",
            )
            # check to see what local files are different from the remote
            synced = self.clearml_dataset.sync_folder(
                local_path=save_path, verbose=True
            )
            # if there aren't any differences, skip adding the data to the remote and link the parent
            if not any(synced):
                Dataset.delete(dataset.id)
                self.task.connect(parent)
                logger.info(f"Data already exists on ClearML remote. Skipping upload.")
            # if there are differences, upload the data to the remote and link the new dataset
            else:
                if self.tags is not None:
                    self.clearml_dataset.add_tags(self.tags)
                self.task.connect(self.clearml_dataset)
                self.clearml_dataset.upload(
                    show_progress=True,
                    verbose=True,
                )
                self.clearml_dataset.finalize()
                logger.info(f"Saved the data to ClearML remote.")

Error generated:

self.clearml_dataset.finalize()
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/natephysics/anaconda3/envs/LTV/lib/python3.9/site-packages/clearml/datasets/dataset.py", line 780, in finalize
    raise ValueError("Cannot finalize dataset, status '{}' is not valid".format(status))
ValueError: Cannot finalize dataset, status 'completed' is not valid

It might be related to this output in the terminal:

Syncing folder data/processed : 0 files removed, 1 added / modified
Compressing /LTV/data/processed/2022-11.csv

Uploading dataset changes (1 files compressed to 4.2 MiB) to



2023-04-18 14:40:01,068 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '

', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n  "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)

2023-04-18 14:40:02,317 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=####, artifacts=[{'key': 'state', 'type': 'dict', 'uri': '

', 'content_size': 34672, 'hash': '####', 'timestamp': 1681821600, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n  "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}, {'key': 'data_001', 'type': 'custom', 'uri': '

', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}], force=True)
File compression and upload completed: total size 4.2 MiB, 1 chunk(s) stored (average size 4.2 MiB)

2023-04-18 14:40:02,980 - clearml.Task - ERROR - Action failed <400/110: tasks.add_or_update_artifacts/v2.10 (Invalid task status: expected=created, status=completed)> (task=f047337f5b84428e9c53c0bf67915c46, artifacts=[{'key': 'data_001', 'type': 'custom', 'uri': '

', 'content_size': 4403531, 'hash': '####', 'timestamp': 1681821601, 'type_data': {'preview': '2022-11.csv - 4.4 MB\n', 'content_type': 'application/zip'}}, {'key': 'state', 'type': 'dict', 'uri': '

', 'content_size': 34673, 'hash': '####', 'timestamp': 1681821602, 'type_data': {'preview': 'Dataset state\nFiles added/modified: 120 - total size 497.24 MB\nCurrent dependency graph: {\n  "f047337f5b84428e9c53c0bf67915c46": []\n}\n', 'content_type': 'application/json'}, 'display_data': [('files added', '120'), ('files modified', '0'), ('files removed', '0')]}], force=True)

I'm still trying to wrap my head around the intuition for "finalize". I assume finalize would lock in the dataset for that task/application of the data. I'm not sure why in this case I would be unable to finalize the dataset.
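
Going only by the traceback above, finalize() rejects a dataset whose status is already 'completed', which is what a previously finalized dataset would report. A small guard along these lines might avoid the exception; it is a sketch that assumes the object exposes `is_final()` and `finalize()` the way ClearML's Dataset does, and `finalize_if_open` is a hypothetical helper name.

```python
def finalize_if_open(dataset):
    """Finalize `dataset` unless it is already final; return True if finalized now."""
    if dataset.is_final():
        # Already closed (e.g. status 'completed'): calling finalize() would raise.
        return False
    dataset.finalize()
    return True
```

A guard like this only hides the symptom, though: if it returns False right after a sync, the changes were probably applied to a dataset that was already finalized rather than to a new version.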

On a related note, am I correctly "linking" the dataset to the task? If I look at the PROJECTS / task in question, under the info tab I do get a keyword for datasets with the ID, but there doesn't appear to be any link from the project to the dataset. I can, of course, go to the dataset and search for that ID to find it. I'm still learning the interface and I'm curious whether there's a "link" between the task and the data. Is there any way to look at all the tasks that used that version of the dataset?

  				
Posted 2 years ago by EnthusiasticCow4

Hi @EnthusiasticCow4! I have an idea.
The flow would be like this: you create a dataset, the parent of that dataset would be the previously created dataset. The version will auto-bump. Then, you sync this dataset with the folder. Note that sync will return the number of added/modified/removed files. If all of these are 0, then you use Dataset.delete on this dataset and break/continue, else you upload and finalize the dataset.

Something like:

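parent = Dataset.get(dataset_name="[LTV] Dataset")
# the child of the previous dataset gets an auto-bumped version
dataset = Dataset.create(..., parent_datasets=[parent.id])
# sync_folder returns the counts of added/modified/removed files
synced = dataset.sync_folder(local_path=folder)
if not any(synced):
    # nothing changed: delete the empty child version and stop here
    Dataset.delete(dataset.id)
    return
# connect the dataset to the task and add tags as needed, then:
dataset.upload()
dataset.finalize()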

  				
Posted 2 years ago by SmugDolphin23

Let me give that a try. Thanks for all the help.

  				
Posted 2 years ago by EnthusiasticCow4