Answered

In my current project I generate the data from an SQL query. Is the only way to register the dataset with ClearML to write the files to disk first or is there another method?

This leads into the second issue I have, which is what happens when I store the SQL query results on disk. Instead of having to store the entire dataset each time more data is added, I'm breaking the files up into a series of CSVs (one month's worth of data per CSV). This way the historical files should only change on rare occasions (when something changes in the DB, which is rare after the first month). However, this isn't working for me. I first group the data by the time segment:

# group the data by the time segment (monthly buckets on the transaction date)
grouped_data = self.data.groupby(
    pd.Grouper(key="transaction_date", freq=time_segment)
)

Then I write out the CSVs to an empty directory.

# save each group to a CSV file named after the start of its month
for name, group in grouped_data:
    # `name` is the group's Timestamp, formatted directly in the f-string.
    # Pass the encoding to to_csv rather than calling .encode() on its
    # return value (to_csv returns None when given a file path).
    group.to_csv(
        f"{save_path}/{name:%Y-%m-%d}.csv",
        index=False,
        encoding="utf-8",
    )

But for some reason when I do this the hash of the file changes. I checked the file contents with a file-compare tool and they are identical, but the hash is different. As far as I know the hashing ignores file metadata, so I'm not sure why the hash would be different. Any ideas? Or is there just a better approach I should be taking?
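
(For reference, a minimal sketch of how the per-file content hashes can be compared between two export runs, which is the kind of check described above; old_dir and new_dir are placeholder paths for two separate output directories:)

import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    # hash the raw bytes of the file, ignoring any filesystem metadata
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

old_dir, new_dir = Path("run_old"), Path("run_new")  # placeholder directories
for old_file in sorted(old_dir.glob("*.csv")):
    new_file = new_dir / old_file.name
    if sha256_of(old_file) != sha256_of(new_file):
        print(f"content differs: {old_file.name}")

If every pair matches, the CSV contents themselves can be ruled out as the source of the changed hash.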

  
  
Posted one year ago

Answers 7


The verbose output:

Generating SHA2 hash for 123 files
100%|██████████████████████████████████████████████████████████| 123/123 [00:00<00:00, 310.04it/s]
Hash generation completed
Add 2022-12.csv
Add 2020-10.csv
Add 2021-06.csv
Add 2022-02.csv
Add 2021-04.csv
Add 2013-03.csv
Add 2021-02.csv
Add 2015-02.csv
Add 2016-07.csv
Add 2022-05.csv
Add 2021-10.csv
Add 2018-04.csv
Add 2019-06.csv
Add 2017-11.csv
Add 2016-01.csv
Add 2013-06.csv
Add 2018-08.csv
Add 2020-05.csv
Add 2020-03.csv
Add 2017-08.csv
Add 2020-01.csv
Add 2020-11.csv
Add 2019-02.csv
Add 2021-09.csv
Add 2014-03.csv
Add 2013-01.csv
Add 2016-09.csv
Add 2020-07.csv
Add 2020-12.csv
Add 2019-10.csv
Add 2013-05.csv
Add 2017-01.csv
Add 2015-05.csv
Add 2018-07.csv
Add 2015-04.csv
Add 2020-09.csv
Add 2015-12.csv
Add 2022-07.csv
Add 2021-12.csv
Add 2020-08.csv
Add 2016-06.csv
Add 2018-01.csv
Add 2015-08.csv
Add 2017-10.csv
Add 2014-11.csv
Add 2014-01.csv
Add 2016-05.csv
Add 2018-12.csv
Add 2022-01.csv
Add 2023-02.csv
Add 2016-12.csv
Add 2018-09.csv
Add 2018-05.csv
Add 2015-07.csv
Add 2012-12.csv
Add 2014-08.csv
Add 2017-12.csv
Add 2014-12.csv
Add 2022-06.csv
Add 2014-02.csv
Add 2021-07.csv
Add 2022-09.csv
Add 2014-06.csv
Add 2018-06.csv
Add 2019-11.csv
Add 2021-08.csv
Add 2016-11.csv
Add 2017-04.csv
Add 2018-02.csv
Add 2021-05.csv
Add 2017-06.csv
Add 2019-05.csv
Add 2015-10.csv
Add 2013-04.csv
Add 2022-11.csv
Add 2013-08.csv
Add 2014-05.csv
Add 2016-04.csv
Add 2021-03.csv
Add 2013-09.csv
Add 2018-03.csv
Add 2019-03.csv
Add 2015-11.csv
Add 2019-07.csv
Add 2021-01.csv
Add 2016-03.csv
Add 2019-04.csv
Add 2020-04.csv
Add 2020-06.csv
Add 2015-06.csv
Add 2013-10.csv
Add 2020-02.csv
Add 2021-11.csv
Add 2014-04.csv
Add 2018-10.csv
Add 2013-07.csv
Add 2015-09.csv
Add 2022-08.csv
Add 2017-02.csv
Add 2014-07.csv
Add 2014-10.csv
Add 2019-09.csv
Add 2023-01.csv
Add 2013-12.csv
Add 2017-09.csv
Add 2022-10.csv
Add 2017-07.csv
Add 2022-03.csv
Add 2019-12.csv
Add 2016-10.csv
Add 2013-11.csv
Add 2014-09.csv
Add 2019-08.csv
Add 2015-01.csv
Add 2019-01.csv
Add 2018-11.csv
Add 2017-03.csv
Add 2022-04.csv
Add 2016-08.csv
Add 2015-03.csv
Add 2016-02.csv
Add 2013-02.csv
Add 2017-05.csv
Compressing LTV/data/processed/2022-06.csv
Compressing LTV/data/processed/2022-05.csv
Compressing LTV/data/processed/2022-07.csv
Compressing LTV/data/processed/2022-08.csv
Compressing LTV/data/processed/2022-04.csv
Compressing LTV/data/processed/2022-10.csv
Compressing LTV/data/processed/2022-09.csv
Compressing LTV/data/processed/2022-11.csv
Compressing LTV/data/processed/2022-03.csv
Compressing LTV/data/processed/2022-12.csv
Compressing LTV/data/processed/2021-10.csv
Compressing LTV/data/processed/2019-06.csv
Compressing LTV/data/processed/2019-10.csv
Compressing LTV/data/processed/2019-05.csv
Compressing LTV/data/processed/2019-07.csv
Compressing LTV/data/processed/2023-01.csv
Compressing LTV/data/processed/2021-09.csv
Compressing LTV/data/processed/2019-08.csv
Compressing LTV/data/processed/2019-11.csv
Compressing LTV/data/processed/2019-04.csv
Compressing LTV/data/processed/2018-06.csv
Compressing LTV/data/processed/2019-12.csv
Compressing LTV/data/processed/2020-02.csv
Compressing LTV/data/processed/2019-09.csv
Compressing LTV/data/processed/2021-11.csv
Compressing LTV/data/processed/2018-07.csv
Compressing LTV/data/processed/2018-10.csv
Compressing LTV/data/processed/2019-03.csv
Compressing LTV/data/processed/2018-08.csv
Compressing LTV/data/processed/2018-05.csv
Compressing LTV/data/processed/2022-02.csv
Compressing LTV/data/processed/2017-10.csv
Compressing LTV/data/processed/2017-06.csv
Compressing LTV/data/processed/2018-11.csv
Compressing LTV/data/processed/2018-12.csv
Compressing LTV/data/processed/2019-02.csv
Compressing LTV/data/processed/2018-03.csv
Compressing LTV/data/processed/2020-01.csv
Compressing LTV/data/processed/2018-09.csv
Compressing LTV/data/processed/2018-04.csv
Compressing LTV/data/processed/2021-07.csv
Compressing LTV/data/processed/2021-08.csv
Compressing LTV/data/processed/2017-07.csv
Compressing LTV/data/processed/2017-11.csv
Compressing LTV/data/processed/2017-08.csv
Compressing LTV/data/processed/2020-03.csv
Compressing LTV/data/processed/2017-05.csv
Compressing LTV/data/processed/2017-12.csv
Compressing LTV/data/processed/2018-02.csv
Compressing LTV/data/processed/2017-04.csv
Compressing LTV/data/processed/2017-09.csv
Compressing LTV/data/processed/2019-01.csv
Compressing LTV/data/processed/2016-06.csv
Compressing LTV/data/processed/2016-10.csv
Compressing LTV/data/processed/2017-03.csv
Compressing LTV/data/processed/2016-08.csv
Compressing LTV/data/processed/2018-01.csv
Compressing LTV/data/processed/2016-05.csv
Compressing LTV/data/processed/2016-07.csv
Compressing LTV/data/processed/2021-12.csv
Compressing LTV/data/processed/2016-12.csv
Compressing LTV/data/processed/2016-11.csv
Compressing LTV/data/processed/2023-02.csv
Compressing LTV/data/processed/2016-04.csv
Compressing LTV/data/processed/2017-02.csv
Compressing LTV/data/processed/2021-06.csv
Compressing LTV/data/processed/2016-03.csv
Compressing LTV/data/processed/2016-09.csv
Compressing LTV/data/processed/2015-10.csv
Compressing LTV/data/processed/2015-06.csv
Compressing LTV/data/processed/2016-02.csv
Compressing LTV/data/processed/2015-07.csv
Compressing LTV/data/processed/2015-05.csv
Compressing LTV/data/processed/2017-01.csv
Compressing LTV/data/processed/2015-12.csv
Compressing LTV/data/processed/2015-08.csv
Compressing LTV/data/processed/2015-11.csv
Compressing LTV/data/processed/2022-01.csv
Compressing LTV/data/processed/2015-04.csv
Compressing LTV/data/processed/2015-09.csv
Compressing LTV/data/processed/2016-01.csv
Compressing LTV/data/processed/2014-08.csv
Compressing LTV/data/processed/2015-03.csv
Compressing LTV/data/processed/2014-10.csv
Compressing LTV/data/processed/2014-12.csv
Compressing LTV/data/processed/2014-07.csv
Compressing LTV/data/processed/2014-06.csv
Compressing LTV/data/processed/2015-02.csv
Compressing LTV/data/processed/2020-09.csv
Compressing LTV/data/processed/2020-07.csv
Compressing LTV/data/processed/2020-08.csv
Compressing LTV/data/processed/2014-11.csv
Compressing LTV/data/processed/2014-04.csv
Compressing LTV/data/processed/2014-09.csv
Compressing LTV/data/processed/2014-05.csv
Compressing LTV/data/processed/2015-01.csv
Compressing LTV/data/processed/2021-05.csv
Compressing LTV/data/processed/2020-10.csv
Compressing LTV/data/processed/2020-04.csv
Compressing LTV/data/processed/2014-03.csv
Compressing LTV/data/processed/2014-02.csv
Compressing LTV/data/processed/2013-12.csv
Compressing LTV/data/processed/2013-10.csv
Compressing LTV/data/processed/2021-04.csv
Compressing LTV/data/processed/2020-06.csv
Compressing LTV/data/processed/2013-08.csv
Compressing LTV/data/processed/2021-03.csv
Compressing LTV/data/processed/2013-11.csv
Compressing LTV/data/processed/2013-09.csv
Compressing LTV/data/processed/2020-05.csv
Compressing LTV/data/processed/2014-01.csv
Compressing LTV/data/processed/2013-07.csv
Compressing LTV/data/processed/2013-06.csv
Compressing LTV/data/processed/2021-02.csv
Compressing LTV/data/processed/2020-11.csv
Compressing LTV/data/processed/2020-12.csv
Compressing LTV/data/processed/2021-01.csv
Compressing LTV/data/processed/2013-05.csv
Compressing LTV/data/processed/2013-04.csv
Compressing LTV/data/processed/2013-03.csv
Compressing LTV/data/processed/2013-02.csv
Compressing LTV/data/processed/2012-12.csv
Compressing LTV/data/processed/2013-01.csv
Uploading dataset changes (123 files compressed to 427.3 MiB) to 

Could it have to do with the fact that ClearML seems to add them in a different order?

  
  
Posted one year ago

Alright, I tried testing it by commenting out the code that generates new CSVs, so for successive runs the CSVs are identical. However, when I use dataset.add_files() it still generates a new version of the dataset.

# log the data to ClearML if a task is passed
if self.task:
    self.clearml_dataset = Dataset.create(dataset_name="[LTV] Dataset")
    self.clearml_dataset.add_files(path=save_path, verbose=True)
    if self.tags is not None:
        self.clearml_dataset.add_tags(self.tags)
    self.clearml_dataset.upload(
        show_progress=True,
        verbose=True,
    )
    self.task.connect(self.clearml_dataset)
    self.clearml_dataset.finalize()
    logger.info("Saved the data to ClearML.")
  
  
Posted one year ago

I have manually verified with hashlib.sha256() that the line-by-line content of the CSV files is identical. Why would it be that the file content is the same and the files are generated by the same process (literally just rerunning the same code twice), yet ClearML treats them differently?

  
  
Posted one year ago

@<1545216070686609408:profile|EnthusiasticCow4>, I think add_files always generates a new version. I mean, you add files to your dataset, so the version has changed. Does that make sense?

  
  
Posted one year ago

Thanks for the reply @<1523701070390366208:profile|CostlyOstrich36> !

It says in the documentation that:
Add a folder into the current dataset. calculate file hash, and compare against parent, mark files to be uploaded

It seems to recognize the dataset as another version of the data but doesn't seem to be validating the hashes on a per-file basis. Also, if you look at the photo, it seems like some of the data does get recognized as the same as the prior data. It looks like it's behaving as intended, but I'm happy to be wrong.

That said, I'm open to suggestions for a better approach; update_changed_files doesn't seem to quite do it either, because you need to add the directory first.
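
(For reference, a minimal sketch of the incremental flow discussed here, assuming the installed ClearML version supports Dataset.get and the parent_datasets argument to Dataset.create; the project name "LTV" and the reuse of save_path from the earlier snippet are placeholders:)

from clearml import Dataset

save_path = "LTV/data/processed"  # same output directory as in the snippets above

# fetch the latest finalized version of the dataset to use as the parent
parent = Dataset.get(dataset_project="LTV", dataset_name="[LTV] Dataset")

# create a child version; files whose hash matches the parent should be
# de-duplicated against it rather than uploaded again
child = Dataset.create(
    dataset_project="LTV",
    dataset_name="[LTV] Dataset",
    parent_datasets=[parent],
)
child.add_files(path=save_path, verbose=True)
child.upload(show_progress=True)
child.finalize()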

  
  
Posted one year ago

This is odd: the ordering of the files is different, and some appear to be missing from the preview. But as far as I can tell the files aren't different. What am I missing here?
image

  
  
Posted one year ago

The original file sizes are the same but the compressed sizes seem to be different.
image
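
(If it helps to dig into why the compressed sizes differ while the raw sizes match, here is a minimal sketch using Python's zipfile module to list what each archive stores per entry; the two .zip paths are placeholders for the artifacts downloaded from the two dataset versions. One thing worth checking is the per-entry modification timestamp the zip format records, since it can change between runs even when the file contents do not:)

import zipfile

# placeholder paths for the compressed artifacts of the two dataset versions
for archive in ("dataset_v1.zip", "dataset_v2.zip"):
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            # file_size: uncompressed size, compress_size: stored size,
            # date_time: modification timestamp recorded in the archive
            print(archive, info.filename, info.file_size, info.compress_size, info.date_time)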

  
  
Posted one year ago