This is odd: the ordering of the files is different, and some appear to be missing from the preview. But as far as I can tell, the files aren't different. What am I missing here?
The original file sizes are the same but the compressed sizes seem to be different.
Thanks for the reply @<1523701070390366208:profile|CostlyOstrich36> !
It says in the documentation that:
Add a folder into the current dataset. calculate file hash, and compare against parent, mark files to be uploaded
It seems to recognize the dataset as another version of the data but doesn't seem to be validating the hashes on a per file basis. Also, if you look at the photo, it seems like some of the data does get recognized as the same as the prior data. It seems like it's the correct operation but I'm happy to be wrong.
But if you have a suggestion for a better approach, I'd be glad to hear it. update_changed_files doesn't seem to quite do it either, because you need to add the directory first.
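To make the documentation sentence quoted above concrete, here is a minimal sketch of what "compare against parent, mark files to be uploaded" could look like. It is not ClearML's implementation: `mark_files_to_upload` and the hash dictionaries are hypothetical names used only for illustration.

```python
from typing import Dict, List


def mark_files_to_upload(
    parent_hashes: Dict[str, str],
    current_hashes: Dict[str, str],
) -> Dict[str, List[str]]:
    """Classify files by comparing per-file hashes against a parent version.

    Both arguments map a relative file path to a content hash.
    Hypothetical helper for illustration, not part of the ClearML API.
    """
    result: Dict[str, List[str]] = {
        "added": [],
        "modified": [],
        "unchanged": [],
        "removed": [],
    }
    for path, digest in sorted(current_hashes.items()):
        if path not in parent_hashes:
            result["added"].append(path)       # new file, must be uploaded
        elif parent_hashes[path] != digest:
            result["modified"].append(path)    # content changed, must be uploaded
        else:
            result["unchanged"].append(path)   # identical hash, nothing to upload
    # Files that existed in the parent but are gone from the current folder.
    result["removed"] = sorted(set(parent_hashes) - set(current_hashes))
    return result


# Placeholder hashes, purely for demonstration.
parent = {"2013-01.csv": "aaa", "2013-02.csv": "bbb"}
current = {"2013-01.csv": "aaa", "2013-02.csv": "ccc", "2013-03.csv": "ddd"}
diff = mark_files_to_upload(parent, current)
print(diff)

# With no parent at all, every file is classified as "added".
no_parent_diff = mark_files_to_upload({}, current)
print(no_parent_diff["added"])
```

Note that with an empty parent mapping every file lands in `added`. That would be consistent with a run where every file is reported as added, but whether the dataset here is actually being created without a parent is an assumption, not something this thread confirms.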
I have manually verified that the line-by-line content of the CSV files is identical using hashlib.sha256(). The files are generated by the same process (literally just rerunning the same code twice), so why would ClearML treat them differently when the content is the same?
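For reference, a check like the one described can be done on the raw bytes of each file rather than line by line, which also catches differences in line endings or trailing newlines. The sketch below is a minimal example under assumed names: `sha256_of_file`, `hash_folder`, and the two temporary run directories are hypothetical and stand in for the real output folders.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_folder(folder: Path) -> Dict[str, str]:
    """Map each CSV's path (relative to folder) to its SHA-256 digest."""
    return {
        p.relative_to(folder).as_posix(): sha256_of_file(p)
        for p in sorted(folder.rglob("*.csv"))
    }


# Demo: two "runs" that write identical CSVs into separate folders.
with tempfile.TemporaryDirectory() as tmp:
    run_a = Path(tmp) / "run_a"
    run_b = Path(tmp) / "run_b"
    for run in (run_a, run_b):
        run.mkdir()
        (run / "2013-01.csv").write_bytes(b"id,value\n1,10\n")
        (run / "2013-02.csv").write_bytes(b"id,value\n2,20\n")

    hashes_a = hash_folder(run_a)
    hashes_b = hash_folder(run_b)
    identical = hashes_a == hashes_b
    print("identical:", identical)
```

Because the result is a dictionary keyed by relative path, the comparison does not depend on the order in which the files were listed or hashed.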
Alright, I tried testing it out by commenting out the code that generates new CSVs, so the CSVs are identical across successive runs. However, when I use dataset.add_files() it still generates a new version of the dataset.
# log the data to ClearML if a task is passed
if self.task:
    self.clearml_dataset = Dataset.create(dataset_name="[LTV] Dataset")
    self.clearml_dataset.add_files(path=save_path, verbose=True)
    if self.tags is not None:
        self.clearml_dataset.add_tags(self.tags)
    self.clearml_dataset.upload(
        show_progress=True,
        verbose=True,
    )
    self.task.connect(self.clearml_dataset)
    self.clearml_dataset.finalize()
    logger.info("Saved the data to ClearML.")
@<1545216070686609408:profile|EnthusiasticCow4> , I think add_files always generates a new version. I mean, you add files to your dataset, so the version has changed. Does that make sense?
The verbose output:
Generating SHA2 hash for 123 files
100%|██████████████████████████████████████████████████████████| 123/123 [00:00<00:00, 310.04it/s]
Hash generation completed
Add 2022-12.csv
Add 2020-10.csv
Add 2021-06.csv
Add 2022-02.csv
Add 2021-04.csv
Add 2013-03.csv
Add 2021-02.csv
Add 2015-02.csv
Add 2016-07.csv
Add 2022-05.csv
Add 2021-10.csv
Add 2018-04.csv
Add 2019-06.csv
Add 2017-11.csv
Add 2016-01.csv
Add 2013-06.csv
Add 2018-08.csv
Add 2020-05.csv
Add 2020-03.csv
Add 2017-08.csv
Add 2020-01.csv
Add 2020-11.csv
Add 2019-02.csv
Add 2021-09.csv
Add 2014-03.csv
Add 2013-01.csv
Add 2016-09.csv
Add 2020-07.csv
Add 2020-12.csv
Add 2019-10.csv
Add 2013-05.csv
Add 2017-01.csv
Add 2015-05.csv
Add 2018-07.csv
Add 2015-04.csv
Add 2020-09.csv
Add 2015-12.csv
Add 2022-07.csv
Add 2021-12.csv
Add 2020-08.csv
Add 2016-06.csv
Add 2018-01.csv
Add 2015-08.csv
Add 2017-10.csv
Add 2014-11.csv
Add 2014-01.csv
Add 2016-05.csv
Add 2018-12.csv
Add 2022-01.csv
Add 2023-02.csv
Add 2016-12.csv
Add 2018-09.csv
Add 2018-05.csv
Add 2015-07.csv
Add 2012-12.csv
Add 2014-08.csv
Add 2017-12.csv
Add 2014-12.csv
Add 2022-06.csv
Add 2014-02.csv
Add 2021-07.csv
Add 2022-09.csv
Add 2014-06.csv
Add 2018-06.csv
Add 2019-11.csv
Add 2021-08.csv
Add 2016-11.csv
Add 2017-04.csv
Add 2018-02.csv
Add 2021-05.csv
Add 2017-06.csv
Add 2019-05.csv
Add 2015-10.csv
Add 2013-04.csv
Add 2022-11.csv
Add 2013-08.csv
Add 2014-05.csv
Add 2016-04.csv
Add 2021-03.csv
Add 2013-09.csv
Add 2018-03.csv
Add 2019-03.csv
Add 2015-11.csv
Add 2019-07.csv
Add 2021-01.csv
Add 2016-03.csv
Add 2019-04.csv
Add 2020-04.csv
Add 2020-06.csv
Add 2015-06.csv
Add 2013-10.csv
Add 2020-02.csv
Add 2021-11.csv
Add 2014-04.csv
Add 2018-10.csv
Add 2013-07.csv
Add 2015-09.csv
Add 2022-08.csv
Add 2017-02.csv
Add 2014-07.csv
Add 2014-10.csv
Add 2019-09.csv
Add 2023-01.csv
Add 2013-12.csv
Add 2017-09.csv
Add 2022-10.csv
Add 2017-07.csv
Add 2022-03.csv
Add 2019-12.csv
Add 2016-10.csv
Add 2013-11.csv
Add 2014-09.csv
Add 2019-08.csv
Add 2015-01.csv
Add 2019-01.csv
Add 2018-11.csv
Add 2017-03.csv
Add 2022-04.csv
Add 2016-08.csv
Add 2015-03.csv
Add 2016-02.csv
Add 2013-02.csv
Add 2017-05.csv
Compressing LTV/data/processed/2022-06.csv
Compressing LTV/data/processed/2022-05.csv
Compressing LTV/data/processed/2022-07.csv
Compressing LTV/data/processed/2022-08.csv
Compressing LTV/data/processed/2022-04.csv
Compressing LTV/data/processed/2022-10.csv
Compressing LTV/data/processed/2022-09.csv
Compressing LTV/data/processed/2022-11.csv
Compressing LTV/data/processed/2022-03.csv
Compressing LTV/data/processed/2022-12.csv
Compressing LTV/data/processed/2021-10.csv
Compressing LTV/data/processed/2019-06.csv
Compressing LTV/data/processed/2019-10.csv
Compressing LTV/data/processed/2019-05.csv
Compressing LTV/data/processed/2019-07.csv
Compressing LTV/data/processed/2023-01.csv
Compressing LTV/data/processed/2021-09.csv
Compressing LTV/data/processed/2019-08.csv
Compressing LTV/data/processed/2019-11.csv
Compressing LTV/data/processed/2019-04.csv
Compressing LTV/data/processed/2018-06.csv
Compressing LTV/data/processed/2019-12.csv
Compressing LTV/data/processed/2020-02.csv
Compressing LTV/data/processed/2019-09.csv
Compressing LTV/data/processed/2021-11.csv
Compressing LTV/data/processed/2018-07.csv
Compressing LTV/data/processed/2018-10.csv
Compressing LTV/data/processed/2019-03.csv
Compressing LTV/data/processed/2018-08.csv
Compressing LTV/data/processed/2018-05.csv
Compressing LTV/data/processed/2022-02.csv
Compressing LTV/data/processed/2017-10.csv
Compressing LTV/data/processed/2017-06.csv
Compressing LTV/data/processed/2018-11.csv
Compressing LTV/data/processed/2018-12.csv
Compressing LTV/data/processed/2019-02.csv
Compressing LTV/data/processed/2018-03.csv
Compressing LTV/data/processed/2020-01.csv
Compressing LTV/data/processed/2018-09.csv
Compressing LTV/data/processed/2018-04.csv
Compressing LTV/data/processed/2021-07.csv
Compressing LTV/data/processed/2021-08.csv
Compressing LTV/data/processed/2017-07.csv
Compressing LTV/data/processed/2017-11.csv
Compressing LTV/data/processed/2017-08.csv
Compressing LTV/data/processed/2020-03.csv
Compressing LTV/data/processed/2017-05.csv
Compressing LTV/data/processed/2017-12.csv
Compressing LTV/data/processed/2018-02.csv
Compressing LTV/data/processed/2017-04.csv
Compressing LTV/data/processed/2017-09.csv
Compressing LTV/data/processed/2019-01.csv
Compressing LTV/data/processed/2016-06.csv
Compressing LTV/data/processed/2016-10.csv
Compressing LTV/data/processed/2017-03.csv
Compressing LTV/data/processed/2016-08.csv
Compressing LTV/data/processed/2018-01.csv
Compressing LTV/data/processed/2016-05.csv
Compressing LTV/data/processed/2016-07.csv
Compressing LTV/data/processed/2021-12.csv
Compressing LTV/data/processed/2016-12.csv
Compressing LTV/data/processed/2016-11.csv
Compressing LTV/data/processed/2023-02.csv
Compressing LTV/data/processed/2016-04.csv
Compressing LTV/data/processed/2017-02.csv
Compressing LTV/data/processed/2021-06.csv
Compressing LTV/data/processed/2016-03.csv
Compressing LTV/data/processed/2016-09.csv
Compressing LTV/data/processed/2015-10.csv
Compressing LTV/data/processed/2015-06.csv
Compressing LTV/data/processed/2016-02.csv
Compressing LTV/data/processed/2015-07.csv
Compressing LTV/data/processed/2015-05.csv
Compressing LTV/data/processed/2017-01.csv
Compressing LTV/data/processed/2015-12.csv
Compressing LTV/data/processed/2015-08.csv
Compressing LTV/data/processed/2015-11.csv
Compressing LTV/data/processed/2022-01.csv
Compressing LTV/data/processed/2015-04.csv
Compressing LTV/data/processed/2015-09.csv
Compressing LTV/data/processed/2016-01.csv
Compressing LTV/data/processed/2014-08.csv
Compressing LTV/data/processed/2015-03.csv
Compressing LTV/data/processed/2014-10.csv
Compressing LTV/data/processed/2014-12.csv
Compressing LTV/data/processed/2014-07.csv
Compressing LTV/data/processed/2014-06.csv
Compressing LTV/data/processed/2015-02.csv
Compressing LTV/data/processed/2020-09.csv
Compressing LTV/data/processed/2020-07.csv
Compressing LTV/data/processed/2020-08.csv
Compressing LTV/data/processed/2014-11.csv
Compressing LTV/data/processed/2014-04.csv
Compressing LTV/data/processed/2014-09.csv
Compressing LTV/data/processed/2014-05.csv
Compressing LTV/data/processed/2015-01.csv
Compressing LTV/data/processed/2021-05.csv
Compressing LTV/data/processed/2020-10.csv
Compressing LTV/data/processed/2020-04.csv
Compressing LTV/data/processed/2014-03.csv
Compressing LTV/data/processed/2014-02.csv
Compressing LTV/data/processed/2013-12.csv
Compressing LTV/data/processed/2013-10.csv
Compressing LTV/data/processed/2021-04.csv
Compressing LTV/data/processed/2020-06.csv
Compressing LTV/data/processed/2013-08.csv
Compressing LTV/data/processed/2021-03.csv
Compressing LTV/data/processed/2013-11.csv
Compressing LTV/data/processed/2013-09.csv
Compressing LTV/data/processed/2020-05.csv
Compressing LTV/data/processed/2014-01.csv
Compressing LTV/data/processed/2013-07.csv
Compressing LTV/data/processed/2013-06.csv
Compressing LTV/data/processed/2021-02.csv
Compressing LTV/data/processed/2020-11.csv
Compressing LTV/data/processed/2020-12.csv
Compressing LTV/data/processed/2021-01.csv
Compressing LTV/data/processed/2013-05.csv
Compressing LTV/data/processed/2013-04.csv
Compressing LTV/data/processed/2013-03.csv
Compressing LTV/data/processed/2013-02.csv
Compressing LTV/data/processed/2012-12.csv
Compressing LTV/data/processed/2013-01.csv
Uploading dataset changes (123 files compressed to 427.3 MiB) to
Could it have to do with the fact that ClearML seems to 'add' them in a different order?
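One way to reason about the ordering question: a per-file content hash does not depend on the order files are processed in, but the bytes of an archive built from those files do. The sketch below illustrates that distinction with Python's standard `zipfile` module. It makes no claim about how ClearML packages or chunks its uploads; `build_archive`, `per_file_hashes`, and the sample data are hypothetical.

```python
import hashlib
import io
import zipfile
from typing import Dict, List, Tuple

# Fixed timestamp so archive bytes depend only on content and order.
FIXED_TIME = (2023, 1, 1, 0, 0, 0)


def build_archive(files: List[Tuple[str, bytes]]) -> bytes:
    """Write files into an in-memory zip, in the order given."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files:
            info = zipfile.ZipInfo(name, date_time=FIXED_TIME)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, data)
    return buffer.getvalue()


def per_file_hashes(files: List[Tuple[str, bytes]]) -> Dict[str, str]:
    """Map each file name to the SHA-256 digest of its content."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files}


files = [
    ("2013-01.csv", b"id,value\n1,10\n"),
    ("2013-02.csv", b"id,value\n2,20\n"),
]
reordered = list(reversed(files))

# Per-file hashes are keyed by name, so order does not matter.
same_file_hashes = per_file_hashes(files) == per_file_hashes(reordered)
# The archive stores entries sequentially, so its bytes change with order.
same_archive_bytes = build_archive(files) == build_archive(reordered)
print(same_file_hashes, same_archive_bytes)
```

So a different 'Add' order by itself would not be expected to change per-file hashes, although anything computed over a whole archive could differ between runs.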