Hi guys, I'm running into an issue when creating a new ClearML dataset version. I want to create a new version of a dataset from a local folder, but I don't want all of the files in the folder to be included, so I cannot use dataset.sync_folder(). Instead I'm removing all the files with dataset.remove_files() and then adding the files I want back with dataset.add_files(). If I then list the added, modified and removed files with dataset.list_added_files() etc., they return correct results (only the files that were actually added, modified or removed).
But when uploading, even files that were not modified or added are compressed and uploaded, and the ClearML UI then lists all of the dataset files as modified. I would like to only upload and store the files that actually changed.
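For reference, a whole-folder sync would look roughly like the sketch below (local_path and folder are the same variables as in the snippet further down). Since sync_folder mirrors the entire local folder into the dataset, it would also pull in the files I don't want in the new version, which is why it doesn't fit here:

# whole-folder sync: adds/removes/updates files so the dataset folder
# matches the local folder exactly; not usable here, since that would
# include files I don't want in the dataset
dataset.sync_folder(
    local_path=os.path.join(local_path, folder),
    dataset_path=folder,
    verbose=True,
)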
To reproduce
Set up a folder with at least 2 files.
import os

import clearml

# create a new dataset version on top of the parent version(s)
dataset = clearml.Dataset.create(
    dataset_project=project,
    dataset_name=self.header['Name'],
    dataset_version=new_version,
    parent_datasets=parent_ids,
)
# remove all files inherited from the parent under this folder
dataset.remove_files(folder + "/*")
# add back only the files we want to include
dataset.add_files(os.path.join(local_path, folder), dataset_path=folder)
# these report only the files that actually changed, as expected
added = dataset.list_added_files()
removed = dataset.list_removed_files()
modified = dataset.list_modified_files()
# upload and finalize
dataset.upload(verbose=True, max_workers=1)
dataset.finalize(verbose=True)
Modify one file and repeat the steps above. added, removed and modified will have the correct values, but ClearML will upload both files and the UI will report both of them as modified. Is there another way to sync only part of a folder with a dataset?
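To make the "repeat the steps above" part concrete, the second run looks roughly like this. The Dataset.get lookup and the "1.0.1" version string are just placeholders to illustrate fetching the previous version as the parent; in my real code the parent ids come from elsewhere:

# second run: one file under local_path/folder was modified locally
parent = clearml.Dataset.get(dataset_project=project, dataset_name=self.header['Name'])
dataset = clearml.Dataset.create(
    dataset_project=project,
    dataset_name=self.header['Name'],
    dataset_version="1.0.1",          # placeholder for the next version string
    parent_datasets=[parent.id],
)
dataset.remove_files(folder + "/*")
dataset.add_files(os.path.join(local_path, folder), dataset_path=folder)

print(dataset.list_added_files())     # [] as expected
print(dataset.list_removed_files())   # [] as expected
print(dataset.list_modified_files())  # only the one modified file, as expected
dataset.upload(verbose=True, max_workers=1)  # but both files get compressed and uploaded
dataset.finalize(verbose=True)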

