I'm Having Difficulty Understanding How To Handle Modified Files On S3

Answered

I'm having difficulty understanding how to handle modified files on S3.

I have a file None; its relative_path is /raw/a.png. It is stored in ClearML Dataset "A" (added with add_external_files).
I create a dataset "B" whose parent is "A".
I resize the image and store it in None; its relative path is the same, /raw/a.png. It is stored in ClearML Dataset "B" (added with add_external_files).
When I look in the ClearML UI, it shows that 1 new file was added instead of 1 file modified.
Have I misunderstood something? I get that there are technically 2 files, but they have the same relative_path (I get it from the LinkEntry objects).

This also seems to introduce a bug:
when I call clearml.Dataset.get("id").list_files()
it now shows:
raw/a.png
raw/a.png/a.png
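
To check a larger listing for the same symptom, a small helper can flag entries whose last two path components are identical. This is a minimal sketch, not part of ClearML: `find_doubled_names` is a hypothetical name, and the input is assumed to be a list of relative path strings such as the one returned by list_files().

```python
from pathlib import PurePosixPath


def find_doubled_names(paths):
    """Return paths whose last two components are identical, e.g. 'raw/a.png/a.png'."""
    doubled = []
    for p in paths:
        parts = PurePosixPath(p).parts
        # A doubled entry ends with the same name twice: .../a.png/a.png
        if len(parts) >= 2 and parts[-1] == parts[-2]:
            doubled.append(p)
    return doubled


# Listing copied from the output reported above.
listing = ["raw/a.png", "raw/a.png/a.png"]
print(find_doubled_names(listing))  # ['raw/a.png/a.png']
```

Running this over the full listing shows whether every affected file is doubled in the same way, which helps narrow down where the extra path component comes from.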

  				
Posted one year ago by AmiableSeaturtle81


Answers 3

When I look at the LinkEntry object, the link property is correct, with no duplicates. It's the relative_path that is duplicated, and also the key name in _dataset_link_entries.

  				
Posted one year ago by AmiableSeaturtle81

OK, then, I have a workaround, but it still produces duplicate names:

1. new_dataset._dataset_link_entries = {} # clear all raw/a.png entries
2. Resize a.png and save it in another location, named a_resized.png.
3. Add back the other files I need (excluding raw/a.png) by adding them to new_dataset._dataset_link_entries.
4. Use add_external_files to include the resized file in the dataset. I'm also using dataset_path=[a list of relative paths].

What I would expect:
100 files removed (all a.png)
100 files added (all a_resized.png)

What I get:

When calling new_dataset.list_files(), it now returns these doubled file names: raw/a_resized.jpg/a_resized.jpg
What's going on here?
I have already checked all the paths; at no point do I pass doubled file names.
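
One possible explanation, which the thread does not confirm, is that dataset_path is treated as a destination folder inside the dataset, with the file name from the source URL appended to it. Under that assumption, passing a full relative path (file name included) as dataset_path would produce exactly the doubled names seen here. The sketch below only models that assumed behavior in plain Python; `stored_path` and the bucket URL are hypothetical and are not ClearML code.

```python
import posixpath


def stored_path(source_url, dataset_path):
    """Model of the ASSUMED behavior: dataset_path is a folder, the file name is appended."""
    file_name = posixpath.basename(source_url)
    return posixpath.join(dataset_path.strip("/"), file_name)


url = "s3://bucket/resized/a_resized.jpg"  # hypothetical source URL

# Passing the full relative path (file name included) doubles the name.
print(stored_path(url, "raw/a_resized.jpg"))  # raw/a_resized.jpg/a_resized.jpg

# Passing only the parent folder yields the expected entry.
print(stored_path(url, posixpath.dirname("raw/a_resized.jpg")))  # raw/a_resized.jpg
```

If that assumption holds, building the dataset_path list from the parent directory of each relative path, rather than from the full relative path, would be worth trying.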

  				
Posted one year ago by AmiableSeaturtle81

Hi @AmiableSeaturtle81, the reason for this is that each file is hashed, and this is how the feature compares files between versions. If you're looking to keep track of specific links, then HyperDatasets might be what you're looking for - None
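
To illustrate the general idea of comparing versions by hash, here is a minimal sketch. It is not ClearML's implementation: `content_hash` and `diff_versions` are hypothetical helpers, SHA-256 is an arbitrary choice, and each version is assumed to be a plain mapping from relative path to content hash.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash file contents; SHA-256 is an illustrative choice."""
    return hashlib.sha256(data).hexdigest()


def diff_versions(parent, child):
    """Compare two {relative_path: hash} mappings and classify the changes."""
    added = sorted(p for p in child if p not in parent)
    removed = sorted(p for p in parent if p not in child)
    modified = sorted(p for p in child if p in parent and child[p] != parent[p])
    return {"added": added, "removed": removed, "modified": modified}


original = content_hash(b"original image bytes")
resized = content_hash(b"resized image bytes")

# Same relative path, different hash: reported as modified.
print(diff_versions({"raw/a.png": original}, {"raw/a.png": resized}))
# {'added': [], 'removed': [], 'modified': ['raw/a.png']}

# A different key (such as the doubled name) is reported as added instead.
print(diff_versions({"raw/a.png": original},
                    {"raw/a.png": original, "raw/a.png/a.png": resized}))
# {'added': ['raw/a.png/a.png'], 'removed': [], 'modified': []}
```

In this model a file only counts as modified when its relative path matches an entry in the parent version and the hash differs, so an entry stored under a different key shows up as a new file.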

  				
Posted one year ago by CostlyOstrich36
