What is the right way of syncing a dataset?

Answered

Hi, what is the right way of syncing a dataset?
Whenever I add new archives and try to upload I get: Error: Task object can only be updated if created or in_progress
I created the dataset, synced a folder, and then updated the files in the folder. Which command should I call next? I tried upload and sync again, but I always get the message above.

  				
Posted 3 years ago by SkinnyPanda43


Answers 7

Hi SkinnyPanda43
Every "commit" is a new version, so to sync changes you need to create a new version (with the previous version as its parent) and then sync the local folder (or manually add/remove files).
Alternatively, if you do not need to keep the "current" version, you can just reset the Task and sync it again.
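
For the first option, here is a minimal sketch using the Python SDK rather than the CLI. It assumes the clearml package is installed and configured. The function name and its parameters are placeholders made up for illustration; they are not from this thread.

```python
def create_child_version(parent_id, local_folder, name, project):
    """Create a new dataset version on top of parent_id and sync a local folder into it."""
    # Imported lazily so this sketch can be loaded without clearml installed.
    from clearml import Dataset

    # A new version whose parent is the previous version.
    child = Dataset.create(
        dataset_name=name,
        dataset_project=project,
        parent_datasets=[parent_id],
    )
    # Register files added, modified, or removed relative to the parent.
    child.sync_folder(local_path=local_folder)
    # Upload the changed files, then close this version.
    child.upload()
    child.finalize()
    return child.id
```

Calling it with the ID of the previous version and the path of the updated folder should leave the previous version untouched and return the ID of the new one.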
wdyt?

  				
Posted 3 years ago by AgitatedDove14

Thank you for your response. So what is the difference between sync and add? By your description it seems to make no difference whether I added the files via sync or add, since I will have to create a new dataset either way.

  				
Posted 3 years ago by SkinnyPanda43

"By your description it seems to make no difference whether I added the files via sync or add, since I will have to create a new dataset either way."

Sync is designed to take one or more local folders and add/remove files from a dataset based on the local changes (it does that automatically, based on file existence and content).
The changes (i.e. added files) are uploaded as a delta relative to the parent version, which means we are not always uploading all files.

Add, on the other hand, means you already know which files are being added, and only those files will be added to the dataset (again, relative to the parent version). Notice that here too, if you add files with the same content as files in the parent version, they will not be uploaded twice.
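
To make the difference concrete, here is a toy sketch of the kind of comparison a sync implies: hash every file in the folder and compare the result against what the parent version recorded. This is an illustration only, not ClearML's actual implementation. The function names and the {relative_path: hash} mapping are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def hash_folder(folder):
    """Map each file's path (relative to folder) to the SHA-256 of its content."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compute_delta(parent_files, local_files):
    """Compare two {path: hash} mappings and return (added, modified, removed) path lists."""
    added = sorted(p for p in local_files if p not in parent_files)
    removed = sorted(p for p in parent_files if p not in local_files)
    modified = sorted(
        p for p in local_files
        if p in parent_files and local_files[p] != parent_files[p]
    )
    return added, modified, removed
```

In this picture, only the added and modified paths would need uploading; unchanged files stay referenced from the parent version. With add, you would supply the list of paths yourself instead of having it derived from the folder.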

Makes sense?

  				
Posted 3 years ago by AgitatedDove14

Let's see if I got how it works on the CLI.
So if I execute:
clearml-data create --name <improved_dataset> --parents <existing_dataset_id>
Where the parent dataset was updated with sync,
I just need to run:
clearml-data upload --id <created_dataset_id>
And the delta will be automatically uploaded to the new dataset?

  				
Posted 3 years ago by SkinnyPanda43

I ran some tests, and I think I got it now.
After creating the new dataset, it is necessary to run sync again, but now only the new files are uploaded.

And when running get, the files from the parent dataset will be available as links.

  				
Posted 3 years ago by SkinnyPanda43

Correct 🙂

  				
Posted 3 years ago by AgitatedDove14

"And when running get the files on the parent dataset will be available as links."

BTW: if you call get_mutable_local_copy() the files will be copied, so you can work on them directly (if you need to).
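
A minimal sketch of the two ways to fetch a dataset with the Python SDK, assuming clearml is installed and configured. In the SDK the mutable call is, as far as I know, named get_mutable_local_copy(). The wrapper function and its parameter names are hypothetical, made up for this example.

```python
def fetch_dataset(dataset_id, target_folder=None):
    """Return a local path to the dataset.

    Without target_folder, return the cached copy (treat it as read-only).
    With target_folder, copy the files there so they can be edited.
    """
    # Imported lazily so this sketch can be loaded without clearml installed.
    from clearml import Dataset

    ds = Dataset.get(dataset_id=dataset_id)
    if target_folder is None:
        # Cached copy shared between runs; do not modify it in place.
        return ds.get_local_copy()
    # Independent copy of the files inside target_folder.
    return ds.get_mutable_local_copy(target_folder=target_folder)
```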

  				
Posted 3 years ago by AgitatedDove14
