Hi Everyone, Is It Possible To Not Create A Copy Of A Dataset When Adding To Clearml? My Data Is Already In A Directory On The Clearml-Server Machine And I Do Not Want To Copy It, Just Add It To Clearml As Dataset.

Answered

Hi everyone,
is it possible to not create a copy of a dataset when adding to clearml? My data is already in a directory on the clearml-server machine and I do not want to copy it, just add it to clearml as dataset.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Votes Newest

Answers 24

My data is already in a directory on the clearml-server machine and I do not want to copy it, just add it to clearml as dataset.

So the short answer is, no, it needs to packager it (read "zip it")
The reason is clearml-data creates an Immutable copy, and just "pointing" to files located somewhere will usually break very easily.
That said, actually it will be relatively easy to add as dataset itself stores links to the files and these links could actually point to an S3 bucket (for example)
wdyt?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Yeah

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

Anyone wants to open a github issue, so we actually end up implementing it 😉 ?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

. I guess this can be built in as a feature into ClearML at some future point.

VexedCat68 you mean referencing an external link?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

How about instead of uploading the entire dataset to the clearml server, upload a text file with the location of the dataset on the machine. I would think that should do the trick.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

but this is not different from not using clearml-data,

ReassuredTiger98 just making sure we are on the same page. clearml-data immutability is fixed, the user cannot change the content of the dataset (it is actually compressed and uploaded). If you want to change it, you create a new child version

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Yeah, that would ensure its clarity.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

Sounds good. I think it is obvious that immutability has to be managed by the user then, but this is not different from not using clearml-data, so not a disadvantage in my opinion.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

And clearml-agent should pull these datasets from network storage...

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Good point

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

AgitatedDove14 SuccessfulKoala55 Could you briefly explain whether clearml supports no-copy add for datasets?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Yea, the real problem is that I have very large datasets in network storage. I am looking for a way to add the datasets on the networks storage as clearml-dataset.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Yea, the clearml-data is immutable, but not the underlying data if I just store a pointer to some location.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

I understand your problem. I think you normally can specify where you want the data to be stored in a conf file somewhere. people here can better guide you. However in my experience, it kinda uploads the data and stores it in its own format.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

Maybe a related question: Anyone every worked with datasets larger than the clearml-agent cache? Some colleague of mine has a dataset of ~ 1 tera byte...

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Yes, though the main caveat is the data is not really immutable 😞

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I ll add creating an issue to my todo list

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Sounds like a good hack, but not like a good solution 😄 But thank you anyways! 🙂

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

Yes, consider VexedCat68 txt file the Dataset "content" , this will enable ypu to safely get the list of files, and then you can use the StorageManager to download them extend this concept and have it built into the Dataset itself, i.e. allow you to add files as links and make sure it will just download them. The caveat here is that the Dataset at the end, returns a folder with the files, when you specify links, you have to also specify the target location locally (at the end you want a folder with everything there), make sense ?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thank you for answering. So your suggestion would be similar to VexedCat68 's first idea, right?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					ReassuredTiger98
				
					0
					 × 1

AgitatedDove14 Your second option is somewhat like how shortcuts work right? Storing pointers to the actual data?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

VexedCat68 make sense, we could also (if implementing this feature) add a special Tag to the dataset , so you know it contains "external" links, wdyt?

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I normally just upload the data to the ClearML server and then remove it locally from my machine but I understand that isn't what you want. A quick hack was the only thing I could come up with at the moment xd. Anyway you're welcome. Hope you find a solution.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

I understand that storing data outside ClearML won't ensure its immutability. I guess this can be built in as a feature into ClearML at some future point.

  				
Posted 
	3 years ago

					More
				  		
  Report
		
					VexedCat68
				
					0
					 × 1

Write your answer

3K Views

24 Answers

3 years ago

2 years ago