AgitatedDove14 Your second option is somewhat like how shortcuts work right? Storing pointers to the actual data?
Yea, the clearml-data is immutable, but not the underlying data if I just store a pointer to some location.
Yea, the real problem is that I have very large datasets in network storage. I am looking for a way to add the datasets on the network storage as a clearml-dataset.
I understand that storing data outside ClearML won't ensure its immutability. I guess this can be built in as a feature into ClearML at some future point.
How about instead of uploading the entire dataset to the clearml server, upload a text file with the location of the dataset on the machine. I would think that should do the trick.
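A minimal sketch of that text-file idea, under stated assumptions: the manifest format (one `relative path<TAB>absolute location` line per file) and the helper names `write_manifest` and `register_manifest` are hypothetical, not something ClearML defines. Only the small manifest is uploaded; the data itself stays where it is, so its immutability is still up to whoever owns the storage.

```python
from pathlib import Path


def write_manifest(data_root, manifest_path):
    """Write one line per file under data_root: '<relative path>\t<absolute location>'.

    The manifest format is an assumption made for this sketch.
    Returns the number of files listed.
    """
    root = Path(data_root).resolve()
    files = sorted(p for p in root.rglob("*") if p.is_file())
    lines = [f"{p.relative_to(root).as_posix()}\t{p}" for p in files]
    Path(manifest_path).write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return len(lines)


def register_manifest(manifest_path, name, project):
    """Upload only the manifest file as the dataset content (requires clearml)."""
    # Imported lazily so write_manifest works on machines without clearml.
    from clearml import Dataset

    ds = Dataset.create(dataset_name=name, dataset_project=project)
    ds.add_files(path=str(manifest_path))
    ds.upload()
    ds.finalize()
    return ds.id
```

Keep the manifest outside `data_root`, otherwise a re-run would list the manifest itself as a data file.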
And clearml-agent should pull these datasets from network storage...
I normally just upload the data to the ClearML server and then remove it locally from my machine but I understand that isn't what you want. A quick hack was the only thing I could come up with at the moment xd. Anyway you're welcome. Hope you find a solution.
AgitatedDove14 SuccessfulKoala55 Could you briefly explain whether clearml supports no-copy add for datasets?
Sounds like a good hack, but not like a good solution 😄 But thank you anyways! 🙂
Sounds good. I think it is obvious that immutability has to be managed by the user then, but this is not different from not using clearml-data, so not a disadvantage in my opinion.
VexedCat68 makes sense, we could also (if implementing this feature) add a special Tag to the dataset, so you know it contains "external" links, wdyt?
I guess this can be built in as a feature into ClearML at some future point.
VexedCat68 you mean referencing an external link?
Thank you for answering. So your suggestion would be similar to VexedCat68's first idea, right?
but this is not different from not using clearml-data,
ReassuredTiger98 just making sure we are on the same page. clearml-data immutability is fixed, the user cannot change the content of the dataset (it is actually compressed and uploaded). If you want to change it, you create a new child version.
I understand your problem. I think you can normally specify where you want the data to be stored in a conf file somewhere. People here can better guide you. However, in my experience, it kinda uploads the data and stores it in its own format.
Yes, though the main caveat is the data is not really immutable 😞
Yes, consider VexedCat68's txt file the Dataset "content". This will enable you to safely get the list of files, and then you can use the StorageManager to download them. We could also extend this concept and have it built into the Dataset itself, i.e. allow you to add files as links and make sure it will just download them. The caveat here is that the Dataset, in the end, returns a folder with the files, so when you specify links you also have to specify the target location locally (at the end you want a folder with everything there). Make sense?
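A rough sketch of the consuming side, assuming a hypothetical manifest with one `relative path<TAB>source location` line per file: each entry is fetched into its relative path under a target folder, so the result is a single folder with everything in it. The helper name `materialize` is made up for illustration; `StorageManager.get_local_copy` is only used for entries that look like remote URLs, plain network-storage paths are copied directly.

```python
import shutil
from pathlib import Path


def materialize(manifest_path, target_dir):
    """Fetch every file listed in the manifest into target_dir.

    Manifest format (an assumption of this sketch):
    '<relative path>\t<source path or URL>' per line.
    Returns the list of destination paths.
    """
    target = Path(target_dir)
    fetched = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        rel, source = line.split("\t", 1)
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        if "://" in source:
            # Remote entry (e.g. an S3 URL): let ClearML download and cache it.
            from clearml import StorageManager

            local = StorageManager.get_local_copy(remote_url=source)
            if local is None:
                raise FileNotFoundError(f"could not fetch {source}")
            shutil.copy2(local, dest)
        else:
            # Plain path on mounted network storage: copy it directly.
            shutil.copy2(source, dest)
        fetched.append(dest)
    return fetched
```

If the manifest itself was uploaded as a dataset, its folder can be obtained first with `Dataset.get(dataset_id=...).get_local_copy()` and the manifest read from there.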
My data is already in a directory on the clearml-server machine and I do not want to copy it, just add it to clearml as dataset.
So the short answer is, no, it needs to package it (read "zip it")
The reason is clearml-data creates an Immutable copy, and just "pointing" to files located somewhere will usually break very easily.
That said, it would actually be relatively easy to add, as the dataset itself stores links to the files and these links could actually point to an S3 bucket (for example)
wdyt?
I'll add creating an issue to my todo list
Anyone want to open a GitHub issue, so we actually end up implementing it 😉 ?
Maybe a related question: Has anyone ever worked with datasets larger than the clearml-agent cache? A colleague of mine has a dataset of ~1 terabyte...