Is there any way to see datasets uploaded to ClearML Data without downloading them using ClearML Data?
Hi VexedCat68
Currently, when you create datasets with clearml-data, it has to repackage your files, i.e. upload them. That said, we have received numerous requests for "registering data", and we are looking into it.
Here are the main technical hurdles we are facing, and I would love to get your perspective:
If the data is not available locally, we cannot calculate the hash of the content, which means there is no verification of consistency. We can usually get the file size, but in some scenarios even that is not possible. The assumption is that data packaged by clearml-data will stay intact (immutable); there is very little guarantee of that when just "registering links". In terms of interface, if this is "object storage", I think matching the current interface (i.e. passing a bucket/folder) would make sense. What do you think?
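To make the hashing hurdle concrete, here is a minimal sketch of why the file content has to be readable locally. It is not clearml-data's actual implementation: SHA-256 is an assumed choice, and `file_fingerprint` and `folder_manifest` are hypothetical helper names used only for illustration.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (sha256 hex digest, size in bytes) for a local file.

    Hypothetical helper: the whole file must be read to compute the
    hash, which is impossible if only a remote link is registered.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as fh:
        # Read in chunks so large files do not have to fit in memory
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def folder_manifest(folder):
    """Map each file's path (relative to folder) to its fingerprint."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): file_fingerprint(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Comparing two such manifests taken at different times would reveal whether any file changed, which is the kind of consistency check that is lost when links are registered without access to the content.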
I'm not quite sure, I'll need to double check 🙂
Also, do I have to manually keep track of dataset versions in a separate database? Or am I provided that as well in ClearML?
I'm not in the best position to answer these questions right now.
That, but also in a proper directory on the file system
I'm still unsure about the difference between finalize and publish, since upload should already upload the data to the server
We want to get a clearer picture here to compare versioning with ClearML Data vs our own custom versioning
Also what's the difference between Finalize vs Publish?
For example, say there are files in a specific folder on Machine A. A script on Machine A creates a Dataset, adds the files located in that folder, and publishes it. Can you then look at that dataset on the server machine? Not from the ClearML interface, but inside normal directories, like /opt/clearml etc. (the directory mentioned is just an example).
Regarding viewing the datasets - Can you give an example? I'm not sure I understand how you'd like to view it
Regarding Publish vs Finalize - I think finalize uploads all the files and prepares the dataset for publishing. Once published, it should be accessible to other parts (tasks) of the system
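For reference, the create / add files / upload / finalize / publish sequence discussed here can be sketched with the ClearML Python SDK roughly as follows. This is a sketch under assumptions, not an authoritative description of what each step does on the server: `package_folder` and its `dataset_cls` parameter are hypothetical names introduced for illustration, and the comments reflect a general reading of the SDK in which `upload()` and `finalize()` are separate calls.

```python
def package_folder(folder, name, project, dataset_cls=None, publish=False):
    """Create a dataset version from a local folder and return its ID.

    Hypothetical wrapper: dataset_cls defaults to clearml.Dataset and
    can be swapped for a stand-in object when experimenting.
    """
    if dataset_cls is None:
        # Imported lazily so the sketch can be read without clearml installed
        from clearml import Dataset as dataset_cls

    ds = dataset_cls.create(dataset_name=name, dataset_project=project)
    ds.add_files(path=folder)  # register the local files with this version
    ds.upload()                # send the packaged files to storage
    ds.finalize()              # close the version so it can no longer change
    if publish:
        ds.publish()           # optional extra step after finalizing
    return ds.id
```

A call such as `package_folder("/path/to/folder", "my-dataset", "my-project")` would need a configured ClearML server; the folder, dataset name, and project name here are placeholders.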
So I got my answer for the first one: I found where the data is stored on the server