we want to use the dataset output_uri as a common ground to create additional dataset formats such as https://webdataset.github.io/webdataset/
OutrageousSheep60 before I can answer, maybe you can explain why "zipping" them does not fit your workflow?
Hi OutrageousSheep60
AS-IS
- without compressing or breaking it up into chunks.
So for that, I would suggest manually archiving it and uploading it as an external link?
Or are you saying you want to control the compression used by the Dataset class?
https://github.com/allegroai/clearml/blob/72d9b22e0d27f317a364acfeacbcf5c70f852e8c/clearml/datasets/dataset.py#L603
I think the main difference is that I can see a value of having access to the raw format within the cloud vendor and not only have it as an archive
I see, it does make sense.
Two options. One, as you mentioned, use the ClearML StorageManager to upload the files, then register them as external links with Dataset.
Two, I know the enterprise tier has HyperDatasets, which are essentially what you describe, with version control over the "metadata" and "raw storage" on GCP, including the ability to review the files from the web UI. Unfortunately, there is no direct equivalent in the open-source version.
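For reference, option one could look roughly like the sketch below. This is only an illustration under assumptions: the helper names remote_url_for and upload_and_register and the dataset name/project arguments are made up, clearml is assumed to be installed with GCS credentials configured, and it has not been checked against a specific ClearML version.

```python
from pathlib import Path


def remote_url_for(local_file: Path, local_root: Path, bucket_prefix: str) -> str:
    """Map a local file to its destination URL, keeping the relative path."""
    relative = local_file.relative_to(local_root).as_posix()
    return f"{bucket_prefix.rstrip('/')}/{relative}"


def upload_and_register(local_root: str, bucket_prefix: str,
                        dataset_name: str, dataset_project: str) -> None:
    """Upload raw files as-is, then register them as external links."""
    # Imported lazily so the path helper above works without clearml installed.
    from clearml import Dataset, StorageManager

    root = Path(local_root)
    for local_file in sorted(p for p in root.rglob("*") if p.is_file()):
        # Each file is uploaded in its original format (no archive, no chunks).
        StorageManager.upload_file(
            local_file=str(local_file),
            remote_url=remote_url_for(local_file, root, bucket_prefix),
        )

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    # Register everything under the prefix as external links (files stay in the bucket).
    dataset.add_external_files(source_url=bucket_prefix.rstrip("/") + "/")
    dataset.upload()
    dataset.finalize()
```

A call would then be something like upload_and_register("/data/raw", "gs://<BUCKET>/raw", "my_dataset", "my_project"), where the local path, dataset name and project are placeholders.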
In order to create a webdataset we need to create tar files, so we need to unzip and then recreate the tar file.
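To make the tar step concrete, here is a minimal sketch of packing raw files into one uncompressed tar shard with the Python standard library. The function name write_shard and the shard file name are hypothetical. It relies on the WebDataset convention that files sharing a basename (e.g. 0001.wav and 0001.json) form one sample and should sit next to each other in the archive; it is not the webdataset library's own writer.

```python
import tarfile
from pathlib import Path


def write_shard(files: list[Path], root: Path, shard_path: Path) -> list[str]:
    """Pack files into an uncompressed tar shard; return member names in order."""
    # Sorting by relative path keeps files with the same basename adjacent,
    # which is how WebDataset groups them into a single sample.
    ordered = sorted(files, key=lambda p: p.relative_to(root).as_posix())
    names = []
    with tarfile.open(shard_path, mode="w") as tar:
        for path in ordered:
            arcname = path.relative_to(root).as_posix()
            tar.add(path, arcname=arcname)
            names.append(arcname)
    return names
```

With the raw files already available in their original format, something like write_shard(list(root.rglob("*.wav")), root, Path("shard-000000.tar")) can run directly on them, without the unzip step.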
Additionally, when the files are in GCS in raw format, you can easily review them with the preview (e.g. a wav file can be listened to directly within the GCP console in the web browser).
OutrageousSheep60 so this should work, no? ds.upload(output_url='gs://<BUCKET>/', compression=0, chunk_size=100000000000)
Notice the chunk size is the maximum size (in bytes) per chunk, so it should basically be very large
OutrageousSheep60 so if this is the case, I think you need to add "external links", i.e. upload the individual files to GCS, then register the links to GCS. Does that make sense?
This does not work -
Since all the files are stored as a single ZIP file (which, if unzipped, will have all the data), but we would like to have access to the raw files in their original format.
That is a workaround, but surely not optimal.
If we want to generate a dataset from a set of files that are on a local computer (e.g. a local GPU workstation that ran some media transformation), then instead of creating the Dataset directly, we need to first upload them and only then use the ClearML SDK.
Do you see any option for integrating this kind of workflow into ClearML?