Answered
Hi, We Have A Use Case That We Would Like To Upload A Local Folder Into The Cloud

Hi,
We have a use case that we would like to upload a local folder into the cloud AS-IS - without compressing or breaking it up into chunks.
I tried running the upload command as follows -
ds.upload(output_url='gs://<BUCKET>/', compression=0, chunk_size=1)
but the result is that each file is converted into a single folder with a zipfile.
I'm guessing the solution would require passing a different object to the ParallelZipper (instead of the one at https://github.com/allegroai/clearml/blob/0e283dd514bce2366584435a91c2ffa95340343b/clearml/utilities/parallel.py#L192 )
Is this the correct approach?

  
  
Posted one year ago

Answers 9


That is a workaround - but surely not optimal.
If we want to generate a dataset from a set of files that are on a local computer (e.g. a local GPU workstation that ran some media transformation) -
then instead of creating the Dataset directly, we first need to upload the files and only then use the ClearML SDK.
Do you see any option for integrating this kind of workflow into ClearML?

  
  
Posted one year ago

I think the main difference is that I can see a value of having access to the raw format within the cloud vendor and not only have it as an archive

I see, it does make sense.
Two options. One, as you mentioned, use the ClearML StorageManager to upload the files, then register them as external links with Dataset.
Two, I know the enterprise tier has HyperDatasets, which are essentially what you describe, with version control over the "metadata" and "raw storage" on GCP, including the ability to review the files from the web UI. Unfortunately there is no direct equivalent in the open-source version.

  
  
Posted one year ago

OutrageousSheep60 so this should work, no?
ds.upload(output_url='gs://<BUCKET>/', compression=0, chunk_size=100000000000)
Notice the chunk size is the maximum size (in bytes) per chunk, so it should basically be very large.

  
  
Posted one year ago

In order to create a webdataset we need to create tar files -
so we need to unzip and then recreate the tar files.
Additionally, when the files are in GCS in their raw format you can easily review them with the preview (e.g. a wav file can be listened to directly within the GCP console - web browser).
I think the main difference is that I can see a value of having access to the raw format within the cloud vendor and not only have it as an archive
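For context on the repackaging step above: a minimal stdlib-only sketch of building a WebDataset-style tar shard from a folder of raw files (the folder and shard names here are hypothetical):

```python
import tarfile
from pathlib import Path


def build_webdataset_shard(src_dir: str, shard_path: str) -> None:
    """Pack every file in src_dir into one uncompressed tar shard.

    WebDataset expects plain tar files whose member names group samples
    by a shared key (e.g. 0001.wav, 0001.json), so files are added flat.
    """
    with tarfile.open(shard_path, mode="w") as tar:  # "w" = no compression
        for path in sorted(Path(src_dir).iterdir()):
            if path.is_file():
                tar.add(path, arcname=path.name)


# Hypothetical usage:
# build_webdataset_shard("raw_audio/", "shard-0000.tar")
```

If the files were only available inside a ClearML-produced zip, you would first extract them and then run something like the above - which is exactly the extra unzip/re-tar step described.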

  
  
Posted one year ago

This does not work -
since all the files are stored as a single ZIP file (which, if unzipped, will contain all the data), but we would like to have access to the raw files in their original format.

  
  
Posted one year ago

OutrageousSheep60 before I can answer, maybe you can explain why "zipping" them does not fit your workflow?

  
  
Posted one year ago

We want to use the dataset output_uri as a common ground to create additional dataset formats such as https://webdataset.github.io/webdataset/

  
  
Posted one year ago

OutrageousSheep60 so if this is the case I think you need to add "external links", i.e. upload the individual files to GCS, then register the links with the Dataset. Does that make sense?
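A minimal sketch of that flow, assuming the standard ClearML SDK calls (`StorageManager.upload_file`, `Dataset.add_external_files`) and placeholder bucket/folder names:

```python
from pathlib import Path

from clearml import Dataset, StorageManager

LOCAL_DIR = Path("my_local_folder")        # placeholder local folder
BUCKET_PREFIX = "gs://<BUCKET>/raw_files"  # placeholder GCS prefix

# 1. Upload each file as-is (no zipping), preserving the folder layout.
for path in LOCAL_DIR.rglob("*"):
    if path.is_file():
        remote = f"{BUCKET_PREFIX}/{path.relative_to(LOCAL_DIR)}"
        StorageManager.upload_file(local_file=str(path), remote_url=remote)

# 2. Register the uploaded objects as external links in a Dataset.
ds = Dataset.create(dataset_name="raw-folder", dataset_project="example")
ds.add_external_files(source_url=BUCKET_PREFIX)
ds.upload()    # external links only, so this mostly records state
ds.finalize()
```

The raw files then stay directly browsable in the GCS console, while the Dataset tracks them by link.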

  
  
Posted one year ago

Hi OutrageousSheep60

AS-IS

  • without compressing or breaking it up into chunks.

So for that I would suggest manually archiving it, and uploading it as an external link?
Or are you saying you want to control the compression used by the Dataset class?
https://github.com/allegroai/clearml/blob/72d9b22e0d27f317a364acfeacbcf5c70f852e8c/clearml/datasets/dataset.py#L603
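If the goal is only to skip compression (while still producing an archive), the linked `compression` argument appears to take the standard `zipfile` constants (an assumption based on the linked source); `ZIP_STORED` keeps members as raw bytes. A stdlib-only illustration of the difference, independent of ClearML:

```python
import io
import zipfile

# Same repetitive payload written with and without compression.
payload = b"0123456789" * 1000


def zip_size(method: int) -> int:
    """Return the byte size of a one-file zip archive using `method`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("sample.bin", payload)
    return buf.tell()


stored = zip_size(zipfile.ZIP_STORED)      # members kept as raw bytes
deflated = zip_size(zipfile.ZIP_DEFLATED)  # members deflate-compressed
assert stored > deflated  # repetitive data compresses well
```

Note this still wraps the files in a zip container, so it does not address the "browse raw files in the GCS console" requirement.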

  
  
Posted one year ago