Our datasets are more than 1TB and will keep growing (probably to 4TB and up), which means we also need 4TB of local storage
Yes, because somewhere you will have to store your unzipped files.
Or you point to the S3 bucket and fetch the data only when you need to access it (or prefetch it), using the S3 links the Dataset stores.
Yes, but does add_external_files make chunked zips like add_files does?
No, it only references them (i.e. it stores metadata and does not do anything with the files themselves).
I need the zipping and chunking to manage millions of files
That makes sense. If that's the case, you will have to download those files anyway and then add them with add_files.
You can use the StorageManager to download them, and then add them from the local copy (this will zip/chunk them).
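A minimal sketch of that download-then-add flow, assuming the `clearml` package is installed and storage credentials are configured. The names `has_free_space` and `mirror_and_add`, and all of their parameters, are hypothetical and only for illustration. The free-space check is an extra guard, not something ClearML does for you.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    # `path` must already exist, otherwise shutil.disk_usage raises an error.
    return shutil.disk_usage(path).free >= required_bytes


def mirror_and_add(remote_url: str, local_folder: str, dataset_name: str,
                   dataset_project: str, required_bytes: int) -> str:
    """Download a remote folder, then add it to a new dataset from the local copy."""
    # Imported lazily so the helper above works without clearml installed.
    from clearml import Dataset, StorageManager

    if not has_free_space(local_folder, required_bytes):
        raise RuntimeError("not enough local disk space for the download")

    # Download the remote folder, keeping its sub-folder structure.
    local_copy = StorageManager.download_folder(
        remote_url=remote_url, local_folder=local_folder
    )
    if not local_copy:
        raise RuntimeError(f"download failed for {remote_url}")

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    dataset.add_files(path=local_copy)  # register the local files
    dataset.upload()                    # compress into chunks and upload
    dataset.finalize()
    return dataset.id
```

Note that this variant needs room for the whole download at once, so it only fits datasets that are smaller than the local disk.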
Hi @<1590514584836378624:profile|AmiableSeaturtle81>
I think you should use add_external_files instead of add_files (which is for local files).
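A short sketch of the add_external_files route, assuming the `clearml` package is installed and configured. The helper `s3_url`, the function `register_external`, and every argument name are hypothetical. Calling `upload()` before `finalize()` follows the usual dataset workflow; whether it is strictly required for link-only datasets is worth checking on your ClearML version.

```python
def s3_url(bucket: str, prefix: str = "") -> str:
    """Build an s3:// URL from a bucket name and an optional key prefix."""
    prefix = prefix.strip("/")
    return f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"


def register_external(bucket: str, prefix: str, dataset_name: str,
                      dataset_project: str) -> str:
    """Create a dataset that references files in S3 instead of copying them."""
    # Imported lazily so s3_url can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    # Only links to the remote objects are recorded; nothing is zipped or chunked.
    dataset.add_external_files(source_url=s3_url(bucket, prefix))
    dataset.upload()
    dataset.finalize()
    return dataset.id
```

This keeps local storage needs close to zero, but it gives up the zipping and chunking that add_files provides.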
Our datasets are more than 1TB and will keep growing (probably to 4TB and up), which means we also need 4TB of local storage just to upload the dataset back in zipped format. This is not a good solution.
What we can do, I guess, is download the files locally in chunks?
Download 100 files locally, add them to the ClearML dataset, repeat
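The download-add-repeat loop could look roughly like the sketch below. It assumes `clearml` is installed and configured, and that calling `upload()` after each `add_files()` sends only the newly added files (please verify this on your ClearML version before relying on it). The names `batched` and `upload_in_batches` are hypothetical, and the staging folder is flattened, so files with the same base name in different remote folders would collide.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def upload_in_batches(remote_urls: Iterable[str], dataset_name: str,
                      dataset_project: str, batch_size: int = 100) -> str:
    """Download `batch_size` files at a time, add them, upload, and clean up."""
    import shutil
    import tempfile
    from pathlib import Path

    # Imported lazily so `batched` can be used without clearml installed.
    from clearml import Dataset, StorageManager

    dataset = Dataset.create(dataset_name=dataset_name, dataset_project=dataset_project)
    for batch in batched(remote_urls, batch_size):
        staging = Path(tempfile.mkdtemp(prefix="clearml_batch_"))
        try:
            for url in batch:
                local = StorageManager.get_local_copy(remote_url=url, extract_archive=False)
                if local is None:
                    raise RuntimeError(f"download failed for {url}")
                # Flattened layout: keep only the last path component of the URL.
                file_name = url.rstrip("/").rsplit("/", 1)[-1]
                shutil.copy(local, staging / file_name)
            dataset.add_files(path=str(staging))
            dataset.upload()  # assumed to upload only the files added in this batch
        finally:
            # Free the staging space before the next batch.
            shutil.rmtree(staging, ignore_errors=True)
    dataset.finalize()
    return dataset.id
```

With this shape, peak local disk use is roughly one batch plus whatever the ClearML download cache holds, rather than the full dataset.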