I see, so its a path. Another question, as far as i can tell, clearml-data will download entire datasets before starting training. This isn't very ideal when we are dealing with billions of datasets (E.g. WE might want to download a subset at a time, send to GPU for training and then use the CPU to concurrently pull another subset.). Any comments on this?
Got that thanks. Just to better understand. When clearml-data upload my recursive folder of image data, it convert it into a compressed form with a different folder structure than the original datasets.
When my software pull the data, i'm returned a str. How would we manipulate the data from there?
like create multiple datasets?
create parent (all) - upload to S3
create child1 (first 100k)
create child2 (second 100k)...blah blah
Then only pull indices from children. Technically workable but not sure if its best approach since different ppl have different batch sizes in mind.
Hi SubstantialElk6 ,
You can configuration S3 credentials on your
~/clearml.conf file, or with environment variables:
os.environ['AWS_ACCESS_KEY_ID'] ="***" os.environ['AWS_SECRET_ACCESS_KEY'] = "***" os.environ['AWS_DEFAULT_REGION'] = "***"