I see, so its a path. Another question, as far as i can tell, clearml-data will download entire datasets before starting training. This isn't very ideal when we are dealing with billions of datasets (E.g. WE might want to download a subset at a time, send to GPU for training and then use the CPU to concurrently pull another subset.). Any comments on this?
Hi SubstantialElk6 ,
You can configuration S3 credentials on your ~/clearml.conf
file, or with environment variables:os.environ['AWS_ACCESS_KEY_ID'] ="***" os.environ['AWS_SECRET_ACCESS_KEY'] = "***" os.environ['AWS_DEFAULT_REGION'] = "***"
SubstantialElk6 you can try:
dataset_upload_task = Dataset.get(dataset_id=dataset_task) path_with_data = dataset_upload_task.get_local_copy()
let me check if I can think about something else (I know the enterprise edition has full support for such thing and for unstructured data too).
BTW ClearML always use cache, so the big download is done only once.
get_local_copy()
will return the entire dataset, but you can divide the dataset parts and have the same parent for all of them, can this work?
like create multiple datasets?
create parent (all) - upload to S3
create child1 (first 100k)
create child2 (second 100k)...blah blah
Then only pull indices from children. Technically workable but not sure if its best approach since different ppl have different batch sizes in mind.
Got that thanks. Just to better understand. When clearml-data upload my recursive folder of image data, it convert it into a compressed form with a different folder structure than the original datasets.
When my software pull the data, i'm returned a str. How would we manipulate the data from there?