AgitatedDove14 Yes, this is exactly what I was looking for. I was running into a 413 Request Entity Too Large error during add_files. This is what helped to solve it:
https://github.com/allegroai/clearml/issues/257#issuecomment-736020929
That means I need to pass a single zip file to the path argument in add_files, right?
Actually the opposite: you pass a folder (of files) to add_files. add_files then remembers the files' location (and pre-calculates the hash of each file's content). When you call upload, it compresses the files that changed into a zip file (or several files, depending on the chunk size) and uploads them to the destination specified in the upload call.
If you pass s3://bucket/folder as the output destination for the upload call, clearml will automatically create a subfolder for the dataset and upload the compressed zip file there.
Is this what you are looking for?
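For reference, the flow described above could look roughly like the sketch below. It assumes the clearml package is installed and configured with S3 credentials. The helper name register_local_folder and the create argument are made up for illustration; only Dataset.create, add_files, upload and finalize are clearml calls.

```python
from typing import Any, Callable


def _clearml_dataset_create(**kwargs: Any) -> Any:
    # Imported lazily so this module loads even where clearml is not installed.
    from clearml import Dataset

    return Dataset.create(**kwargs)


def register_local_folder(
    folder: str,
    dataset_name: str,
    dataset_project: str,
    output_url: str,
    create: Callable[..., Any] = _clearml_dataset_create,
) -> Any:
    """Hypothetical helper: register a local folder and upload it to output_url."""
    dataset = create(dataset_name=dataset_name, dataset_project=dataset_project)
    # Per the explanation above, add_files records file locations and content hashes.
    dataset.add_files(path=folder)
    # upload compresses the changed files and sends them to the given destination,
    # e.g. "s3://bucket/folder"; clearml creates a dataset subfolder under it.
    dataset.upload(output_url=output_url)
    # finalize closes this dataset version so it can no longer be modified.
    dataset.finalize()
    return dataset
```

Usage would be something like register_local_folder("./data", "my-dataset", "my-project", "s3://bucket/folder"), where the folder, dataset and project names are placeholders.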
AgitatedDove14 yeah, that makes sense, thank you. That means I need to pass a single zip file to the path argument in add_files, right?
The files themselves are not on S3 yet, they are stored locally. That's what I want: register a new dataset and upload the data itself to S3
MelancholyElk85
How do I add files without uploading them anywhere?
The files themselves need to be packaged into a zip file (so we have an immutable copy of the dataset). This means you cannot "register" existing files (in your example, files on your S3 bucket?!). The idea is to make sure your dataset is protected against changes on the one hand, but on the other to allow you to change it, and only store the changeset.
Does that make sense?
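To make the "only store the changeset" idea concrete, here is a rough standard-library sketch. It is not ClearML's actual implementation, just an illustration of hashing file contents, comparing against a parent version, and zipping only what changed. All function names are made up for this example.

```python
import hashlib
import zipfile
from pathlib import Path
from typing import Dict, List


def hash_folder(folder: Path) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 of its content."""
    return {
        p.relative_to(folder).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(parent: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return paths that are new or whose content hash differs from the parent."""
    return sorted(
        name for name, digest in current.items() if parent.get(name) != digest
    )


def zip_changeset(folder: Path, names: List[str], archive: Path) -> Path:
    """Write only the listed files into a zip archive (the stored changeset)."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(folder / name, arcname=name)
    return archive
```

The archive should be written outside the hashed folder, otherwise the next hash_folder call would pick it up as a new file. Removed files are not tracked in this sketch.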
I'm afraid that would be the best method. You could probably hack something into clearml sdk yourself since it's open source
There seems to be no way to change default_output_uri from the code.
Dataset.create calls Task.create, which in turn accepts the add_task_init_call flag. Task.init accepts output_uri, but we cannot pass arguments through add_task_init_call, so we cannot change output_uri from Dataset.create, right?
CostlyOstrich36 there is an old similar thread, but they recommend changing the config
add_files. There is no upload call, because add_files uploads the files by itself, if I understood it correctly
CostlyOstrich36 thank you for the quick answer! I tried it, but there is still a 413 Request Entity Too Large error, as if it still uses the default fileserver
MelancholyElk85, it looks like add_files has the following parameter: dataset_path
Try with it 🙂
Changing sdk.development.default_output_uri in clearml.conf seems to be a bad idea, because different datasets will likely have different addresses on S3
AgitatedDove14 SuccessfulKoala55 maybe you know. How do I add files without uploading them anywhere?
CostlyOstrich36 hi! yes, as I expected, it doesn't see any files unless I call add_files first
But add_files has no output_url parameter and tries to upload to the default location. This returns a 413 Request Entity Too Large error because there are too many files, so using the default location is not an option. Could you please help with this?
From the looks of it, yes. But give it a try to see how it behaves without
Yeah, but do I need to call add_files first?
MelancholyElk85, I think the upload() function has the parameter you need: output_url