AgitatedDove14 yeah, that makes sense, thank you. That means I need to pass a single zip file to the path argument in add_files, right?
The files themselves are not on S3 yet, they are stored locally. That's what I want: register a new dataset and upload the data itself to S3
AgitatedDove14 Yes, this is exactly what I was looking for. I was running into a 413 Request Entity Too Large error during add_files. This is what helped to solve it:
https://github.com/allegroai/clearml/issues/257#issuecomment-736020929
MelancholyElk85
How do I add files without uploading them anywhere?
The files themselves need to be packaged into a zip file (so we have an immutable copy of the dataset). This means you cannot "register" existing files (in your example, files on your S3 bucket?!). The idea is to make sure your dataset is protected against changes on the one hand, but on the other to allow you to change it, and only store the changeset.
Does that make sense ?
That means I need to pass a single zip file to path argument in add_files, right?
Actually the opposite: you pass a folder (of files) to add_files. Then add_files remembers the files' locations (and pre-calculates the hash of each file's content). When you call upload it will actually compress the files that changed into a zip file (or several files, depending on the chunk size), and upload them to the destination (as specified in the upload call).
If you pass s3://bucket/folder as the output destination for the upload call, clearml will automatically create a subfolder for the dataset and upload the compressed zip file there.
Is this what you are looking for ?
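The flow described above can be sketched roughly as follows. This is a minimal sketch, not the official recipe: the helper name register_and_upload and the dataset_cls argument are hypothetical (the latter only lets the function be exercised with a stand-in class), and it assumes the upload call takes its destination through an output_url argument.

```python
def register_and_upload(name, project, local_folder, destination, dataset_cls=None):
    """Create a dataset from a local folder and upload it to `destination`.

    `dataset_cls` is a hypothetical hook for passing a stand-in class;
    by default the real clearml Dataset class is used.
    """
    if dataset_cls is None:
        # Imported lazily so the sketch can be read and exercised without clearml.
        from clearml import Dataset as dataset_cls

    dataset = dataset_cls.create(dataset_name=name, dataset_project=project)
    # add_files takes a folder of files, not a zip; it records the files
    # and hashes their content.
    dataset.add_files(path=local_folder)
    # upload compresses the changed files and sends them to the destination
    # (assumption: the destination argument is named output_url).
    dataset.upload(output_url=destination)
    # finalize closes this dataset version against further changes.
    dataset.finalize()
    return dataset.id
```

Calling it with something like register_and_upload("my_dataset", "my_project", "./data", "s3://bucket/folder") should put the compressed files in a dataset subfolder under that S3 prefix, assuming S3 credentials are already configured for clearml.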
CostlyOstrich36 thank you for the quick answer! I tried it but there is still a 413 Request Entity Too Large error, as if it still uses the default fileserver
CostlyOstrich36 hi! yes, as I expected, it doesn't see any files unless I call add_files first
But add_files has no output_url parameter and tries to upload to the default place. This returns a 413 Request Entity Too Large error because there are too many files, so using the default location is not an option. Could you please help with this?
MelancholyElk85, I think the upload() function has the parameter you need: output_url
MelancholyElk85, it looks like add_files has the following parameter: dataset_path
Try with it 🙂
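For reference, a small sketch of how dataset_path might be used. As far as I understand it, dataset_path controls where the files appear inside the dataset, not where the data is stored, so it would not by itself redirect an upload. The helper name add_folder_under is hypothetical, and dataset is assumed to be an object returned by Dataset.create.

```python
def add_folder_under(dataset, local_folder, path_in_dataset):
    """Register `local_folder` so its files appear under `path_in_dataset` in the dataset."""
    # dataset_path sets the location inside the dataset; the storage
    # destination is still chosen separately, in the upload call.
    return dataset.add_files(path=local_folder, dataset_path=path_in_dataset)
```

For example, add_folder_under(dataset, "./images", "train/images") would register the local images so that they are listed under train/images in the dataset.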
From the looks of it, yes. But give it a try to see how it behaves without
CostlyOstrich36 there is an old similar thread, but they recommend changing the config
There seems to be no way to change default_output_uri from the code.
Dataset.create calls Task.create, which in turn accepts the add_task_init_call flag. Task.init accepts output_uri, but we cannot add arguments with add_task_init_call, so we cannot change output_uri from Dataset.create, right?
Changing sdk.development.default_output_uri in clearml.conf seems to be a bad idea, because different datasets will likely have different addresses on S3
I'm afraid that would be the best method. You could probably hack something into clearml sdk yourself since it's open source
add_files. There is no upload call, because add_files uploads files by itself, if I got it correctly
AgitatedDove14 SuccessfulKoala55 maybe you know. How do I add files without uploading them anywhere?
Yeah, but do I need to call add_files first?