Answered
Hi! How To Add Files Locally To A Dataset And Then Upload To A Custom S3 Location?

Hi! How do I add files locally to a dataset and then upload them to a custom S3 location? The location should be specified within the Python code, NOT in clearml.conf. default_output_uri is not an option! The most intuitive way seems to be add_files followed by upload, as in this example: https://github.com/allegroai/clearml/blob/master/examples/datasets/dataset_creation.py

BUT it doesn't work this way: add_files not only adds the files, it also uploads them to the default location! Why? How do I just add files locally?

  
  
Posted 3 years ago

Answers 18


MelancholyElk85, I think the upload() function has the parameter you need: output_url

https://github.com/allegroai/clearml/blob/a68f832a8a12665f7705cfbf14c5fe195f6d7469/clearml/datasets/dataset.py#L323
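A minimal sketch of what this suggestion looks like in practice. Names (dataset name, project, bucket URI) are illustrative, and the key point is that the destination is passed to upload() in code rather than taken from clearml.conf:

```python
def upload_dataset_to_s3(folder: str, s3_uri: str) -> str:
    """Create a dataset from a local folder and upload it to a custom S3 URI.

    Illustrative sketch; dataset/project names and the URI are assumptions.
    """
    from clearml import Dataset  # imported here so the sketch stays self-contained

    ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
    ds.add_files(path=folder)      # local bookkeeping: file paths + content hashes
    ds.upload(output_url=s3_uri)   # e.g. "s3://my-bucket/datasets"
    ds.finalize()
    return ds.id
```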

  
  
Posted 3 years ago

CostlyOstrich36 thank you for the quick answer! I tried it, but I still get a 413 Request Entity Too Large error, as if it is still using the default fileserver

  
  
Posted 3 years ago

From the looks of it, yes. But give it a try and see how it behaves

  
  
Posted 3 years ago

Does it fail at add_files or at upload?

  
  
Posted 3 years ago

CostlyOstrich36 hi! Yes, as I expected, it doesn't see any files unless I call add_files first

But add_files has no output_url parameter and tries to upload to the default location. This returns a 413 Request Entity Too Large error because there are too many files, so using the default location is not an option. Could you please help with this?

  
  
Posted 3 years ago

There seems to be no way to change default_output_uri from code.

Dataset.create calls Task.create, which in turn accepts an add_task_init_call flag. Task.init accepts output_uri, but we cannot pass extra arguments through add_task_init_call, so we cannot change output_uri from Dataset.create, right?

  
  
Posted 3 years ago

I'm afraid that would be the best method. You could probably hack something into the clearml SDK yourself, since it's open source

  
  
Posted 3 years ago

CostlyOstrich36 there is an old thread about a similar issue, but they recommend changing the config

https://clearml.slack.com/archives/CTK20V944/p1626722835308600?thread_ts=1626600358.282400&cid=CTK20V944

  
  
Posted 3 years ago

"That means I need to pass a single zip file to the path argument in add_files, right?"

Actually the opposite: you pass a folder (of files) to add_files. add_files then remembers the files' location (and pre-calculates the hash of each file's content). When you call upload, it actually compresses the files that changed into a zip file (or several files, depending on the chunk size) and uploads them to the destination specified in the upload call.
If you pass s3://bucket/folder as the output destination for the upload call, ClearML will automatically create a subfolder for the dataset and upload the compressed zip file there.
Is this what you are looking for?
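The flow described above can be sketched as follows. Names are illustrative, and the chunk_size argument is an assumption (it controls the size of each zip chunk in newer SDK versions; omit it if your version does not support it):

```python
def publish_folder(local_folder: str, destination: str) -> str:
    """Sketch of the add_files -> upload flow (illustrative names).

    add_files() only records file locations and content hashes. The
    compression into zip chunk(s) and the network transfer happen in
    upload(), which creates a per-dataset subfolder under the destination.
    """
    from clearml import Dataset

    ds = Dataset.create(dataset_name="images", dataset_project="demo")
    ds.add_files(path=local_folder)   # a folder of files, not a pre-made zip
    ds.upload(
        output_url=destination,       # e.g. "s3://bucket/folder"
        chunk_size=512,               # MB per zip chunk, if your SDK supports it
    )
    ds.finalize()
    return ds.id
```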

  
  
Posted 3 years ago

Changing sdk.development.default_output_uri in clearml.conf seems like a bad idea, because different datasets will likely live at different S3 addresses

  
  
Posted 3 years ago

MelancholyElk85, it looks like add_files has the following parameter: dataset_path
Try it 🙂
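For reference, dataset_path in add_files sets where the files appear *inside* the dataset tree; it does not choose the storage destination (that is still decided at upload() time). A small illustrative sketch, with assumed folder names:

```python
def add_under_prefix(dataset, local_folder: str, prefix: str) -> None:
    """Register files from local_folder under a relative prefix inside the
    dataset. Storage location is unaffected; only the in-dataset layout is.
    """
    dataset.add_files(path=local_folder, dataset_path=prefix)
```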

  
  
Posted 3 years ago

MelancholyElk85

"How do I add files without uploading them anywhere?"

The files themselves need to be packaged into a zip file (so we have an immutable copy of the dataset). This means you cannot "register" existing files in place (in your example, files already on your S3 bucket?!). The idea is to protect your dataset against changes on the one hand, while on the other allowing you to change it and store only the changeset.
Does that make sense?
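The changeset idea above can be sketched with dataset versioning: a child dataset references its parent, and only files whose content hash changed are compressed and uploaded. Names and the destination URI are illustrative:

```python
def new_version(parent_id: str, changed_folder: str, destination: str) -> str:
    """Create a new dataset version that stores only the changeset.

    Files with unchanged content hashes are referenced from the parent
    dataset rather than re-uploaded. Illustrative sketch.
    """
    from clearml import Dataset

    child = Dataset.create(
        dataset_name="my_dataset",
        dataset_project="my_project",
        parent_datasets=[parent_id],  # inherit the previous version's content
    )
    child.add_files(path=changed_folder)
    child.upload(output_url=destination)  # e.g. "s3://my-bucket/datasets"
    child.finalize()
    return child.id
```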

  
  
Posted 3 years ago

AgitatedDove14 yeah, that makes sense, thank you. That means I need to pass a single zip file to the path argument in add_files, right?

The files themselves are not on S3 yet; they are stored locally. That's what I want: register a new dataset and upload the data itself to S3

  
  
Posted 3 years ago

Yeah, but do I need to call add_files first?

  
  
Posted 3 years ago

AgitatedDove14 SuccessfulKoala55 maybe you know. How do I add files without uploading them anywhere?

  
  
Posted 3 years ago

add_files. There is no upload call, because add_files uploads the files by itself, if I understand it correctly

  
  
Posted 3 years ago

AgitatedDove14 Yes, this is exactly what I was looking for, and I was running into a 413 Request Entity Too Large error during add_files. This is what helped to solve it:
https://github.com/allegroai/clearml/issues/257#issuecomment-736020929

  
  
Posted 3 years ago

So now it works smoothly.

  
  
Posted 3 years ago