Answered
[Pipeline] Hey, is it possible to specify the output uri for Pipelines and their Components using Pipeline decorators? I would like to store Pipeline artifacts and Component artifacts on S3.

  
  
Posted one year ago

Answers 7


Hmm. Okay. Thanks

  
  
Posted one year ago

Ahh that’s great, thank you.

And then I could use storage manager or whatever to get the files. Perfect
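
For example, a rough sketch of fetching the files with StorageManager (the S3 path here is just a placeholder for whatever URL the pipeline step returned):

from clearml import StorageManager

# download a local copy of the files behind the returned S3 URL
local_path = StorageManager.get_local_copy(remote_url='s3://my-bucket/some/artifact/path')
print(local_path)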

  
  
Posted one year ago

The return objects were stored to S3 but PipelineDecorator.upload_artifact still uploaded to the file server. Not sure what was up with that but as explained in my next comment it did work when I tried again.

It also seems that PipelineDecorator.upload_artifact is not compatible with caching, sadly, but that is another issue for another thread that I will be starting on Monday.

Have a good weekend

  
  
Posted one year ago

I have added a lot of detail to this, sorry.

The inline comments in the code talk about that specific script/implementation.

I have added a lot of context in the doc string at the top.

  
  
Posted one year ago

So, the way it works: when you run a component, the return value (together with the entire function execution) is cached. Basically:

this did NOT add the artifact to the pipeline via caching on subsequent runs ❌

you just need to do:

PipelineDecorator.upload_artifact(name='images', artifact_object=img_dir, wait_on_upload=True)
return Task.current_task().artifacts['images'].url

This will return the URL of the uploaded images (i.e. the S3 bucket), which means that even if the component is cached you will still get the URL back:

image_bucket = gen_random_images()
second_step(image_bucket)

BTW: you can always get the currently executing Task (from any part of the pipeline) with Task.current_task(); there is no need to call "pipe._get_pipeline_task()".
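
Putting the above together, a minimal sketch of the full pattern (the import path, decorator arguments and directory name are my assumptions, not from the snippets above):

from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=['image_bucket'], cache=True)
def gen_random_images():
    img_dir = '/tmp/images'  # hypothetical local directory the images are written to
    # ... generate the images into img_dir ...
    # upload explicitly, then return the resulting URL so the URL itself is
    # what gets cached and passed on to the next step
    PipelineDecorator.upload_artifact(name='images', artifact_object=img_dir, wait_on_upload=True)
    return Task.current_task().artifacts['images'].url

@PipelineDecorator.component()
def second_step(image_bucket):
    # image_bucket is the S3 URL returned above (also on a cache hit)
    print(image_bucket)

@PipelineDecorator.pipeline(name='images pipeline', project='examples', version='0.1')
def run_pipeline():
    image_bucket = gen_random_images()
    second_step(image_bucket)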

  
  
Posted one year ago

Hi ReassuredOwl55
The easiest is to configure it as the default output_uri in the clearml.conf file of the agent, wdyt?
https://github.com/allegroai/clearml-agent/blob/ebb955187dea384f574a52d059c02e16a49aeead/docs/clearml.conf#L430
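
For reference, a minimal sketch of what that could look like in the agent's clearml.conf (the bucket name is a placeholder, and the credentials section is only needed if they are not already available via environment variables or an IAM role):

sdk {
    development {
        # every task / pipeline step run by this agent will default its output here
        default_output_uri: "s3://my-bucket/clearml"
    }
    aws {
        s3 {
            key: ""
            secret: ""
            region: ""
        }
    }
}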

  
  
Posted one year ago

It also seems that PipelineDecorator.upload_artifact is not compatible with caching, sadly,

Both use the exact same mechanism for uploading artifacts (i.e. including caching for downloaded artifacts). In terms of caching pipeline components, this is on a component level (i.e. same code/task and same arguments equals a cache hit).
What exactly are you getting? How is it that "PipelineDecorator.upload_artifact" uploads to a different storage? Is that reproducible?
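
As a rough illustration of what component-level caching means (argument values are made up):

@PipelineDecorator.component(cache=True)
def preprocess(n_samples):
    ...

# on a subsequent pipeline run:
preprocess(n_samples=100)   # same code, same arguments -> served from cache
preprocess(n_samples=200)   # different arguments -> executed again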

  
  
Posted one year ago