In every step I want to store the folders as artifacts on S3, because it’s very convenient.
The last step should merge them all, i.e. it needs to know about all the artifacts of the previous steps.
So if all artifacts are logged in the pipeline controller task, I need the last task to access all the artifacts from the pipeline task. I need to execute something like PipelineController.get_artifact() in the last step task.
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
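A rough sketch of that workaround, in case it helps. It assumes the controller task id is passed to the last step as a parameter; `download_artifacts`, `merge_step` and the artifact names are made up for illustration. Only `Task.get_task`, `task.artifacts` and `get_local_copy()` are ClearML calls.

```python
def download_artifacts(source_task, names):
    """Return {artifact name: local path} for the requested artifacts.

    `source_task` is expected to behave like a clearml.Task: `.artifacts`
    maps artifact names to objects exposing `get_local_copy()`.
    """
    local_paths = {}
    for name in names:
        artifact = source_task.artifacts[name]
        # Downloads the artifact (if not already cached) and returns a local path.
        local_paths[name] = artifact.get_local_copy()
    return local_paths


def merge_step(controller_task_id, artifact_names):
    """Hypothetical body of the last step: fetch artifacts from the controller task."""
    # Imported lazily so download_artifacts can be tried without ClearML installed.
    from clearml import Task

    controller = Task.get_task(task_id=controller_task_id)
    return download_artifacts(controller, artifact_names)
```

The same helper works if you pass the ids of the individual step tasks instead of the controller id, which would keep the artifacts off the controller task.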
So in which scenario do you want to keep those folders as artifacts and where would you like to store them?
So in my use case each step would create a folder (potentially big) and store it as an artifact. The last step should “merge” all the previous folders. The idea is to split the work among multiple machines (in parallel). I would like to avoid also storing these potentially big folder artifacts in the pipeline task, because that task will be running on the services queue in the clearml-server instance, which will definitely not have enough space to handle all of them.
I would also like to avoid any copy of these artifacts on S3 (to avoid double costs, since some folders might be big).
JitteryCoyote63, heya, yes it is :)
You can save the entire folder as an artifact.
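Something along these lines should work as a minimal sketch. `upload_folder_artifact` and the artifact name are placeholders, not part of ClearML; the only ClearML call is `task.upload_artifact`, which as far as I know packs a folder path into a single zip before uploading it.

```python
from pathlib import Path


def upload_folder_artifact(task, name, folder):
    """Upload `folder` as one artifact on `task` (expected to be a clearml.Task)."""
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(str(folder))
    # Passing a folder path as artifact_object uploads the folder as one artifact.
    return task.upload_artifact(name=name, artifact_object=str(folder))


# Hypothetical usage inside a pipeline step:
#   from clearml import Task
#   task = Task.current_task()
#   upload_folder_artifact(task, "step_1_output", "outputs/")
```

Where the artifact ends up depends on the task's output destination (for example an `output_uri` pointing at your S3 bucket), so that part is down to your setup.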
Do you mean if they are shared between steps or if each step creates a duplicate?
I think it depends on your code and the pipeline setup. You can also cache steps, avoiding the need to worry about artifacts entirely.
CostlyOstrich36 super, thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied), or just referenced?