In every step I want to store the folders as artifacts on S3, because it’s very convenient.
The last step should merge them all, i.e. it needs to know all the artifacts of the previous steps.
So in which scenario do you want to keep those folders as artifacts and where would you like to store them?
I think it depends on your code and the pipeline setup. You can also cache steps, which avoids the need to worry about artifacts entirely.
Do you mean if they are shared between steps or if each step creates a duplicate?
So in my use case each step would create a folder (potentially big) and store it as an artifact. The last step should “merge” all the previous folders. The idea is to split the work among multiple machines (in parallel). I would like to avoid having these potentially big folder artifacts also stored in the pipeline task, because that task will be running on the services queue in the clearml-server instance, which will definitely not have enough space to handle all of them.
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
CostlyOstrich36 super thanks for confirming! I then have a follow-up question: are the artifacts duplicated (copied), or just referenced?
I would also like to avoid any copy of these artifacts on S3 (to avoid double costs, since some folders might be big).
JitteryCoyote63, heya, yes it is :)
You can save the entire folder as an artifact.
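A minimal sketch of what that could look like inside a step, assuming `task` is the step's `clearml.Task` (for example from `Task.current_task()`). The helper name and the `<step>_output` naming scheme are made up for illustration, and where the archive ends up depends on the task's `output_uri`:

```python
from pathlib import Path


def upload_step_folder(task, folder, step_name):
    """Upload `folder` as one artifact of `task` and return the artifact name.

    `task` is expected to be a clearml.Task, e.g. Task.current_task() inside
    a pipeline step. The helper itself is a hypothetical convenience wrapper.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise NotADirectoryError(f"{folder} is not a folder")
    artifact_name = f"{step_name}_output"  # hypothetical naming scheme
    # Passing a folder path makes ClearML zip the folder before uploading it.
    task.upload_artifact(name=artifact_name, artifact_object=str(folder))
    return artifact_name
```

In a step this might be called as `upload_step_folder(Task.current_task(), "results", "step1")`, with `Task` imported from `clearml`.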
So if all artifacts are logged in the pipeline controller task, I need the last task to access all the artifacts from the pipeline task. I need to execute something like PipelineController.get_artifact() in the last step task.
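One way this could be sketched, assuming the controller task ID is passed to the last step as a parameter and the folder artifacts are reachable through that task. `merge_folder_artifacts` and `controller_task_id` are hypothetical names; the ClearML calls relied on are `Task.get_task(task_id=...)`, the task's `artifacts` mapping, and `get_local_copy()` on each artifact:

```python
import shutil
from pathlib import Path


def merge_folder_artifacts(source_task, artifact_names, destination):
    """Download folder artifacts from `source_task` and merge them.

    `source_task` is expected to be a clearml.Task, for instance
    Task.get_task(task_id=controller_task_id), where controller_task_id is a
    hypothetical parameter handed to the last step.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    for name in artifact_names:
        # For a folder artifact, get_local_copy() downloads the zip and
        # returns the path of the extracted folder.
        local_folder = source_task.artifacts[name].get_local_copy()
        # Files from later artifacts overwrite same-named files from earlier ones.
        shutil.copytree(local_folder, destination, dirs_exist_ok=True)
    return destination
```

The download and merge happen on whichever machine runs the last step, so the local disk usage lands on that worker. `dirs_exist_ok` needs Python 3.8 or newer.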