Thanks, yes I am familiar with all of the above.
We want to validate the entire pipeline. I am not talking about using a ClearML Pipeline as the validator (which is the case in your examples).
Here is some further detail that will hopefully make things more obvious:
- The pipeline is a series of steps which creates a feature store – in fact, you might even call it a feature pipeline!
- Each pipeline step takes responsibility for a different bit of feature engineering.
- We want to validate each of the feature engineering steps individually, and validate the pipeline as a whole.
- At its simplest, this could just mean checking that all of the steps and the pipeline itself have completed successfully (by checking their “Task status” – see the sketch after this list).
- If everything is as it should be, we add a tag to this feature pipeline Task to say that it is suitable for use in production and can be cloned safely in the future.
- (When used in production, we would of course change the values of the input args of the pipeline before running, such that the feature pipeline takes in different data)
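For the status check mentioned above, I imagine something like the following minimal sketch (assuming the step Tasks can be found by filtering on the pipeline Task as their parent – I have not verified that filter, and "completed" is just the status string I would expect):

from clearml import Task

def all_steps_completed(pipeline_task_id):
    # The pipeline controller Task for the run we want to validate
    pipeline_task = Task.get_task(task_id=pipeline_task_id)
    # Assumed: each pipeline step is a child Task of the pipeline Task
    step_tasks = Task.get_tasks(task_filter={"parent": pipeline_task.id})
    statuses = [pipeline_task.get_status()] + [t.get_status() for t in step_tasks]
    return all(status == "completed" for status in statuses)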
The validation itself need not even occur in a ClearML setting at all – that is yet to be decided, and (for the purposes of this discussion) does not really matter! We may decide to set up the validation as a Task or a Pipeline, or we may decide to handle the validation without ClearML logging of any kind, but that is an entirely separate issue.
And the validation must happen outside of the pipeline, because the ClearML pipeline as a whole is the product being validated: the feature pipeline itself is the thing under test.
########################################################################
If we were doing this manually, we would:
- run pipeline_script.py, which contains the pipeline code as decorators.
- This would serialise the pipeline, upload it to the ClearML server, and run it on whatever test input we have defined – we have everything set up already to do this remotely, so this is not an issue.
- manually look on the ClearML Pipelines page and copy the Task ID of the pipeline run.
- paste this Task ID into our validation script and run it.
- The validation script would get the task (of the pipeline run) from the ClearML Server based on its Task ID, check the “Task status” of each of the pipeline steps and of the pipeline itself, and investigate each of the artifacts that were produced throughout the pipeline and its steps (see the sketch after this list).
- Dependent upon some criteria we may add a tag which enables us to identify which pipeline(s) on the ClearML Server has(/have) passed the tests – and again, the pipeline that is being tagged is the product, not the validator.
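Building on the sketch above, the validation script itself could look roughly like this (again only a sketch: the parent-filter assumption is the same, and the artifact check is a placeholder for whatever criteria we eventually decide on):

from clearml import Task

def validate_pipeline(pipeline_task_id):
    pipeline_task = Task.get_task(task_id=pipeline_task_id)
    step_tasks = Task.get_tasks(task_filter={"parent": pipeline_task.id})

    # 1) The pipeline and every one of its steps must have completed successfully
    if any(t.get_status() != "completed" for t in [pipeline_task] + step_tasks):
        return False

    # 2) Investigate the artifacts produced by each step
    for step in step_tasks:
        for name, artifact in step.artifacts.items():
            value = artifact.get()  # or artifact.get_local_copy() for file artifacts
            if value is None:  # placeholder criterion – to be replaced with real checks
                return False
    return True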
However, of course, looking at the ClearML Pipelines page and deciding which pipeline ID corresponds to the pipeline you have just created with pipeline_script.py is not a CI/CD solution. We therefore need a way of getting that pipeline (task) ID programmatically.
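To give a concrete example of what I mean, the CI script could capture the ID from the script's output – this is only a sketch, and it assumes we add a line to pipeline_script.py that prints the pipeline run's Task ID as the last line of its stdout (it uses subprocess because os.system only returns an exit status):

import subprocess

# Launch the script that creates and runs the pipeline, capturing its output
result = subprocess.run(
    ["python3", "pipeline_script.py"],
    capture_output=True,
    text=True,
    check=True,
)
# Assumed convention: the script prints the pipeline run's Task ID as its last line
pipeline_id = result.stdout.strip().splitlines()[-1]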
########################################################################
Hopefully, this is enough context and explanation to show why your earlier pseudo-code is what we really need to do this.
I apologise for having said something after you posted that comment which confused the situation; that was not my intention!
pipeline_a_id = os.system("python3 create_pipeline_a.py")
This would create and run the pipeline remotely and return the Task ID.
We would then wait for the task with that ID to finish running.
We would then pass the Task ID into the validate function, and add a tag if it passed the tests.
# This code is just illustrative
from clearml import Task

pipeline_task = Task.get_task(task_id=pipeline_a_id)
# Wait for the pipeline run to reach a final status
# (raise_on_status=() disables the default RuntimeError on a failed run)
pipeline_task.wait_for_status(raise_on_status=())
passed = validate_pipeline(pipeline_a_id)
if passed:
    pipeline_task.add_tags(["passed"])
else:
    pipeline_task.add_tags(["failed"])