Hi @BattyCrocodile47
Can you help me make the case for ClearML pipelines/tasks vs Metaflow?
Based on my understanding
- Metaflow cannot have custom containers per step (at least I could not find where to push them)
- DAG only execution. I.e. you cannot have logic driven flows
- Cannot connect git repositories to different components in the pipeline
- Visualization of results / artifacts is rather limited
- Only Kubernetes is supported as underlying provisioning. Although plugins for IaaS (AWS/GCP/Azure) are available, they do not seem trivial to configure, and seem to need to be configured as part of the pipeline itself (but I might be wrong here)
- No caching available (i.e. if a component was already executed with the same arguments/code, reuse it)
- I do not believe there is any role based access control on top (i.e. it seems everyone is an "admin")
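To make the caching point in the list above concrete, here is a minimal sketch of what "reuse a component that already ran with the same arguments/code" means. It is not ClearML's or Metaflow's implementation; `run_step`, `step_key`, and the in-memory `_cache` are hypothetical names used only for illustration.

```python
import hashlib
import json

# Hypothetical in-memory store; a real system would persist results elsewhere.
_cache = {}


def step_key(func, kwargs):
    """Fingerprint a step from its compiled code, its constants, and its arguments."""
    code = func.__code__
    digest = hashlib.sha256()
    digest.update(code.co_code)
    digest.update(repr(code.co_consts).encode("utf-8"))
    digest.update(json.dumps(kwargs, sort_keys=True).encode("utf-8"))
    return digest.hexdigest()


def run_step(func, **kwargs):
    """Run func(**kwargs), or return the stored result if this exact step already ran.

    Returns a (result, was_cached) tuple.
    """
    key = step_key(func, kwargs)
    if key in _cache:
        return _cache[key], True
    result = func(**kwargs)
    _cache[key] = result
    return result, False
```

A sketch like this is only safe when the step has no side effects and its arguments are JSON-serializable, which is the difficulty raised later in the thread about guaranteeing side-effect free execution in Python.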
As a rule of thumb, Metaflow was created to build batch inference pipelines, and I think it is very good at that, as an alternative to, for example, SageMaker.
It was not, however, designed to be a tool for R&D to production acceleration, and this is exactly what ClearML does. ClearML helps you build the pipelines as part of the research and engineering, not as a standalone "production" process. This means flexibility and visibility are key concepts that seem to be missing from Metaflow, which is designed with more "devops" in mind, rather than ML engineers / data scientists.
My two cents of course, and if anyone feels differently or wants to share their experience, please do!
Thanks! A few thoughts below
- not true - you can specify the image you want for each step
My apologies, looking at the release notes, it was added a while back and I had not noticed.
- re: role-base access control - see Outerbounds Platform that provides a layer of security and auth features required by enterprises
By role-based access I mean limiting access within Metaflow, i.e. specific users/groups can only access specific projects etc., not authentication.
- "R&D to production acceleration" is what Metaflow has been about since the very beginning.
Hmm, I think it's like saying Jenkins does that. In both cases this is automating a process, but the question is always how often you change the process. Usually not very often, hence it is not continuous R&D to production (if that makes sense).
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto step functions, the scheduling is managed for you and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that's the case, then we'd probably only use ClearML for the R&D phase and then deploy with Metaflow. But idk if the DS teams would want to use two different tools that do somewhat similar tasks, so then they may opt to use Metaflow for everything.
We use MLflow as our model registry. Maybe we could use ClearML for experiment tracking only, since the UI is much better for that. Maybe we could completely switch to ClearML for model registry. Other tools have integrations with MLflow, though, such as BentoML, which discourages us from doing that :P
Oh this is thought provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me, speaking as an MLOps engineer). And having the power of Metaflow's scheduler (on Step Functions with EventBridge, since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions about how we could do event-based jobs with alerting built in on ClearML in a different thread later on.
I pasted your points (anonymously) onto the Metaflow Slack to let them speak to any updates that have happened in their product. If you care to read it, this is about as accurate a view as you can get on what Metaflow is today, since these were written by a Metaflow founder and core contributor.
Person 1:
Point by point:
- not true - you can specify the image you want for each step
- accurate
- not sure what that means
- there are cards and UI and integrations with other tools like Comet. So probably more limited than some and less limited than others
- I'll let the OB folks comment on this but yes, I think kube support is probably the most fleshed out (pure AWS is also pretty good since that is where it started)
- correct - it's a feature actually. We did discuss this quite a bit and it is really hard to guarantee side-effect free execution in Python
- I'll let OB comment on this.
Person 2:
- re: caching - resume does what most systems mean by caching but, like Romain mentioned, we don't make it overly magical as a feature
- re: kubernetes - @batch and step-functions are still great options which don't require K8s. I'd agree that the deployment is not trivial in the literal sense of the word. The terraform templates make it quite easy though
- re: role-base access control - see Outerbounds Platform that provides a layer of security and auth features required by enterprises
- "R&D to production acceleration" is what Metaflow has been about since the very beginning.
It is true though that there are plenty of tools targeting data scientists which provide a nice GUI that make it easier to get started with a few clicks - DataRobot is a great example!
While tools like these seem appealing at first sight, they often have a hard time supporting real-world production use cases with constantly changing data, involved business logic, larger scale, and multiple people working together.
Real-world ML systems shouldn't be islands. They must work well with the surrounding infrastructure and policies. Metaflow is serious about providing a solution that balances requirements both on the engineering as well as on the data science side - so data scientists can develop systems that engineers can happily approve - which might contribute to the impression that "Metaflow is designed with more "devops" in mind".
tl;dr Metaflow is designed with both devops and data scientists in mind!
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production?
(I'm assuming by event-based you mean triggered by events, not streaming data, i.e. ETL etc.)
I know of at least a few large organizations doing that as we speak, so I cannot see any reason not to.
That'd include monitoring and alerting. I'm afraid that Metaflow will look far more compelling to our teams for that reason.
Sure, then use Metaflow. The main issue with Metaflow (in my mind) is the lack of visibility into what's running inside the container (i.e. metrics / models etc), and the fact you have no scheduler, only k8s, and k8s has no real scheduling so you end up with jobs stuck in no order...
Since it deploys onto step functions, the scheduling is managed for you and I believe alerts for failing jobs can be set up without adding custom code to every pipeline
I assume you mean Metaflow here?
But idk if the DS teams would want to use two different tools that do somewhat similar tasks, so then they may opt to use Metaflow for everything.
No, this is bad. What I would do is use ClearML for the R&D, and if you insist on using Metaflow, use the clearml-agent to build dockers from Tasks, then launch those on Metaflow. This means you have full traceability and visibility, and you are still using Metaflow for the triggering / monitoring.
So, we've already got a model registry: MLFlow
And we've got a serving framework we really like: BentoML and FastAPI
The debate is between ClearML and Metaflow for
- training models during the research phase
- re-training models on a schedule or event-based trigger in production
- running batch inference jobs on a schedule or event-based trigger in production
For these functions, Metaflow offers:
- triggering: integration with AWS EventBridge. It's really easy to use Boto3 and AWS access keys to emit events for Metaflow DAGs. It's nice not to have to worry about networking for this.
- scheduling: the fact that Metaflow uses Step Functions is reassuring.
- observability: this lovely flame graph where you can view the logs and duration of each step in the DAG. It's easy to view all the DAG runs, including the ones that have failed. Ideally, we would be able to see the status of all of our pipelines in a single UI.
- alerts: it's easy to set up alerts for all DAGs at once. Actually, this may not be set up the way I imagine.
But I really want:
- Data scientists to author their own pipelines
- Data scientists not to have to worry / understand how to set up alerts for failed tasks/pipelines
- Every pipeline to be set up with alerts--maybe this is just as hard with Metaflow as it is with ClearML.
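For reference, the "emit events with Boto3" triggering mentioned in the list above could look roughly like the sketch below. It uses the EventBridge `put_events` call from Boto3. The `Source` and `DetailType` values, the function names, and the payload are hypothetical, and the EventBridge rule that routes the event to a deployed flow is assumed to exist and is not shown.

```python
import json


def build_event_entry(flow_name, detail, bus_name="default"):
    """Build one entry for the EventBridge PutEvents API.

    The Source and DetailType naming below is a made-up convention for this sketch.
    """
    return {
        "Source": "custom.ml-pipelines",        # hypothetical source name
        "DetailType": f"trigger.{flow_name}",   # hypothetical naming scheme
        "Detail": json.dumps(detail),           # Detail must be a JSON string
        "EventBusName": bus_name,
    }


def emit_event(entry):
    """Send the entry to EventBridge; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_event_entry works without AWS deps

    client = boto3.client("events")
    response = client.put_events(Entries=[entry])
    if response["FailedEntryCount"]:
        raise RuntimeError(f"EventBridge rejected the event: {response['Entries']}")
    return response
```

Calling `emit_event(build_event_entry("TrainFlow", {"dataset": "s3://bucket/data"}))` would publish one event. Whether anything runs in response depends entirely on the rules configured on the event bus.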
Is there a low-effort way to set all these things up with ClearML open source or enterprise?