Do you think ClearML is a strong option for running event-based training and batch inference jobs in production?
(I'm assuming that by "event-based" you mean triggered by events, not streaming data, i.e. ETL-style jobs and the like)
I know of at least a few large organizations doing that as we speak, so I can't see any reason not to.
That’d include monitoring and alerting. I’m afraid that Metaflow will look far more compelling to our teams for that reason.
Sure, then use Metaflow. The main issue with Metaflow (in my mind) is the lack of visibility into what's running inside the container (i.e. metrics / models etc.), and the fact that you have no scheduler, only k8s; and since k8s has no real scheduling, you end up with jobs stuck in no particular order...
Since it deploys onto AWS Step Functions, the scheduling is managed for you, and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
I assume you mean Metaflow here?
But idk if the DS teams would want to use two different tools that do somewhat similar tasks, so they may opt to use Metaflow for everything.
No, I'd avoid that. What I would do is use ClearML for the R&D, and if you insist on using Metaflow, use the clearml-agent to build Docker images from Tasks, then launch those on Metaflow. This means you have full traceability and visibility, and you're still using Metaflow for the triggering / monitoring.
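To make that concrete, here is a rough sketch of the hand-off. The task ID and image name are placeholders, and the exact flags may vary between clearml-agent versions, so treat this as an illustration rather than a recipe:

```shell
# 1. Bake a ClearML-tracked experiment (Task) into a standalone Docker
#    image, capturing its code, environment, and configuration.
#    "aabbccdd11223344" and "my-train-task:latest" are made-up placeholders.
clearml-agent build --id aabbccdd11223344 --docker --target my-train-task:latest

# 2. Point a Metaflow step at that image, so Metaflow handles the
#    triggering/monitoring while the run stays visible in ClearML, e.g.
#    with the step decorator:  @kubernetes(image="my-train-task:latest")
```

The key design point is that the container itself reports metrics and models back to ClearML, so you keep experiment visibility regardless of which orchestrator launched it.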