Hi. Yes that totally makes sense. It’s just that we don’t want the logic that does the Jenkins trigger to be in a ClearML handler or task, but rather as a handler that acts as a subscriber in a pub-sub system.
This is because we already use a pub-sub architecture that handles retries, etc. We will also likely want multiple systems to react to notifications in the pub-sub system, and we already have a lot of setup for this.
I guess the conclusion is: I realize it’s possible to trigger things directly from ClearML — either from the services queue or as an individual task, but there are a number of reasons we’d like to integrate with our pub-sub system.
We could write several ClearML trigger handlers in the services queue that put targeted events on the queue, but I was hoping ClearML had a straightforward way to somehow represent ALL ClearML events as JSON so we could land them in our system.
But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems and practices to do that.
Okay, I think this is the issue: handler functions are not "supposed" to fail. They are supposed to trigger Tasks, and those can fail.
e.g.:
Model Tag Trigger -> handler function creates a Task -> Task does something, like build a container, trigger CI/CD, etc. -> Task fails/completes
This means for every trigger "fired" you have a Task that logs the entire execution of that trigger instance.
Another way of doing that is to use the Model Trigger to signal an event somewhere else like CI/CD or Jenkins etc.
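A minimal sketch of that second option, where the handler only forwards the event and does no real work itself. It assumes the handler is registered as the `schedule_function` of a `TriggerScheduler` model trigger and is called with the ID of the triggering model. The `publish` callable and the message fields are hypothetical; in an AWS setup `publish` could wrap a boto3 SNS `publish(TopicArn=..., Message=...)` call.

```python
import json
from typing import Callable


def make_publishing_handler(publish: Callable[[str], None]) -> Callable[[str], None]:
    """Build a trigger handler that forwards the triggering model ID as JSON.

    `publish` is a hypothetical callable supplied by the caller, e.g. a thin
    wrapper around an SNS or SQS client.
    """

    def handler(model_id: str) -> None:
        # Assumption: the scheduler calls the handler with the ID of the
        # model that fired the trigger. The field names are illustrative.
        message = json.dumps(
            {"source": "clearml", "event": "model_tagged", "model_id": model_id}
        )
        publish(message)

    return handler


# Usage with an in-memory list standing in for the real topic.
sent = []
handler = make_publishing_handler(sent.append)
handler("abc123")
print(sent[0])
```

Keeping the handler this thin means it has very little that can fail, and everything downstream (retries, alerting) stays in the external system.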
wdyt?
Thanks for the response @<1523701205467926528:profile|AgitatedDove14> !
What would you consider an event?
I was thinking of the TriggerScheduler's definition of an event. Pretty much anything the TriggerScheduler allows you to react to: it'd be great to be able to publish those events to a queue external to ClearML, e.g. a tag added to a model (or removed), a state in a task changing, etc. We'd want as much metadata about that event as possible. So if the event is due to a task being edited, then information about the prior state and the current state of the task, if that's possible.
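To make the ask concrete, here is a rough sketch of the kind of JSON event envelope I have in mind. The schema (`object_type`, `changed_fields`, `previous`, `current`, `emitted_at`) is purely hypothetical and not something ClearML emits; the prior and current snapshots are assumed to be plain dicts captured by whatever code publishes the event.

```python
import json
from datetime import datetime, timezone


def build_event(object_type, object_id, previous, current):
    """Return a JSON string describing a change between two snapshots."""
    # List every top-level field whose value differs between the snapshots.
    changed = sorted(
        key
        for key in set(previous) | set(current)
        if previous.get(key) != current.get(key)
    )
    return json.dumps(
        {
            "object_type": object_type,
            "object_id": object_id,
            "changed_fields": changed,
            "previous": previous,
            "current": current,
            "emitted_at": datetime.now(timezone.utc).isoformat(),
        }
    )


# Example: a model gains the "release candidate" tag.
event = build_event(
    "model",
    "abc123",
    previous={"tags": []},
    current={"tags": ["release candidate"]},
)
print(event)
```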
I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.
I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems and practices to do that.
We use NewRelic for monitoring, SQS for event queueing and SNS for triggering workloads. Pushing the event handler logic to AWS allows us to leverage that.
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example ?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
- Pipeline to prepare for deployment. Trigger a GitHub Actions pipeline that downloads the model weights and builds them into a docker image together with FastAPI or BentoML code. The build could fail and it'd be nice to have retries/dead letter queue for that.
- An approval process: Add a line item in our internal frontend application which shows models that are meant to be reviewed. The logic for this submission lives in AWS lambda. It'd be good for this to have retries/dead letter queue in case our Lambda function fails.
- Documentation: execute a Lambda function that generates a report about this model in Notion. Our Lambda that makes the Notion API calls may fail, so this could do with retries/a dead letter queue.
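The fan-out described in the list above can be sketched in memory as follows. This is only an illustration of the behaviour we get from SNS/SQS today (one event, several independent subscribers, per-subscriber retries, and a dead letter list), not a real AWS or ClearML API; `fan_out`, the subscriber names, and `max_attempts` are all made up for the example.

```python
def fan_out(event, subscribers, max_attempts=3):
    """Deliver `event` to every subscriber; return the dead-letter list."""
    dead_letters = []
    for name, handle in subscribers.items():
        for attempt in range(1, max_attempts + 1):
            try:
                handle(event)
                break  # delivered, move on to the next subscriber
            except Exception as exc:
                # After the last attempt, park the event instead of losing it.
                if attempt == max_attempts:
                    dead_letters.append((name, event, repr(exc)))
    return dead_letters


calls = {"flaky": 0}


def flaky_build(event):
    # Fails on the first attempt, succeeds on the second.
    calls["flaky"] += 1
    if calls["flaky"] < 2:
        raise RuntimeError("build failed")


def broken_report(event):
    # Always fails, so it should end up in the dead-letter list.
    raise RuntimeError("Notion API error")


dlq = fan_out(
    {"event": "model_tagged", "tag": "release candidate"},
    {"build_image": flaky_build, "notion_report": broken_report},
)
print([name for name, _, _ in dlq])
```

One failing subscriber does not block the others, and each one retries on its own schedule, which is the property we would like to keep.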
This is because we already use a pub-sub architecture that handles retries, etc. We will also likely want multiple systems to react to notifications in the pub-sub system, and we already have a lot of setup for this.
How would you integrate with your current system? Do you have a REST API or similar to trigger an event?
but I was hoping ClearML had a straightforward way to somehow represent ALL ClearML events as JSON so we could land them in our system.
Not sure I'm following where / how is the JSON created
Hi @<1541954607595393024:profile|BattyCrocodile47>
It seems to me that instead of implementing webhooks to react to things like adding a tag to a model
Did you look at this example ?
Can we straightforwardly stream ALL ClearML events to another system?
what would you consider an event?
The "basic" object type is Task. If a state in a task is changed via an API call, would that be an event? If a tag is added to the Task object via an "edit" call, would that be an event?
Sidenote: having events in a queue gives us some fault tolerance for retry logic.
if our code to emit alerts due to a critical failing task crashes,
What do you mean by "emits an alert"? If the code fails, then the Task fails and it immediately stops. If this is a component of a Pipeline, then you have retry mechanisms on the pipeline component itself (i.e. it is re-launched).
Tasks in general should not be considered short-lived (i.e. seconds); Tasks are long-lasting executions (think 10 min and up).
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example ?