Hi Friends, We Got On A Sales Call With ClearML Yesterday And A Discussion About Webhooks Came Up.

Answered

Hi friends, we got on a sales call with ClearML yesterday and a discussion about webhooks came up.

ClearML seems to not natively implement webhooks

It seems to me that instead of implementing webhooks to react to things like adding a tag to a model, or having a task fail, ClearML's approach is to have you write your own Python script running in the services queue which polls the ClearML API as fast as once per minute.

I'd be nervous to rely on this for the use cases we're considering: if it's our own Python code running in the services queue, then any bug we introduce in that services queue task could cause us to miss key events.

Can we straightforwardly stream ALL ClearML events to another system?

Rather than running our business-critical logic for responding to ClearML events in the services queue, is there a way we could simply stream ALL ClearML events and their metadata to a queue outside of ClearML, such as AWS SQS? This way, the logic for reacting to events could use the tools that we feel best about operating/developing (e.g. Step Functions, Lambda, EventBridge), which would consume the events from the queue.

Sidenote: having events in a queue gives us some fault tolerance for retry logic. For example, if our code to emit alerts due to a critical failing task crashes, the event could go onto a retry queue.
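
To make the retry idea concrete, here is a minimal in-memory sketch of retry-then-dead-letter behaviour. It is only an illustration: `drain`, `max_attempts`, and the event dictionaries are hypothetical names, and a managed queue such as SQS would provide this behaviour through its own redrive configuration rather than through code like this.

```python
from collections import deque
from typing import Callable, Iterable, List


def drain(events: Iterable[dict], handler: Callable[[dict], None],
          max_attempts: int = 3) -> List[dict]:
    """Run handler on each event; retry failures, then dead-letter them."""
    # Each queue entry carries the event and the number of the attempt about to run.
    pending = deque((event, 1) for event in events)
    dead_letters: List[dict] = []
    while pending:
        event, attempt = pending.popleft()
        try:
            handler(event)
        except Exception:
            if attempt >= max_attempts:
                # Out of attempts: park the event for later inspection.
                dead_letters.append(event)
            else:
                # Put it back at the end of the queue for another try.
                pending.append((event, attempt + 1))
    return dead_letters
```

An alerting handler that crashes on its first attempt would simply be retried, and an event that keeps failing ends up in the returned dead-letter list instead of being lost.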

  				
Posted one year ago by BattyCrocodile47


Answers 7

I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.

I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems and practices to do that.

We use NewRelic for monitoring, SQS for event queueing and SNS for triggering workloads. Pushing the event handler logic to AWS allows us to leverage that.

  				
Posted one year ago by BattyCrocodile47

But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems and practices to do that.

Okay I think this is the issue, handler functions are not "supposed" to fail, they are supposed to trigger Tasks, these can fail.
e.g.:
Model Tag Trigger -> handler function creates a Task -> Task does something, like build a container, trigger CI/CD, etc. -> Task fails/completes
This means for every trigger "fired" you have a Task that logs the entire execution of that trigger instance.

Another way of doing that is to use the Model Trigger to signal an event somewhere else like CI/CD or Jenkins etc.
wdyt?

  				
Posted one year ago by AgitatedDove14

It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example ?

Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:

1. A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
2. Pipeline to prepare for deployment. Trigger a GitHub Actions pipeline that downloads the model weights and builds them into a Docker image together with FastAPI or BentoML code. The build could fail and it'd be nice to have retries/a dead-letter queue for that.
3. An approval process: add a line item in our internal frontend application which shows models that are meant to be reviewed. The logic for this submission lives in AWS Lambda. It'd be good for this to have retries/a dead-letter queue in case our Lambda function fails.
4. Documentation: execute a Lambda function that generates a report about this model in Notion. Our Lambda that makes the Notion API calls may fail, so this could also do with retries/a dead-letter queue.

  				
Posted one year ago by BattyCrocodile47

Thanks for the response AgitatedDove14 !

What would you consider an event?

I was thinking of the TriggerScheduler's definition of an event. Pretty much anything the TriggerScheduler allows you to react to, it'd be great to be able to publish to a queue external to ClearML, e.g. a tag added to a model (or removed), a state in a task changing, etc. We'd want as much metadata about that event as possible. So if the event is due to a task being edited, then information about the prior state and the current state of the task, if that's possible.
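
As a rough illustration of the kind of payload described here, the sketch below builds a JSON event that carries both the prior and the current state plus a summary of what changed. This is not a ClearML format: `build_event` and every field name in it are assumptions made up for the example.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_event(object_type: str, object_id: str,
                before: Dict[str, Any], after: Dict[str, Any],
                occurred_at: Optional[datetime] = None) -> str:
    """Serialize a state change as JSON with prior state, current state, and a diff."""
    occurred_at = occurred_at or datetime.now(timezone.utc)
    # Keep only the keys whose values differ between the two snapshots.
    changed = {
        key: {"before": before.get(key), "after": after.get(key)}
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }
    return json.dumps({
        "object_type": object_type,   # hypothetical field, e.g. "task" or "model"
        "object_id": object_id,
        "occurred_at": occurred_at.isoformat(),
        "changed": changed,
        "before": before,
        "after": after,
    }, sort_keys=True)
```

A consumer that only cares about failed tasks could then filter on `changed["status"]["after"]` without needing to call back into ClearML.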

  				
Posted one year ago by BattyCrocodile47

This is because we have a pub-sub architecture that we already use, it can handle retries, etc. also we will likely want multiple systems to react to notifications in the pub sub system. We already have a lot of setup for this.

How would you integrate with your current system? Do you have a REST API or similar to trigger an event?

but I was hoping ClearML had a straightforward way to somehow represent ALL ClearML events as JSON so we could land them in our system.

Not sure I'm following where / how the JSON is created.

  				
Posted one year ago by AgitatedDove14

Hi BattyCrocodile47

It seems to me that instead of implementing webhooks to react to things like adding a tag to a model

Did you look at this example ?
None

Can we straightforwardly stream ALL ClearML events to another system?

what would you consider an event?
The "basic" object type is a Task. If the state of a Task is changed via an API call, would that be an event? If something is added to the Task object via an "edit" call, would that be an event?

Sidenote: having events in a queue gives us some fault tolerance for retry logic.

if our code to emit alerts due to a critical failing task crashes,

What do you mean by "emit alerts"? If the code fails, then the Task fails and it immediately stops. If this is a component of a Pipeline, then you have retry mechanisms on the pipeline component itself (i.e. it is re-launched).

Tasks, in general, should not be considered short-lived (i.e. seconds); Tasks are long-running executions (think 10 minutes and up).

It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example ?

  				
Posted one year ago by AgitatedDove14

Hi. Yes that totally makes sense. It’s just that we don’t want the logic that does the Jenkins trigger to be in a ClearML handler or task, but rather as a handler that acts as a subscriber in a pub-sub system.

This is because we have a pub-sub architecture that we already use, it can handle retries, etc. also we will likely want multiple systems to react to notifications in the pub sub system. We already have a lot of setup for this.

I guess the conclusion is: I realize it’s possible to trigger things directly from ClearML — either from the services queue or as an individual task, but there are a number of reasons we’d like to integrate with our pub-sub system.

We could write several ClearML trigger handlers in the services queue that put targeted events on the queue, but I was hoping ClearML had a straightforward way to somehow represent ALL ClearML events as JSON so we could land them in our system.
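
For reference, the "several targeted handlers" option could look roughly like the sketch below. The runnable part is plain Python: `make_publisher` and the message fields are hypothetical. The commented wiring to `TriggerScheduler` and boto3 is an assumption based on my reading of their documentation (including that the handler is called with the ID of the triggering object), so check parameter names before relying on it.

```python
import json
from typing import Callable


def make_publisher(event_type: str,
                   send: Callable[[str], None]) -> Callable[[str], None]:
    """Return a handler that forwards the triggering object's ID as a JSON message."""
    def handler(object_id: str) -> None:
        # Keep the handler trivial: serialize and hand off, no business logic here.
        send(json.dumps({"event_type": event_type, "object_id": object_id}))
    return handler


# Assumed wiring, not executed here (QUEUE_URL is a placeholder):
#   import boto3
#   from clearml.automation import TriggerScheduler
#
#   sqs = boto3.client("sqs")
#   send = lambda body: sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
#
#   trigger = TriggerScheduler(pooling_frequency_minutes=1)
#   trigger.add_model_trigger(
#       name="release-candidate",
#       schedule_function=make_publisher("model.tagged.release_candidate", send),
#       trigger_on_tags=["release candidate"],
#   )
#   trigger.start_remotely(queue="services")
```

The handler does nothing except publish, so the only ClearML-side failure mode is the publish call itself; everything else (retries, fan-out, alerting) stays in the pub-sub system.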

  				
Posted one year ago by BattyCrocodile47
