More of Pushing ClearML to Its Data Engineering Limits
The dark theme you have?
It's this Chrome extension! I forget it's even on sometimes. It gives you a keyboard shortcut to toggle dark mode on any website. I love it.
Success! Wow, so this means I can use ClearML training/inference pipelines as part of AWS Step Functions!
My plan is to have an AWS Step Functions state machine (DAG) that treats running a ClearML job as one step (task) in the DAG.
So I'd:
- Create an SQS queue with a Lambda worker.
- Place JSON events onto the queue, whose contents are the parameters the training/serving pipeline should take, plus the ID of the pipeline. Something like `{"pipeline_id": "...", "aws_sfn_task_token": "...", "s3_paths": [...]}`. (Step Functions injects that token into the message for you when the state uses the `.waitForTaskToken` service integration.) The `aws_sfn_task_token` can later be sent back with `boto3` to signal to AWS whether the task succeeded or failed: the Step Functions API has `SendTaskSuccess` and `SendTaskFailure` calls, exposed in `boto3` as `send_task_success` and `send_task_failure`.
- Trigger the Lambda to read a message off the queue and use the ClearML SDK to remotely start the job, passing the JSON event as one or more parameters (see the Lambda sketch below).
- (Failure case): If the pipeline task fails, have a trigger react to that and use `boto3` to emit a `SendTaskFailure` API call with the `aws_sfn_task_token`. The state machine could react to that and place the inputs onto a retry queue.
- (Success case): If the pipeline task succeeds, have a trigger react to that and use `boto3` to emit a `SendTaskSuccess` call, with the appropriate celebratory handling of that 🎉 (both signaling paths are sketched after this list).
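
To make the Lambda step concrete, here's a minimal sketch. It assumes the message shape above, that "starting the job" means cloning an existing "template" pipeline task and enqueuing the clone, and that the queue name `pipelines` and the `General/...` parameter names are my own inventions:

```python
import json

import boto3
from clearml import Task


def handler(event, context):
    """SQS-triggered Lambda: start one ClearML job per queued message."""
    for record in event["Records"]:
        body = json.loads(record["body"])

        # Clone the "template" pipeline task and hand it our inputs.
        template = Task.get_task(task_id=body["pipeline_id"])
        job = Task.clone(source_task=template, name="sfn-triggered run")

        # Stash the Step Functions token on the task itself, so whatever
        # reacts to success/failure later can signal AWS (param names are mine).
        job.set_parameters(
            {
                "General/aws_sfn_task_token": body["aws_sfn_task_token"],
                "General/s3_paths": json.dumps(body["s3_paths"]),
            }
        )

        Task.enqueue(job, queue_name="pipelines")
```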
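For the success/failure side, one option is ClearML's `TriggerScheduler`, which can fire a callback when a task reaches a given status; the callback reads the token back off the task and calls the Step Functions API. A sketch, assuming a clearml version where `add_task_trigger` supports `trigger_on_status`, and reusing the hypothetical parameter names from the Lambda sketch (`my-pipelines` is a made-up project name):

```python
import boto3
from clearml import Task
from clearml.automation import TriggerScheduler

sfn = boto3.client("stepfunctions")


def report_to_sfn(task_id):
    """Callback: report a finished ClearML task back to Step Functions."""
    task = Task.get_task(task_id=task_id)
    token = task.get_parameters().get("General/aws_sfn_task_token")
    if not token:
        return  # not a Step Functions-managed run
    if task.get_status() == "completed":
        sfn.send_task_success(taskToken=token, output="{}")  # 🎉
    else:
        sfn.send_task_failure(
            taskToken=token,
            error="ClearMLTaskFailed",
            cause=f"task {task_id} ended as {task.get_status()}",
        )


scheduler = TriggerScheduler(pooling_frequency_minutes=2)
scheduler.add_task_trigger(
    name="sfn-callback",
    schedule_function=report_to_sfn,
    trigger_project="my-pipelines",
    trigger_on_status=["completed", "failed", "aborted"],
)
scheduler.start()
```

The waiting lives here rather than in the Lambda itself, since a long training run would blow past Lambda's 15-minute execution cap.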
Note: The event on the queue could include a UUID that acts as a "Trace ID", too. That way, if I can figure out how to use the `awslogs` driver for the Docker containers run by the ClearML workers, then we can correlate all logs in the greater DAG (assuming all logs contain the Trace ID).
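
The Trace ID generation itself is trivial; the open question is getting the worker's container onto the `awslogs` driver. One possible route (untested; it assumes `Task.set_base_docker` on your clearml version accepts `docker_arguments`, and the log group name is made up) is to pass the Docker logging flags per-task:

```python
import uuid

from clearml import Task

trace_id = str(uuid.uuid4())  # also goes into the SQS message

job = Task.get_task(task_id="...")  # e.g. the cloned task from the Lambda sketch
job.set_parameter("General/trace_id", trace_id)
job.add_tags([f"trace:{trace_id}"])  # searchable in the ClearML UI

# Ask the agent to run the container with CloudWatch logging.
# "/clearml/workers" is a hypothetical log group; the awslogs driver's
# `tag` option names the log stream, so streams are keyed by Trace ID.
job.set_base_docker(
    docker_arguments=(
        "--log-driver=awslogs "
        "--log-opt awslogs-group=/clearml/workers "
        f"--log-opt tag={trace_id}"
    )
)
```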