
I'll try to describe the scenario I was thinking would cause ClearML to break down:
Assume:
- We've got a queue called `streaming`
- We've got an S3 bucket with images landing inside
- When the images land, they go into a queue
- When there are 100 images in the queue, we trigger a ClearML pipeline (see the sketch after this list) to ingest, transform, run inference on the batch, and then write the results somewhere
- Let's say we get 1,000,000 images in the bucket per hour. That might be 1,000,000 / 100 = 10,000 batches. ...
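For the triggering side, here's roughly what I had in mind: a minimal sketch, assuming a pipeline has already been registered once so its controller task can be cloned. The ID, parameter name, and queue are placeholders.

```python
# Minimal sketch: kick off a pipeline run for each batch of 100 image keys.
# PIPELINE_TEMPLATE_ID and the "General/image_keys" parameter are placeholders.
from clearml import Task

PIPELINE_TEMPLATE_ID = "<pipeline-controller-task-id>"

def trigger_pipeline_for_batch(image_keys):
    # A pipeline run is just a task: clone the controller and enqueue the clone
    run = Task.clone(source_task=PIPELINE_TEMPLATE_ID, name="ingest-batch")
    run.set_parameter("General/image_keys", ",".join(image_keys))
    Task.enqueue(run, queue_name="services")
```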
Man, I owe you lunch sometime @<1523701205467926528:profile|AgitatedDove14> . Thanks for being so detailed in your answers.
Okay! So the pipeline ID is really just a task ID. So cool!
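If that's right, then the plain Task API should work on a pipeline too. A tiny sketch (placeholder ID):

```python
from clearml import Task

# A pipeline ID is just a task ID, so Task.get_task applies
pipeline_task = Task.get_task(task_id="<pipeline-id>")
print(pipeline_task.get_status())  # e.g. "completed" or "failed"
```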
Not sure I fully understand what you mean here...
Sorry, I'll try again. Here's an illustrated example with AWS Step Functions (pretend this is a ClearML pipeline). If the pipeline fails, I'd want to have a chance to do some logic to react to that. Maybe in a step called "on_pipeline_failed" or someth...
possibly cheaper on the cloud (Lambda vs EC2 instance)
Whoa, are you saying there's an autoscaler that doesn't use EC2 instances? I may be misunderstanding, but that would be very cool.
Maybe I should have said: my plan is to use AWS Step Functions where a single task in the DAG is an entire ClearML pipeline. The non-ClearML steps would orchestrate putting messages into a queue, doing retry logic, and triggering said pipeline.
I think at some point, there has to be some amount of...
I SOLVED IT, NO NEED TO READ FURTHER
I'm a chump and didn't read the docs: None
Oh, I think I got overexcited and didn't look at this closely. So this ACCESS/SECRET key pair is on the `agent-services` container.
I can see that `agent-services` is simply a container running `clearml-agent daemon --queue ser...`
@<1557175205510516736:profile|ShallowSwan53> at this point, I think this question deserves its own thread. I'm curious about it too!
That could work! Is that an option? Something that lets me spin up the ClearML server and get a services worker to connect to it without manual steps.
At the time that I run `python aws_autoscaler.py --remote`, that `clearml-services` worker is the only worker on the `services` queue. So it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup because `CLEARML_API_HOST` is not set, even though it is set for the Docker container that the agent is running in.
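One way I might debug it: enqueue a throwaway task that just prints the environment it sees (sketch):

```python
# Sketch: check whether CLEARML_API_HOST reaches the process the agent launches
import os

print("CLEARML_API_HOST =", os.environ.get("CLEARML_API_HOST"))
```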
Here's the full autoscaler log where the failure happens if that's helpful.
Sorry, clarifying: the `agent-services` entry in the docker-compose file seems to add a single worker to the `services` queue.
Hey! Sorry, I don't think I ever solved this for Elasticsearch
@<1523701205467926528:profile|AgitatedDove14> you beautiful person, this is terrific! I do believe SageMaker has some nice monitoring/data drift capabilities that seem interesting, but these points you have here will be a fantastic starting point for my team's analysis of the products. I think this will help balance some of the over-enthusiasm towards using the native AWS solution.
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get ...
I did a quick local experiment and observed that credentials created from the UI indeed become invalid if you delete the ClearML volumes.
- starting docker-compose locally
- creating a set of credentials from the UI
- hardcoding those credentials into the docker-compose file
- restarting
- the `agent-services` container started up and successfully became a registered worker
- I killed the docker-compose and deleted the volume folders
- restarted the docker-compose (with the same hard-coded...
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part, which is creating `clearml-session` in the first place.
This could be a really low-hanging OSS contribution that could make a real impact.
The question I'm exploring remains: is it possible to acquire that initial set of ClearML API keys programmatically so that the manual steps of 1-4 above can be avoided for an initial deployment?
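The closest thing I've spotted is the REST API's auth.create_credentials call, but it still needs one working credential set to authenticate with, so it doesn't fully solve the bootstrap problem. A sketch, with the caveat that I haven't verified the response shape:

```python
# Sketch (unverified): minting additional credentials once one set exists
from clearml.backend_api.session.client import APIClient

client = APIClient()  # authenticates from clearml.conf / env vars
resp = client.auth.create_credentials()
print(resp)  # should contain the new access/secret key pair
```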
If the load balancer or API Gateway can do the computation and leverage caching, we're much safer against DDoS attacks. In general, I'd prefer not to have our EC2 instance directly exposed to the public Internet.
I took a stab at writing an automated trigger to handle this. The goal is: anytime a pipeline succeeds or fails, let AWS know so that the input records can be placed onto a retry queue (or not)
I'm trying to get a trigger to work in general, and then I'll add the more complex AWS logic. But I seem to be missing a step somewhere:
I wrote a file called `set_triggers.py`:
```python
from clearml.automation.trigger import TriggerScheduler
TRIGGER_SCHEDULER = TriggerScheduler()
from pprint import...
```
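For context, here's the shape of what I'm attempting, based on my reading of `TriggerScheduler` (the project name and callback body are placeholders):

```python
from clearml.automation.trigger import TriggerScheduler

def on_pipeline_done(task_id):
    # Placeholder: the AWS retry-queue logic would go here
    print(f"pipeline task {task_id} finished")

scheduler = TriggerScheduler(pooling_frequency_minutes=1.0)
scheduler.add_task_trigger(
    name="pipeline-status-trigger",
    trigger_project="my-pipelines",  # placeholder project
    trigger_on_status=["completed", "failed"],
    schedule_function=on_pipeline_done,
)
scheduler.start()  # blocks; start_remotely() would run it on the services queue
```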
Thank you! I think it does. It's just now dawning on me that, because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances, and another queue for larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously....
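In pipeline terms, I believe that's just a per-step execution queue; a sketch with made-up project and queue names:

```python
from clearml import PipelineController

pipe = PipelineController(name="example-pipeline", project="examples", version="0.0.1")
pipe.add_step(
    name="preprocess",
    base_task_project="examples",
    base_task_name="preprocess",
    execution_queue="cpu-queue",  # many agents on small CPU instances
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="examples",
    base_task_name="train",
    execution_queue="gpu-queue",  # larger GPU instances
)
pipe.start(queue="services")
```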
You know, you could probably add some immortal containers to the `docker-compose.yml` that use images with `mongodump` and the ES equivalent installed.
The container(s) could have a bash script with a while loop in it that sleeps for 30 minutes and then does a backup, as sketched below. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because `docker-compose.yml` could make sure that if the backup container ever dies, it would be restarted.
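Something like this, written here in Python for concreteness (a bash version would be analogous; the mongodump flags are assumptions to adjust for your deployment):

```python
# Sketch: periodic backup loop for the container's entrypoint
import subprocess
import time

while True:
    subprocess.run(
        ["mongodump", "--host", "mongo", "--archive=/backups/mongo.dump"],
        check=False,  # don't kill the loop if one backup fails
    )
    # an ES snapshot call and an `aws s3 cp` upload could go here too
    time.sleep(30 * 60)  # 30 minutes
```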
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLflow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML
If this works, we might be able to fully replace Metaflow with ClearML!
(Referring to the feature where Metaflow creates Step Functions state machines for you, and then you can use those to trigger event-driven batch jobs in the same way described here)
Will do!
That's fabulous. This is definitely how my team prefers to structure projects. I hadn't gotten around to trying that out in our POC of ClearML yet, but I'm certain this is how our group will solve this problem.
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an `on-startup.sh` script you really like using to set up your instance, and VS Code extensions you want to pass to the `--install-extensions ...` flag.
Hi friends, I'm just seeing these new messages. I read these links and I agree with @<1557175205510516736:profile|ShallowSwan53> . It's nice that the webapp has these pages, but what is the workflow to actually use this registry?
Also, @<1557175205510516736:profile|ShallowSwan53> , do you have a specific workflow in mind that you're hoping to get from ClearML?
At BEN, we're experimenting with
- BentoML for model serving. It's a Python REST framework a lot like FastAPI, but with some nice...
Is there a way we can protect a ClearML deployment with a load balancer or API Gateway that is exposed to the whole world, but is protected by authentication so that only authorized clients can get in?
I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.
I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems a...
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have one backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
This is a low-key open-source project if anyone wanted to contribute. Since the project is early, there are lots of high-impact things, e.g. UI polish, that would be relatively low effort.