
Oh, this is thought-provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me, speaking as an MLOps engineer 😆). And having the power of Metaflow's scheduler (on Step Functions with EventBridge, since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions in a separate thread later on about how we could do event-based jobs with built-in alerting on ClearML.
I pasted your points (anonymously) onto the Metaflow Slack to le...
This is totally what I was looking for! Yeah, by "good story for offline batch" I meant, "good feature support for ..."
I bookmarked this comment. I think I'll be doing a POC trying to show this functionality within the next month.
Actually that's wrong: really this is the current volume mount
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
Could changing that value to /root/.ssh work? Do you know what user ClearML is using inside the Docker image?
Let's see. The screenshots above are me running on the host, not attaching to a running container. So I believe I do want the keys to be mounted into the running containers.
Oh hooray! So docker-compose manages the restarting of crashed containers? I didn't know that, and that is great 😄
Oh, that is cool. I captured all this. Maybe I'll make a user-data.sh script and a docker-compose.yml file that bring all these things together. Probably won't have time for a few weeks.
That is great! This is all the motivation I needed to decide to do a POC at some point.
Oh my goodness. Thank you! I'd seen that before, but for some reason it didn't register I could run that with VS Code...
But this config should almost never need to change!
Host clearml-session
HostName localhost
User root
Port 8022
Actually, dumb question: how do I set the setup script for a task?
That could work! Is that an option? Something that lets me spin up the ClearML server and get a services worker to connect to it without manual steps.
I'm not seeing an extra_docker_shell_script in my clearml.conf generated by clearml-agent init, like in this guide. I also don't see it as an argument in Task.init or Task.execute_remotely.
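For reference, here's roughly the kind of call I was hoping to find. This is only a sketch: I haven't confirmed that set_base_docker accepts a docker_setup_bash_script argument in my SDK version, and the project/task/queue names and Docker image are placeholders.

from clearml import Task

# Sketch only: a per-task "setup script" via Task.set_base_docker.
# I haven't verified the exact signature in my SDK version; the
# docker_setup_bash_script argument is the thing I'm hoping exists.
task = Task.init(project_name="sandbox", task_name="setup-script-test")
task.set_base_docker(
    docker_image="python:3.10",
    # shell lines the agent would run inside the container before my code starts
    docker_setup_bash_script=[
        "apt-get update",
        "apt-get install -y openssh-client",
    ],
)
task.execute_remotely(queue_name="default", exit_process=True)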
So I get output with this one, but the console only shows me the output from my machine. For example, the SSH key is present, and whoami results in ericriddoch. I have the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script().
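For context, the ordering I'm testing looks roughly like this. run_shell_script is my own helper; it's simplified here to a plain subprocess call, and the project/task/queue names are placeholders.

import subprocess

from clearml import Task


def run_shell_script() -> None:
    # Simplified stand-in for my actual helper: print the current user so I
    # can tell whether this ran on my laptop or inside the agent's container.
    print(subprocess.run(["whoami"], capture_output=True, text=True).stdout)


task = Task.init(project_name="sandbox", task_name="where-does-this-run")

# Variant 1: enqueue first, then call the helper. With exit_process=True the
# local process should stop here and everything below should run on the agent.
task.execute_remotely(queue_name="default", exit_process=True)
run_shell_script()

# Variant 2 (also tried): calling run_shell_script() *before*
# task.execute_remotely(...); either way the console output I see is from my
# machine (whoami -> ericriddoch).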
I'll try to describe the scenario I was thinking would cause ClearML to break down:
Assume:
- We've got a queue called streaming
- We've got an S3 bucket with images landing inside
- When the images land, they go into a queue
- When there are 100 images in the queue, we trigger a ClearML pipeline to ingest, transform, run inference on the batch, and then write the results somewhere
- Let's say we get 1,000,000 images in the Bucket per hour. That might be 1,000,000 / 100 = 10,000 batches. ...
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
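To make the streaming scenario above concrete, this is roughly the trigger code I have in mind. Just a sketch: the template pipeline task ("batch-inference-pipeline" in project "inference"), the parameter name, and the queue wiring are all assumptions on my part.

from clearml import Task

BATCH_SIZE = 100


def maybe_trigger_pipeline(pending_image_keys: list) -> None:
    """Clone a pre-registered pipeline task and enqueue it once a full batch
    of images has accumulated (e.g. called from an S3-event consumer)."""
    if len(pending_image_keys) < BATCH_SIZE:
        return

    # Look up the template pipeline task (registered once, ahead of time).
    template = Task.get_task(
        project_name="inference", task_name="batch-inference-pipeline"
    )

    # Clone it so each batch gets its own task, attach the batch as a
    # parameter, and push it onto the queue our agents listen on.
    batch_task = Task.clone(source_task=template, name="batch-inference")
    batch_task.set_parameter(
        "General/image_keys", ",".join(pending_image_keys[:BATCH_SIZE])
    )
    Task.enqueue(batch_task, queue_name="streaming")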
Is there a way we can put a ClearML deployment behind a load balancer or API Gateway that is exposed to the whole world but protected by authentication, so that only authorized clients can get in?
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get ...
I'm imagining:
- The EC2 instance would be in a private subnet, accessible only on the VPN (read: VPC)
- The API Gateway and Load Balancer would also be in the VPC and therefore have access to the private subnet, BUT the API Gateway or Load Balancer itself would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
Sorry, clarifying:
The agent-services entry in the docker-compose file seems to add a single worker to the services queue.
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That’d include monitoring and alerting. I’m afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto Step Functions, the scheduling is managed for you, and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that’s the case, then we’d probably only...
So, we've been able to run sudo su and then git clone with our private repos a few times now.
Thanks for this!! I may try it, and if it works, I'll look into writing a plugin for ZenML and Metaflow that auto-initializes the parent task and registers the steps as child tasks. Super helpful, thank you!
Dang! @<1590514584836378624:profile|AmiableSeaturtle81> awesome answer, thank you! You seem like an awesome person to know. Definitely connect if you'd like to talk ops stuff sometime.
That's with the key at /root/.ssh/id_rsa
The dark theme you have
It's this Chrome extension! I forget it's even on sometimes. It gives you a keyboard shortcut to toggle dark mode on any website. I love it.
Success! Wow, so this means I can use ClearML training/inference pipelines as part of AWS Step Functions!
My plan is to have an AWS Step Functions state machine (DAG) that treats running a ClearML job as one step (task) in t...
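Roughly what I'm picturing for the "run a ClearML job" step: two small handlers the state machine could call. This is only a sketch; the template task/project/queue names and the event fields (like batch_id) are placeholders I made up.

from clearml import Task


def launch_clearml_job(event, context):
    """State-machine step 1 (e.g. a Lambda): clone a pre-registered template
    task, enqueue it, and hand the new task id back to Step Functions."""
    template = Task.get_task(project_name="training", task_name="train-template")
    job = Task.clone(source_task=template, name=f"train-{event['batch_id']}")
    Task.enqueue(job, queue_name="default")
    return {"clearml_task_id": job.id}


def check_clearml_job(event, context):
    """State-machine step 2, called in a wait/poll loop: report whether the
    ClearML task has finished so the state machine can branch on it."""
    task = Task.get_task(task_id=event["clearml_task_id"])
    status = str(task.get_status())  # e.g. "queued", "in_progress", "completed", "failed"
    return {
        "clearml_task_id": task.id,
        "status": status,
        "done": status in ("completed", "failed"),
    }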
I see. Is it possible for two agents to share the same GPU? (like if the machine has a terrific GPU, but only the one?)
Let's see. The task log? I think this is it.
Oh, there's parallelization as well. You could have step 1 gather the data, and then fan out to N parallel steps that all do different things with the data, for example hyperparameter tuning.
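Something like this sketch is what I mean by the fan-out, assuming the base tasks ("gather_data" and "process_chunk") were registered ahead of time; the project and queue names are made up.

from clearml import PipelineController

# Sketch: one data-gathering step, then N parallel steps that all list it as
# their parent and therefore run concurrently once it finishes.
pipe = PipelineController(name="fan-out-example", project="sandbox", version="0.0.1")

pipe.add_step(
    name="gather_data",
    base_task_project="sandbox",
    base_task_name="gather_data",
)

for i in range(4):
    pipe.add_step(
        name=f"process_chunk_{i}",
        parents=["gather_data"],
        base_task_project="sandbox",
        base_task_name="process_chunk",
        parameter_override={"General/chunk_index": i},
    )

# The controller itself typically runs on the services queue.
pipe.start(queue="services")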