@<1523701070390366208:profile|CostlyOstrich36> Oh that's smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
If this works, we might be able to fully replace Metaflow with ClearML!
(Referring to the feature where Metaflow creates Step Functions state machines for you, and then you can use those to trigger event-driven batch jobs in the same way described here)
To do this, I think I need to know:
- Can you trigger a pre-existing Pipeline via the ClearML REST API? I'd want to have a Lambda function trigger the Pipeline for a batch without needing to have all the Pipeline code in the Lambda function. Something like the curl below (rough SDK sketch after this list)
curl -u '<clearml credentials>'
None,...
- [probably a big ask] If the pipeline succeeds/fails, can ClearML emit an event that I can react to? Like mayb...
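For the first bullet, here's roughly what I'm imagining the Lambda would do, sketched with the ClearML Python SDK instead of raw curl. The task ID, queue name, and handler are all placeholders, not something ClearML prescribes:

```python
# Hypothetical Lambda handler: clone a pre-existing pipeline controller task
# and enqueue the clone, so the pipeline code doesn't need to live in the Lambda.
from clearml import Task

PIPELINE_TEMPLATE_TASK_ID = "abc123"  # assumed ID of the existing pipeline controller task

def handler(event, context):
    template = Task.get_task(task_id=PIPELINE_TEMPLATE_TASK_ID)
    cloned = Task.clone(source_task=template, name="batch-triggered pipeline run")
    Task.enqueue(cloned, queue_name="services")  # queue name is an assumption
    return {"enqueued_task_id": cloned.id}
```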
This is totally what I was looking for! Yeah, by "good story for offline batch" I meant, "good feature support for ..."
I bookmarked this comment. I think I'll be doing a POC trying to show this functionality within the next month.
I'm not seeing an extra_docker_shell_script in my clearml.conf generated by clearml-agent init like in this guide
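(From what I can tell you can just add it to the agent section yourself even if init didn't generate it; roughly like the documented example:)

```
agent {
    # shell commands run inside the docker container before the task starts
    extra_docker_shell_script: ["apt-get install -y bindfs", ]
}
```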
Thank you! I think it does. It's just now dawning on me that: because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances and another queue for larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously....
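To check my own understanding, I think routing different steps to different queues looks roughly like this with PipelineController (the step functions and queue names are placeholders):

```python
from clearml import PipelineController

# placeholder step functions; the real logic would live in our repo
def preprocess_fn():
    print("preprocessing on a CPU instance")

def train_fn():
    print("training on a GPU instance")

pipe = PipelineController(name="example-pipeline", project="examples", version="0.0.1")

# each step can be sent to its own queue, e.g. small CPU boxes vs. larger GPU boxes
pipe.add_function_step(
    name="preprocess",
    function=preprocess_fn,
    execution_queue="cpu-queue",   # assumed queue name
)
pipe.add_function_step(
    name="train",
    function=train_fn,
    parents=["preprocess"],
    execution_queue="gpu-queue",   # assumed queue name
)

# the controller itself runs on the services queue
pipe.start(queue="services")
```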
I don't know about this, but could you turn your whole project into a pip-installable package using a setup.py and/or pyproject.toml? I've never tried this, but maybe then you could do pip install -e . locally before executing the task. Then execute. And then maybe the pip freeze that ClearML does would contain the symlink to your directory (so that from my_package import ... statements would work).
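Something like this minimal setup.py is what I had in mind (the package name is just a placeholder):

```python
# hypothetical minimal setup.py so `pip install -e .` works from the repo root
from setuptools import find_packages, setup

setup(
    name="my_package",  # placeholder matching the `from my_package import ...` example above
    version="0.1.0",
    packages=find_packages(),
)
```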
Hi. Yes that totally makes sense. It's just that we don't want the logic that does the Jenkins trigger to be in a ClearML handler or task, but rather as a handler that acts as a subscriber in a pub-sub system.
This is because we have a pub-sub architecture that we already use; it can handle retries, etc. Also, we will likely want multiple systems to react to notifications in the pub-sub system. We already have a lot of setup for this.
I guess the conclusion is: I realize it's possible...
I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.
I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems a...
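To make that concrete, this is roughly the kind of subscriber-side watcher I mean: a small poller that looks for failed tasks carrying a certain tag and forwards them to our existing pub-sub. The publish function and tag name are placeholders, not anything ClearML provides:

```python
import time

from clearml import Task

def publish(topic, message):
    # placeholder for our real pub-sub client (SNS, Kafka, etc.)
    print(f"publish to {topic}: {message}")

def watch_failed_tasks(tag="alert-on-failure", poll_seconds=60):
    seen = set()
    while True:
        # find failed tasks carrying the tag we care about
        failed = Task.get_tasks(tags=[tag], task_filter={"status": ["failed"]})
        for task in failed:
            if task.id not in seen:
                seen.add(task.id)
                publish("ml-task-failures", {"task_id": task.id, "name": task.name})
        time.sleep(poll_seconds)
```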
Wow, that is seriously impressive.
Oh awesome @<1523701132025663488:profile|SlimyElephant79> ! If you want to take a look, I made a big list of things to add. I'm working on a docker-compose.yaml file so we can have a good local development environment.
There's a lot of room to improve this from cleaning up the code to adding features on the list.
I took a look
- I think the Outerbounds extension (the one in my screenshot) is currently closed source. That makes sense to me. A bit sad because it is highly similar.
- Another example could be the AWS Toolkit extension. But sadly, it's hardly a "minimal example". I was thinking it's relevant because it uses your local ~/.aws/ folder, which is similar to what we'd want to do.
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an on-startup.sh script you really like using to set up your instance, and VS Code extensions you want to pass to the --install-extensions ... flag
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
Oh duh, thanks. What about non-standard entrypoints (as opposed to arguments), like accelerate launch train.py?
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent process on it and then have one task per agent? And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default? Or would this only work if they were listening to different queues?
Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLflow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML
So, we've already got a model registry: MLFlow
And we've got a serving framework we really like: BentoML and FastAPI
The debate is between ClearML and Metaflow for
- training models during the research phase
- re-training models on a schedule or event-based trigger in production
- running batch inference jobs on a schedule or event-based trigger in production
I'd really prefer it was modular enough to use serving with any model registry
Oh that's interesting. To serve a model from MLflow, would you have to copy it over to ClearML first?
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do.
And look! It's working, assuming you start the clearml session up yourself:
Here's the repo: I've recorded a few update videos documenting how we learned about authoring VS Code extensions and how we got it to its current state. Linked to those in order in the README.
ChatGPT has made working with TypeScript and the VSCode extension framework really nice! None
It doesn't seem to want to show me stdout
The key seems to be placed in the expected location
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
Here we go. Trying with this
Oh this is thought-provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me, speaking as an MLOps engineer). And having the power of Metaflow's scheduler (on Step Functions with EventBridge, since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions about how we could do event-based jobs with alerting built in on ClearML in a different thread later on.
I pasted your points (anonymously) onto the Metaflow slack to le...