Reputation
Badges 1
124 × Eureka!My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent
process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default
. Or would this only work if they were listening to different queues?
Yes, it's pretty lame that a clearml-agent
can only process one task at a time if it's not listening to a services
queue π€
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get ...
Thank you! I think it does. Itβs just now dawning on me that: because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said, in your other response, we could have a Q for smaller CPU-based instances, and another queue larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously....
This thread should be immortalized. Super stoked to try this out!
Disclaimer: I'm not familiar enouch with the ClearML codebase to vouch for the quality of this PR, although it is short which is typically good . The feature we're interested in is the ability to specify the subnet_id
.
I may be able to prepare a PR that only allows specifying the subnet ID. Can you help me brainstorm scenarios youβd want to see tested? Also, do these need to be automated tests?
Oh this is thought provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me speaking as an MLOps engineer π ). And having the power of Metaflow's scheduler (on Step Functions with Event Bridge since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions about how we could do event-based jobs with alerting built in on ClearML in a different thread later on.
I pasted your points (anonymously) onto the Metaflow slack to le...
Thanks for the response @<1523701205467926528:profile|AgitatedDove14> !
What would you consider an event?
I was thinking of the TriggerScheduler
's definition of an event. Pretty much, any thing the TriggerSchedule allows you to react to, it'd be great to be able to publish those events to a queue external to ClearML, e.g. a tag added to a model (or removed), a state in a task changing, etc. We'd want as much metadata about that event as possible. So if the event is due to a task...
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example ?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
- Pipeline to prepare for deployment. Trigger a GitHub Actions ...
I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.
I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems a...
Hi. Yes that totally makes sense. Itβs just that we donβt want the logic that does the Jenkins trigger to be in a ClearML handler or task, but rather as a handler that acts as a subscriber in a pub-sub system.
This is because we have a pub-sub architecture that we already use, it can handle retries, etc. also we will likely want multiple systems to react to notifications in the pub sub system. We already have a lot of setup for this.
I guess the conclusion is: I realize itβs possible...
I symlinked
/opt/clearml
to
/mnt/xvda/clearml
Genius! I don't think I accounted for making sure the volumes ended up in the EBS volume mount in this CDK example ^^^. And I modified the docker-compose.yml
file to point at a different location. Sym-linking is totally the route I should take if I get time to come back and clean up this repo.
Oh my goodness. Thank you! I'd seen that before, but for some reason it didn't register I could run that with VS Code...
But this config should almost never need to change!
Host clearml-session
HostName localhost
User root
Port 8022
I'm trying to add a docker-compose.yaml
to the repo to
- make it more convenient for contributors to develop locally
- spin up a local ClearML instance in CI to run automated tests
Here's the docker-compose file (mostly the standard file, except I altered the volume mounts, and I added minIO)
Here's [the clearml.conf file](https://github.com/mlops-club/vscode-clearml-sessi...
Here's a docker-compose I've been playing with. It doesn't have the same restart problem you're describing, but I did change the volume mounts: None
I've also tried running a clearml-agent daemon
directly on my mac (not in docker) serving the sessions
queue for the ClearML server that is running in docker. When I do that, it consistently fails with a different error. Something to do with mounting a volume.
The agent commands are nothing special.
clearml-agent daemon --queue sessions --cpu-only --create-queue true --docker
And for the session
clearml-session --queue sessions --docker python:3.9
Hmm... these people are recommending restarting docker completely. I may have tried that already, but I'll do it again when I get some time to be sure.
Hey! Sorry, I don't think I ever solved this for elasticsearch π
Here's the repo: I've recorded a few update videos documenting how we learned about authoring VS Code extensions and how we got it to it's current state. Linked to those in order in the README.
ChatGPT has made working with TypeScript and the VSCode extension framework really nice! None
Interesting . Itβs actually just running locally on my laptop. It seemed only to be an issue when pointing the ClearML session CLI at my local version of ClearML. Still thinking about this one.
The issue went away. I'm still not sure why, but what finally made it work was creating a set of credentials manually in the UI and then setting those in my ~/clearml.conf
file.
Do you happen to have a link to a docker-compose.yaml
file that has a hardcoded set of credentials?
I want to seed the clearml instance with a set of credentials and ~/clearml.conf
to run automated tests.
While I'm wishing for things: it'd be awesome if it had a queue already set up. But if there's not a way to do that in the docker compose file, I could potentially write a script that uses the creds to create one using API calls
Oh hooray! So docker-compose manages the restarting of crashed containers? I didn't know that, and that is great π
This is totally what I was looking for! Yeah, by "good story for offline batch" I meant, "good feature support for ..."
I bookmarked this comment. I think I'll be doing a POC trying to show this functionality within the next month.
Is there some way we could programmatically list all current ClearML sessions?
We need a way to do that, maybe with the clearml-session
CLI in order to populate the VS Code extension menu.