
Oh, I think I got overexcited and didn't look at this closely. So this ACCESS/SECRET key pair is on the agent-services container. I can see that agent-services is simply a container running `clearml-agent daemon --queue ser...`
You mean as experiment management / model registry / data? I think that's the bread and butter of ClearML.
I was wondering if anyone had had experience using ClearML together with one of these others.
I think most of them are alternatives to metaflow
Totally.
Like, if you google "dagster and clearml" or "prefect and clearml" or "airflow and clearml" -- I don't find any blogs written by people talking about how they use both of them together.
That's strange to me, becau...
I may be able to prepare a PR that only allows specifying the subnet ID. Can you help me brainstorm scenarios you'd want to see tested? Also, do these need to be automated tests?
That is great! This is all the motivation I needed to decide to do a POC at some point.
Is there some way we could programmatically list all current ClearML sessions?
We need a way to do that, maybe with the clearml-session CLI, in order to populate the VS Code extension menu.
Oh! System tags! That would definitely have been a better way to do it. We ended up querying for tasks in the "DevOps" project with the name "Interactive Session"
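For reference, here's roughly the query we ended up with, via the clearml Python SDK (the status filter is my assumption about what "current" should mean):

```python
from clearml import Task

# clearml-session creates its tasks in the "DevOps" project
# with the name "Interactive Session", so we filter on that.
sessions = Task.get_tasks(
    project_name="DevOps",
    task_name="Interactive Session",
    task_filter={"status": ["in_progress"]},  # only sessions still running
)
for session in sessions:
    print(session.id, session.name, session.status)
```

Filtering on the system tags instead would presumably be the same call with a different filter.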
I'd really prefer it was modular enough to use serving with any model registry.
Oh that's interesting. To serve a model from MLflow, would you have to copy it over to ClearML first?
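(Maybe not a full copy? If I understand OutputModel correctly, you can register an external URI as a ClearML model without uploading the weights. A sketch, with the MLflow artifact path and framework made up:)

```python
from clearml import Task, OutputModel

task = Task.init(project_name="serving", task_name="register-mlflow-model")
model = OutputModel(task=task, framework="xgboost")  # framework is illustrative
# register_uri links the existing artifact instead of copying it into ClearML
model.update_weights(register_uri="s3://mlflow-artifacts/run-123/model.pkl")
```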
Oh hooray! So docker-compose manages the restarting of crashed containers? I didn't know that, and that is great!
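(For anyone else wondering: it's the restart policy in the server's docker-compose file that does this; something like the following, if I'm reading the stock file right:)

```yaml
services:
  apiserver:
    restart: unless-stopped  # compose restarts the container whenever it crashes
```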
I'm not seeing an extra_docker_shell_script in my clearml.conf generated by clearml-agent init, like in this guide.
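If it helps anyone else landing here: the guide's snippet gets added by hand under the agent section; clearml-agent init doesn't emit it. Roughly (the apt-get line is just an example):

```
agent {
    # each entry runs inside the task's docker container before the task starts
    extra_docker_shell_script: ["apt-get install -y vim", ]
}
```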
Interesting. It's actually just running locally on my laptop. It seemed only to be an issue when pointing the ClearML session CLI at my local version of ClearML. Still thinking about this one.
Yeah, I believe all VS Code Extensions are in TypeScript. My main point was that this is an example of a VS Code extension that executes a Python CLI.
I've also used Airflow and Dagster in prod, but not integrated them with an exp tracker.
I'll try to describe the scenario I was thinking would cause ClearML to break down (rough code sketch after the list):
Assume:
- We've got a queue called streaming
- We've got an S3 bucket with images landing inside
- When the images land, they go into a queue
- When there are 100 images in the queue, we trigger a ClearML pipeline to ingest, transform, run inference on the batch, and then write the results somewhere
- Let's say we get 1,000,000 images in the bucket per hour. That's 1,000,000 / 100 = 10,000 batches per hour. ...
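Something like this is what I have in mind for the trigger, using only clone + enqueue from the clearml SDK (TEMPLATE_TASK_ID and the batching plumbing are placeholders for our own code):

```python
from clearml import Task

TEMPLATE_TASK_ID = "<pipeline-template-task-id>"  # placeholder

def on_batch_ready(image_keys):
    """Called by our (hypothetical) queue consumer once 100 images accumulate."""
    # clone the template pipeline task, point it at this batch, and enqueue it
    batch_task = Task.clone(source_task=TEMPLATE_TASK_ID, name="ingest-batch")
    batch_task.set_parameter("General/image_keys", ",".join(image_keys))
    Task.enqueue(batch_task, queue_name="streaming")
```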
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
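For example (untested, and I believe CLEARML_WORKER_ID is what keeps the worker names distinct; treat this as a sketch):
CLEARML_WORKER_ID=ec2-box:0 clearml-agent daemon --queue streaming --detached
CLEARML_WORKER_ID=ec2-box:1 clearml-agent daemon --queue streaming --detached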
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get ...
I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.
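(Sketching what that could look like with clearml's TriggerScheduler; parameter names are from memory, so treat this as a sketch rather than gospel:)

```python
from clearml.automation import TriggerScheduler

def alert_on_failure(task_id):
    # placeholder: wire this into whatever alerting we already run (PagerDuty, etc.)
    print(f"task {task_id} failed")

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_task_trigger(
    name="alert-on-failed-production-tasks",
    trigger_on_status=["failed"],
    trigger_required_tags=["production"],
    schedule_function=alert_on_failure,
)
scheduler.start()
```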
I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems a...
For now, I've written a headless selenium script to generate credentials for the fresh ClearML instance in CI.
We should put a $100 bounty on a bash script that backs up and restores mongodb, redis, ES, etc. to S3 in the most resilient way.
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an on-startup.sh script you really like using to set up your instance, and VS Code extensions you want to pass to the --install-extensions ... flag.
The agent commands are nothing special.
`clearml-agent daemon --queue sessions --cpu-only --create-queue true --docker`
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do.
And look! It's working, assuming you start the clearml session up yourself:
Disclaimer: I'm not familiar enough with the ClearML codebase to vouch for the quality of this PR, although it is short, which is typically good. The feature we're interested in is the ability to specify the subnet_id.
The key seems to be placed in the expected location
Let's see. The screenshots above are me running on the host, not attaching to a running container. So I believe I do want the keys to be mounted into the running containers.
The question I'm exploring remains: is it possible to acquire that initial set of ClearML API keys programmatically so that the manual steps of 1-4 above can be avoided for an initial deployment?
I literally just ran into this minutes ago and was about to file a bug report. A colleague ran into the same problem. It looks like urllib3 upgraded to v2 last week.
Thanks for this!! I may try it, and if it works I'll look into writing a plugin for ZenML and Metaflow that auto-initializes the parent task and registers the steps as child tasks. Super helpful, thank you!
Hey @AnxiousSeal95! I think ClearML's orchestrator is a great fit for ad-hoc experimentation, but not for (event-triggered) batch inference jobs that need to be relied on in production.
I'd only feel comfortable supporting pipelines that serve end users on a tool that is known for that, e.g. Metaflow, Dagster, or Airflow--mainly because those tools emphasize good monitoring and integration with the wider data ecosystem.