
possibly cheaper on the cloud (Lambda vs EC2 instance)
Whoa, are you saying there's an autoscaler that doesn't use EC2 instances? I may be misunderstanding, but that would be very cool.
Maybe I should have said: my plan is to use AWS Step Functions where a single task in the DAG is an entire ClearML pipeline. The non-ClearML steps would orchestrate putting messages into a queue, doing retry logic, and triggering said pipeline.
I think at some point, there has to be some amount of...
Here's a docker-compose I've been playing with. It doesn't have the same restart problem you're describing, but I did change the volume mounts.
If this works, we might be able to fully replace Metaflow with ClearML!
(Referring to the feature where Metaflow creates Step Functions state machines for you, and then you can use those to trigger event-driven batch jobs in the same way described here)
You mean as experiment management / model registry / data? I think this is the bread and butter of ClearML.
💯. I was wondering if anyone had had experience using ClearML together with one of these others.
I think most of them are alternatives to metaflow
Totally.
Like, if you google "dagster and clearml" or "prefect and clearml" or "airflow and clearml" -- I don't find any blogs written by people talking about how they use both of them together.
That's strange to me, becau...
While I'm wishing for things: it'd be awesome if it had a queue already set up. But if there's not a way to do that in the docker compose file, I could potentially write a script that uses the creds to create one using API calls
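A minimal sketch of the "script that creates a queue using API calls" idea, using only the Python standard library. It assumes the API server is reachable at `http://localhost:8008` (the usual port in the default docker-compose setup), that the server accepts HTTP basic auth with an access key and secret key, and that the `queues.create` endpoint takes a `name` field. The function name `build_create_queue_request` is made up for illustration.

```python
import base64
import json
import urllib.request


def build_create_queue_request(api_server, access_key, secret_key, queue_name):
    """Build (but do not send) a POST request to the queues.create endpoint."""
    # Assumption: the API server accepts "access_key:secret_key" as basic auth.
    credentials = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        url=f"{api_server.rstrip('/')}/queues.create",
        data=json.dumps({"name": queue_name}).encode(),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually create the queue once the containers are up, pass the request to `urllib.request.urlopen(...)`, for example with the queue name `sessions` used elsewhere in this thread.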
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do.
And look! It's working, assuming you start the clearml session up yourself:
Is there some way we could programmatically list all current ClearML sessions?
We need a way to do that, maybe with the `clearml-session` CLI, in order to populate the VS Code extension menu.
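One possible way to list sessions without the CLI is to query the API server for running tasks. The sketch below only builds the request body for `tasks.get_all` and turns a response into menu labels. It assumes that sessions show up as ordinary tasks whose name matches `Interactive Session`; that name pattern is a guess and may need adjusting for your setup, and both function names are hypothetical.

```python
def running_sessions_query(name_pattern="Interactive Session"):
    """Request body for tasks.get_all that selects running tasks by name."""
    return {
        "status": ["in_progress"],      # only tasks that are currently running
        "name": name_pattern,           # assumption: sessions are named like this
        "only_fields": ["id", "name", "status"],
    }


def to_menu_items(response_data):
    """Convert the 'data' part of a tasks.get_all response into menu labels."""
    return [f'{task["name"]} ({task["id"]})' for task in response_data.get("tasks", [])]
```

The same body could be sent from TypeScript in the extension; Python is used here only to keep the example short.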
To do this, I think I need to know:
- Can you trigger a pre-existing Pipeline via the ClearML REST API? I'd want to have a Lambda function trigger the Pipeline for a batch without needing to have all the Pipeline code in the lambda function. Something like
curl -u '<clearml credentials>'
None,...
- [probably a big ask] If the pipeline succeeds/fails, can ClearML emit an event that I can react to? Like mayb...
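For the first question, a sketch of what a Lambda handler might do: clone an existing pipeline controller task and enqueue the clone, so the Lambda does not need any of the pipeline code. This assumes the `tasks.clone` and `tasks.enqueue` endpoints, HTTP basic auth with an access key and secret key, and responses wrapped as `{"meta": ..., "data": ...}`. The names `make_api_call` and `trigger_pipeline`, and all IDs, are placeholders.

```python
import base64
import json
import urllib.request
from typing import Callable

ApiCall = Callable[[str, dict], dict]


def make_api_call(api_server: str, access_key: str, secret_key: str) -> ApiCall:
    """Return a function that POSTs JSON to an API endpoint using basic auth."""
    credentials = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()

    def call(endpoint: str, payload: dict) -> dict:
        request = urllib.request.Request(
            url=f"{api_server.rstrip('/')}/{endpoint}",
            data=json.dumps(payload).encode(),
            headers={
                "Authorization": f"Basic {credentials}",
                "Content-Type": "application/json",
            },
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            # Assumption: results are wrapped as {"meta": ..., "data": ...}.
            return json.load(response)["data"]

    return call


def trigger_pipeline(call: ApiCall, template_task_id: str, queue_id: str, run_name: str) -> str:
    """Clone the pipeline controller task, enqueue the clone, return its ID."""
    clone = call("tasks.clone", {"task": template_task_id, "new_task_name": run_name})
    new_task_id = clone["id"]
    call("tasks.enqueue", {"task": new_task_id, "queue": queue_id})
    return new_task_id
```

In a Lambda this would be roughly `trigger_pipeline(make_api_call(server, key, secret), template_id, queue_id, "batch-run")`. Passing `call` in as a parameter keeps the clone-and-enqueue logic testable without a live server.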
OOooh, excellent. So the file server isn't necessary at all if you're using some other object storage? That's slick!
Is there a way I could move the JWT authentication (not authorization) logic into an API Gateway or Load Balancer? For example, if ClearML is following OAuth 2.0, then the load balancer or API Gateway could reach out to its "issuer URL" (probably available on the EC2 instance where ClearML is running) like this example here.

- The API Gateway and Load Balancer would also be on the VPC and therefore have access to the private subnet BUT the API Gateway or Load Balancer themselves would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
I symlinked `/opt/clearml` to `/mnt/xvda/clearml`
Genius! I don't think I accounted for making sure the volumes ended up in the EBS volume mount in this CDK example ^^^. And I modified the `docker-compose.yml` file to point at a different location. Symlinking is totally the route I should take if I get time to come back and clean up this repo.
Duh! I bet VS Code's Python extensions like the VS Code Black Extension would be a really good starting place. They are small and are wrappers around a Python CLI tool. I bet there's a lot we could adapt for the ClearML CLI
Thanks Vasil! Can you elaborate on what you mean by using boto3? Do you mean writing a script using boto that pulls the credentials down and writes to the user's `clearml.conf`?
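One reading of the boto3 suggestion, as a rough sketch: store the ClearML settings as a JSON secret in AWS Secrets Manager, then have a script fetch it and write `clearml.conf`. The secret layout (keys `web_server`, `api_server`, `files_server`, `access_key`, `secret_key`) and the function names are assumptions made up for this example; only `get_secret_value` is the real Secrets Manager client call.

```python
import json
from pathlib import Path

# Layout mirrors the "api" section that clearml-init generates.
CONF_TEMPLATE = """\
api {{
    web_server: "{web_server}"
    api_server: "{api_server}"
    files_server: "{files_server}"
    credentials {{
        "access_key" = "{access_key}"
        "secret_key" = "{secret_key}"
    }}
}}
"""


def render_clearml_conf(secret: dict) -> str:
    """Fill the template from a dict holding the five assumed keys."""
    return CONF_TEMPLATE.format(**secret)


def write_clearml_conf(secrets_client, secret_id: str, conf_path: Path) -> Path:
    """Fetch a JSON secret and write it out as a clearml.conf file."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    conf_path.write_text(render_clearml_conf(secret))
    return conf_path
```

With boto3 installed, usage would look like `write_clearml_conf(boto3.client("secretsmanager"), "clearml/credentials", Path.home() / "clearml.conf")`, where the secret name `clearml/credentials` is hypothetical.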
Also, I've been seeing references to "credentials vault" in the docs. I can see this is the problem that it solves.
I'd really prefer it was modular enough to use serving with any model registry
Oh that's interesting. To serve a model from MLflow, would you have to copy it over to ClearML first?
Interesting. It's actually just running locally on my laptop. It seemed only to be an issue when pointing the ClearML session CLI at my local version of ClearML. Still thinking about this one.
And for the session
clearml-session --queue sessions --docker python:3.9
But I actually wish the interface were more like the `apiserver.conf` file: specifically, that you can define hard-coded credentials in this file in advance. Except, I wish that you could define API keys this way (or some other way)
auth {
    # Fixed users login credentials
    # No other user will be able to login
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "test"
                password: "test"
...
Hey! Sorry, I don't think I ever solved this for elasticsearch
How it works / what we finished:
- We used the SaaS ClearML, started an EC2 instance, and manually installed and ran the `clearml-agent daemon` on it
- We ran `clearml-init` on our laptops to generate the `clearml.conf` file.
- The extension is in TypeScript, so...
- We started trying to write code with the Python SDK to list sessions, but realized calling that from the extension would be hard, so we opted to have the TypeScript code make calls to the ClearML API server directly, e.g. ...
@AgitatedDove14 you beautiful person, this is terrific! I do believe SageMaker has some nice monitoring/data drift capabilities that seem interesting, but these points you have here will be a fantastic starting point for my team's analysis of the products. I think this will help balance some of the over-enthusiasm towards using the native AWS solution.
Oh, that is cool. I captured all this. Maybe I'll make a `user-data.sh` script and `docker-compose.yml` file that brings all these things together. Probably won't have time for a few weeks.
I SOLVED IT, NO NEED TO READ FURTHER
I'm a chump and didn't read the docs.
Oh, I think I got overexcited and didn't look at this closely. So this ACCESS/SECRET key pair is on the `agent-services` container.
I can see that `agent-services` is simply a container running `clearml-agent daemon --queue ser...
That's with the key at /root/.ssh/id_rsa
Hey @AnxiousSeal95! I think ClearML's orchestrator is a great fit for ad-hoc experimentation, but not for (event-triggered) batch inference jobs that need to be relied on in production.
I'd only feel comfortable supporting pipelines that serve end users on a tool that is known for that, e.g. Metaflow, Dagster, or Airflow--mainly because those tools emphasize good monitoring and integration with the wider data ecosystem.
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the `clearml-agent` process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. `default`? Or would this only work if they were listening to different queues?
I don't know about this, but could you turn your whole project into a pip-installable package using a `setup.py` and/or `pyproject.toml`?
I've never tried this, but maybe then you could do `pip install -e .` locally before executing the task. Then execute. And then maybe the `pip freeze` that ClearML does would contain the symlink to your directory (so that `from my_package import ...` statements would work).
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)