Totally worked!
Oh interesting. Is the hope that doing that would somehow result in being able to use those credentials to make authenticated API calls?
Oh, there's parallelization as well. You could have step 1 gather the data and then fan out to N parallel steps that all do different things with it, for example hyperparameter tuning.
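Roughly like this with `PipelineController` (hedged sketch; project names, the number of branches, and the tuning logic are placeholders):

```python
from clearml import PipelineController

# Hedged sketch of the fan-out idea: step 1 gathers data, then N parallel
# steps tune different hyperparameters off the same output.
def gather_data():
    return list(range(100))

def tune(data, learning_rate):
    print(f"tuning with lr={learning_rate} on {len(data)} rows")

pipe = PipelineController(name="fanout-demo", project="demo", version="1.0.0")
pipe.add_function_step(name="gather", function=gather_data, function_return=["data"])
for i, lr in enumerate([0.1, 0.01, 0.001]):
    pipe.add_function_step(
        name=f"tune_{i}",
        function=tune,
        # "${gather.data}" references the output of the "gather" step;
        # steps that share the same parent can run in parallel on agents.
        function_kwargs={"data": "${gather.data}", "learning_rate": lr},
        parents=["gather"],
    )
pipe.start_locally(run_pipeline_steps_locally=True)
```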
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have 1 backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
Oh wow. If this works, that will be insanely cool. What I'm going for: if I specify "username: test" and "password: test" in that file, then I can specify "api.access_key: test" and "api.secret_key: test" in the `clearml.conf` used for CI. I'll give it a try tonight!
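For the record, here's the shape I'm imagining (hedged; I'm assuming "that file" is the apiserver's fixed-users config, and that the fixed-user creds really do double as API keys):

```
# /opt/clearml/config/apiserver.conf (server side)
auth {
    fixed_users {
        enabled: true
        users: [
            { username: "test", password: "test", name: "Test User" }
        ]
    }
}
```

```
# clearml.conf used by CI (client side; ports assume the default docker-compose)
api {
    api_server: http://localhost:8008
    web_server: http://localhost:8080
    files_server: http://localhost:8081
    credentials {
        access_key: "test"
        secret_key: "test"
    }
}
```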
When you run the `docker-compose.yml` on an EC2 instance, you can configure user login for the ClearML webserver. But the files API is still open to the world, right? (and same with the backend?)
We could solve this by placing the EC2 instance into a VPN.
One disadvantage to that approach is it becomes annoying to reach the model registry from outside the VPN, like if you have a deployment pipeline based in GitHub Actions. Or if you wanted to trigger a ClearML pipeline from a VPC that isn...
Yay! Man, I want to do ClearML with "hard mode" (non-enterprise, self-hosted) first, before trying to sell BENlabs (my work) on it. I could see us paying for enterprise to get the Hyper Datasets and Vault features if our scientists/developers fall in love with it--they probably will if we can get them to adopt it since right now we have a homemade system that isn't nearly as nice as ClearML.
@<1523701087100473344:profile|SuccessfulKoala55> how exactly do you configure ClearML to use the cr...
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part, which is creating `clearml-session` in the first place.
This could be a really low-hanging OSS contribution that could make a real impact 😄 .
That is great! This is all the motivation I needed to decide to do a POC at some point.
Oh awesome @<1523701132025663488:profile|SlimyElephant79>! If you want to take a look, I made a big list of things to add. I'm working on a `docker-compose.yaml` file so we can have a good local development environment.
There's a lot of room to improve this from cleaning up the code to adding features on the list.
At the time that I run `python aws_autoscaler.py --remote`, that `clearml-services` worker is the only worker on the `services` queue, so it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup because `CLEARML_API_HOST` is not set, even though it is set in the docker container that the agent is running in.
Here's the full autoscaler log where the failure happens if that's helpful.
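One workaround I might try (hedged guess; assumes the agent runs tasks in docker mode, and the URL is a placeholder) is forwarding the variable into the containers the agent spawns via `clearml.conf`:

```
agent {
    # Pass CLEARML_API_HOST through to every container the agent starts.
    extra_docker_arguments: ["-e", "CLEARML_API_HOST=https://api.example.com"]
}
```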
Will do!
Man, I owe you lunch sometime @<1523701205467926528:profile|AgitatedDove14> . Thanks for being so detailed in your answers.
Okay! So the pipeline ID is really just a task ID. So cool!
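i.e. you can treat the pipeline run like any other task (minimal sketch; the ID is a placeholder):

```python
from clearml import Task

# A pipeline run is itself a Task, so the normal Task API applies to its ID.
pipeline_task = Task.get_task(task_id="<pipeline-id>")
print(pipeline_task.get_status())
```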
Not sure I fully understand what you mean here...
Sorry, I'll try again. Here's an illustrated example with AWS Step Functions (pretend this is a ClearML pipeline). If the pipeline fails, I'd want to have a chance to do some logic to react to that. Maybe in a step called "on_pipeline_failed" or someth...
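Something like this is what I'd want to be able to write (hedged sketch; `on_pipeline_failed` is a hypothetical hook, and I'm assuming the pipeline run can be looked up as a regular task, which may differ by server version):

```python
from clearml import PipelineController, Task

def on_pipeline_failed(run):
    # Hypothetical reaction hook: page someone, roll back, etc.
    print(f"pipeline {run.id} failed!")

pipe = PipelineController(name="demo-pipeline", project="demo", version="1.0.0")
# ... add steps here ...
pipe.start_locally()  # blocks until the run finishes

# The pipeline run is itself a task, so we can inspect its final status.
run = Task.get_task(project_name="demo", task_name="demo-pipeline")
if run.get_status() == "failed":
    on_pipeline_failed(run)
```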
You have no idea what is committed to disk vs what is still contained in memory.
If you ran `docker-compose down` and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
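For context, the cold-backup flow I'm picturing (hedged; assumes the default `/opt/clearml` layout from the self-hosting docs):

```bash
# Stop the stack so ES/Mongo/Redis flush and exit cleanly,
# then archive the data and config dirs, then bring it back up.
docker-compose down
sudo tar czf clearml-backup-$(date +%F).tar.gz /opt/clearml/data /opt/clearml/config
docker-compose up -d
```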
I’d really prefer it was modular enough to use serving with any model registry
Oh that's interesting. To serve a model from MLflow, would you have to copy it over to ClearML first?
@<1594863216222015488:profile|ConvincingGrasshopper20> throwing this out there... would you want to make this with me at the Hackathon??
I don't know that you'd have to pre-build credentials into docker. If you could specify a set of credentials as environment variables to the `docker run` command or something, that would work just fine.
The goal is to be able to run `docker-compose up` in CI, which starts a ClearML server, and then make several API calls to the started server to prove that the VS Code extension code is working (rough sketch after the list below).
Examples:
- Assert that the extension can auth with ClearML
- Assert that the ext...
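For the auth assertion, I'm picturing something like this (hedged sketch; assumes the seeded creds are exported as the standard `CLEARML_API_*` env vars and the default ports):

```python
import os
from clearml.backend_api.session.client import APIClient

# Point the SDK at the CI server started by `docker-compose up`.
os.environ.setdefault("CLEARML_API_HOST", "http://localhost:8008")
os.environ.setdefault("CLEARML_API_ACCESS_KEY", "test")
os.environ.setdefault("CLEARML_API_SECRET_KEY", "test")

def test_extension_can_auth_with_clearml():
    client = APIClient()          # raises if the credentials are rejected
    assert client.projects.get_all() is not None
```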
Hmm... these people are recommending restarting docker completely. I may have tried that already, but I'll do it again when I get some time to be sure.
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That’d include monitoring and alerting. I’m afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto Step Functions, the scheduling is managed for you, and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that’s the case, then we’d probably only...
Interesting. It’s actually just running locally on my laptop. It seemed only to be an issue when pointing the ClearML session CLI at my local version of ClearML. Still thinking about this one.
The issue went away. I'm still not sure why, but what finally made it work was creating a set of credentials manually in the UI and then setting those in my `~/clearml.conf` file.
Do you happen to have a link to a `docker-compose.yaml` file that has a hardcoded set of credentials?
I want to seed the clearml instance with a set of credentials and a `~/clearml.conf` to run automated tests.
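If no such file exists, maybe mounting a pre-baked config dir would do it (hedged sketch; `./ci/clearml-config` is hypothetical, and the container path follows the default server layout):

```yaml
services:
  apiserver:
    volumes:
      # Ship the fixed-users apiserver.conf into the server at startup.
      - ./ci/clearml-config:/opt/clearml/config
```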
So, we've already got a model registry: MLflow
And we've got a serving framework we really like: BentoML and FastAPI
The debate is between ClearML and Metaflow for
- training models during the research phase
- re-training models on a schedule or event-based trigger in production
- running batch inference jobs on a schedule or event-based trigger in production
While I'm wishing for things: it'd be awesome if it had a queue already set up. But if there's not a way to do that in the docker-compose file, I could potentially write a script that uses the creds to create one using API calls (rough sketch below).
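That script could be tiny (hedged sketch; assumes the `CLEARML_API_*` env vars already point at the CI server, and the queue name is arbitrary):

```python
from clearml.backend_api.session.client import APIClient

# Create a "default" queue if the fresh server doesn't have one yet.
client = APIClient()
if not client.queues.get_all(name="default"):
    client.queues.create(name="default")
```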
For these functions, Metaflow offers:
- triggering: integration with AWS EventBridge. It's really easy to use Boto3 and AWS access keys to emit events for Metaflow DAGs (see the sketch after this list). It's nice not to have to worry about networking for this.
- scheduling: the fact that Metaflow uses Step Functions is reassuring.
- observability: this lovely flame graph where you can view the logs and duration of each step in the DAG, and it's easy to view all the DAG runs, including the ones that have failed. Ideally, we w...
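Re: the triggering point above, the Boto3 side really is tiny (hedged sketch; bus name, source, and detail values are placeholders):

```python
import json
import boto3

# Emit an EventBridge event that a Metaflow DAG could be subscribed to.
events = boto3.client("events")
events.put_events(
    Entries=[
        {
            "EventBusName": "default",
            "Source": "my.training.app",
            "DetailType": "retrain-requested",
            "Detail": json.dumps({"dataset_version": "2024-01-01"}),
        }
    ]
)
```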