Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
Yay! Man, I want to do ClearML with "hard mode" (non-enterprise, self-hosted) first, before trying to sell BENlabs (my work) on it. I could see us paying for enterprise to get the Hyper Datasets and Vault features if our scientists/developers fall in love with it--they probably will if we can get them to adopt it since right now we have a homemade system that isn't nearly as nice as ClearML.
@SuccessfulKoala55 how exactly do you configure ClearML to use the cr...
At the time that I run python aws_autoscaler.py --remote, that clearml-services worker is the only worker on the services queue. So it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup because CLEARML_API_HOST is not set, even though it is set in the docker container that the agent is running in.
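In case it's just environment propagation, one workaround sketch would be to export the endpoints wherever the agent actually runs (the hostname is a placeholder for our server's address):

# placeholders -- substitute the server's real address and ports
export CLEARML_API_HOST=http://<server>:8008
export CLEARML_WEB_HOST=http://<server>:8080
export CLEARML_FILES_HOST=http://<server>:8081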
Here's the full autoscaler log where the failure happens if that's helpful.
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default? Or would this only work if they were listening to different queues?
Let's see. The task log? I think this is it.
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
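Something like this is what I have in mind--two daemons on the same machine, both serving the same queue and both pointed at the same device (queue name and device index are just examples):

# two agents on one box, sharing GPU 0; their tasks would compete for GPU memory
clearml-agent daemon --detached --queue default --gpus 0
clearml-agent daemon --detached --queue default --gpus 0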
Is there a way we can protect a ClearML deployment with a load balancer or API Gateway that is exposed to the whole world, but is protected by authentication so that only authorized clients can get in?
When you run the docker-compose.yml on an EC2 instance, you can configure user login for the ClearML webserver. But the files API is still open to the world, right? (And same with the backend?)
We could solve this by placing the EC2 instance into a VPN.
One disadvantage to that approach is it becomes annoying to reach the model registry from outside the VPN, like if you have a deployment pipeline based in GitHub Actions. Or if you wanted to trigger a ClearML pipeline from a VPC that isn...
If the load balancer or API Gateway can do that computation and leverage caching, we're much safer against DDoS attacks. In general, I'd prefer not to have our EC2 instance directly exposed to the public Internet.
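As a stopgap, the instance's security group could at least restrict the files/API ports to known networks. A sketch with the AWS CLI (the group ID and CIDR are placeholders):

# close the fileserver port (8081) to the world, then allow only our network
aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8081 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8081 --cidr 10.0.0.0/16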
We should put a $100 bounty on a bash script that backs up and restores mongodb, redis, ES, etc. to S3 in the most resilient way.
So the problem came back even with this new URL. I discovered clearing your cookies fixes it.
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, thereby guaranteeing that the backups wouldn't get corrupted?
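If cold backups are acceptable, a rough sketch might look like this (it assumes the default /opt/clearml data layout from the docker-compose.yml, and the bucket name is a placeholder):

#!/usr/bin/env bash
set -euo pipefail
cd /opt/clearml
# graceful shutdown lets ES/Mongo/Redis flush everything to disk first
docker-compose down
# sync the whole data directory (elastic, mongo, redis, fileserver) to S3
aws s3 sync /opt/clearml/data "s3://my-clearml-backups/$(date +%F)/"
docker-compose up -d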
But I actually wish the interface were more like the apiserver.conf file--specifically, that you can define hard-coded credentials in this file in advance. Except I wish that you could define API keys this way (or some other way):
auth {
    # Fixed users login credentials
    # No other user will be able to login
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "test"
                password: "test"
                ...
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLFlow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML
I symlinked /opt/clearml to /mnt/xvda/clearml
Genius! I don't think I accounted for making sure the volumes ended up in the EBS volume mount in this CDK example ^^^. And I modified the docker-compose.yml file to point at a different location. Symlinking is totally the route I should take if I get time to come back and clean up this repo.
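For reference, the symlink route is roughly this (the device name and mount point are assumptions--check lsblk for the actual EBS device on your instance type):

# one-time setup: format and mount the EBS volume
sudo mkfs -t xfs /dev/xvda     # WARNING: erases the volume; first use only
sudo mkdir -p /mnt/xvda
sudo mount /dev/xvda /mnt/xvda
# point the default server data location at the EBS volume
sudo mkdir -p /mnt/xvda/clearml
sudo ln -s /mnt/xvda/clearml /opt/clearml   # assumes /opt/clearml doesn't already exist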
That could work! Is that an option? Something that lets me spin up ClearML and get a services worker to connect to it without manual steps.
So, we've already got a model registry: MLFlow
And we've got a serving framework we really like: BentoML and FastAPI
The debate is between ClearML and Metaflow for
- training models during the research phase
- re-training models on a schedule or event-based trigger in production
- running batch inference jobs on a schedule or event-based trigger in production
Caching can be a reason. Say you do some heavy data loading / processing in step 1, and now you're developing step 2.
It'd be nice not to have to re-run step 1 every time you want to test a change to step 2.
You could find a way to simply write the output of step 1 to disk and do everything in one step, or you could let ClearML handle that caching for you--with the added benefit that others collaborating remotely can also use the outputs of steps you've cached with ClearML.
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do.
And look! It's working, assuming you start the clearml session up yourself:
Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue.
OOooh, excellent. So the file server isn't necessary at all if you're using some other object storage? That's slick!
Is there a way I could move the JWT authentication (not authorization) logic into an API Gateway or Load Balancer? For example, if ClearML follows OAuth 2.0, the load balancer or API Gateway could reach out to its "issuer URL" (probably available on the EC2 instance where ClearML is running) like this example here.
Here's the repo: I've recorded a few update videos documenting how we learned about authoring VS Code extensions and how we got it to its current state. I linked to those, in order, in the README.
ChatGPT has made working with TypeScript and the VSCode extension framework really nice!
I've also tried running a clearml-agent daemon directly on my Mac (not in docker), serving the sessions queue for the ClearML server that is running in docker. When I do that, it consistently fails with a different error, something to do with mounting a volume.
The agent commands are nothing special.
clearml-agent daemon --queue sessions --cpu-only --create-queue true --docker
And for the session:
clearml-session --queue sessions --docker python:3.9
Oh wow. If this works, that will be insanely cool. Like, I guess what I'm going for is that if I specify "username: test" and "password: test" in that file, I can then specify "api.access_key: test" and "api.secret_key: test" in the clearml.conf used for CI. I'll give it a try tonight!
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part, which is creating clearml-session in the first place.
This could be a really low-hanging OSS contribution that could make a real impact.