Oh duh, thanks. What about non-standard entrypoints (as opposed to arguments), like accelerate launch train.py?
I can't think of any changes we might have made on our side to cause that 🤔
Sorry, clarifying: the agent-services entry in the docker-compose file seems to add a single worker to the services queue.
Oh my goodness. Thank you! I'd seen that before, but for some reason it didn't register I could run that with VS Code...
But this config should almost never need to change!
Host clearml-session
    HostName localhost
    User root
    Port 8022
Oh! System tags! That would definitely have been a better way to do it. We ended up querying for tasks in the "DevOps" project with the name "Interactive Session"
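For the record, the query looked roughly like this (the project and task names are from our setup; a system-tag filter would replace the name match):

```python
from clearml import Task

# What we did: find interactive sessions by project + task name.
# Filtering on system tags would have been the cleaner approach.
sessions = Task.get_tasks(
    project_name="DevOps",
    task_name="Interactive Session",
)
for task in sessions:
    print(task.id, task.status)
```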
I’d really prefer it was modular enough to use serving with any model registry
Oh that's interesting. To serve a model from MLflow, would you have to copy it over to ClearML first?
So, we've been able to run sudo su and then git clone with our private repos a few times now.
Or the log of the init script?
Thanks Vasil! Can you elaborate on what you mean by using boto3? Do you mean writing a script using boto3 that pulls the credentials down and writes them to the user's clearml.conf?
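Something like this is what I had in mind, as a rough sketch. It assumes the credentials sit in AWS Secrets Manager as JSON; the secret name and field names here are hypothetical:

```python
import json
from pathlib import Path

import boto3

# Hypothetical secret holding ClearML API credentials as JSON,
# e.g. {"access_key": "...", "secret_key": "..."}
SECRET_ID = "clearml/user-api-credentials"

secret = boto3.client("secretsmanager").get_secret_value(SecretId=SECRET_ID)
creds = json.loads(secret["SecretString"])

# Write a minimal clearml.conf for this user
conf_text = (
    "api {\n"
    "    credentials {\n"
    f'        access_key: "{creds["access_key"]}"\n'
    f'        secret_key: "{creds["secret_key"]}"\n'
    "    }\n"
    "}\n"
)
(Path.home() / "clearml.conf").write_text(conf_text)
```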
Also, I've been seeing references to "credentials vault" in the docs. I can see this is the problem that it solves.
I'm imagining:
- The EC2 instance would be in a private subnet, accessible only on the VPN (read: VPC)
- The API Gateway and Load Balancer would also be on the VPC and therefore have access to the private subnet BUT the API Gateway or Load Balancer themselves would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
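For the JWT piece, I'm picturing something like a Lambda authorizer in front of the API Gateway. A rough sketch, assuming PyJWT and a shared HS256 secret (all names here are illustrative, not a ClearML API):

```python
import jwt  # PyJWT

SECRET = "replace-me"  # illustrative; in practice, fetch from a secrets store

def lambda_handler(event, context):
    """REQUEST-type Lambda authorizer: allow or deny the incoming
    ClearML SDK request based on a JWT in the Authorization header."""
    auth = event.get("headers", {}).get("Authorization", "")
    token = auth.removeprefix("Bearer ").strip()
    try:
        claims = jwt.decode(token, SECRET, algorithms=["HS256"])
        effect = "Allow"
    except jwt.InvalidTokenError:
        claims, effect = {}, "Deny"
    return {
        "principalId": claims.get("sub", "anonymous"),
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }
```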
Duh! I bet VS Code's Python extensions like the VS Code Black Extension would be a really good starting place. They are small and are wrappers around a Python CLI tool. I bet there's a lot we could adapt for the ClearML CLI
This is a low-key open-source project if anyone wanted to contribute. Since the project is early, there are lots of high-impact things, e.g. UI polish, that would be relatively low effort 😄
Oh awesome @<1523701132025663488:profile|SlimyElephant79> ! If you want to take a look, I made a big list of things to add. I'm working on a docker-compose.yaml file so we can have a good local development environment.
There's a lot of room to improve this from cleaning up the code to adding features on the list.
I think it will work. There's a lot of really useful code in the black extension. I'm recruiting people now to join in on Friday. I'm actually very confident about it after messing around.
Oh, there's parallelization as well. You could have step 1 gather the data, and then fan out to N parallel steps that all do different things with the data, for example hyperparameter tuning.
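As a rough sketch with ClearML's PipelineController (the step and function names here are made up for illustration):

```python
from clearml import PipelineController

def gather_data():
    # step 1: produce the data all branches consume
    return list(range(100))

def tune(data, learning_rate):
    # one branch of the fan-out, e.g. a single hyperparameter trial
    print(f"trial with lr={learning_rate} on {len(data)} rows")

pipe = PipelineController(name="fan-out-example", project="examples", version="1.0")
pipe.add_function_step(
    name="gather",
    function=gather_data,
    function_return=["data"],
)
# fan out: N parallel steps, all depending on the gather step
for i, lr in enumerate([0.1, 0.01, 0.001]):
    pipe.add_function_step(
        name=f"tune_{i}",
        function=tune,
        function_kwargs={"data": "${gather.data}", "learning_rate": lr},
        parents=["gather"],
    )
pipe.start_locally(run_pipeline_steps_locally=True)
```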
Hey! Sorry, I don't think I ever solved this for elasticsearch 😕
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That’d include monitoring and alerting. I’m afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto step functions, the scheduling is managed for you and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that’s the case, then we’d probably only...
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an on-startup.sh script you really like using to set up your instance, and VS Code extensions you want to pass to the --install-extensions ... flag.
Yay! Man, I want to do ClearML with "hard mode" (non-enterprise, self-hosted) first, before trying to sell BENlabs (my work) on it. I could see us paying for enterprise to get the Hyper Datasets and Vault features if our scientists/developers fall in love with it--they probably will if we can get them to adopt it since right now we have a homemade system that isn't nearly as nice as ClearML.
@<1523701087100473344:profile|SuccessfulKoala55> how exactly do you configure ClearML to use the cr...
At the time that I run python aws_autoscaler.py --remote, that clearml-services worker is the only worker on the services queue. So it will be the worker that picks up the autoscaler task.
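(As a sanity check, something like this should list each worker and the queues it's watching; this assumes the APIClient behaves as I remember:)

```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
for worker in client.workers.get_all():
    queue_names = [q.name for q in (worker.queues or [])]
    print(worker.id, queue_names)
```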
But the task seems to be failing on startup because CLEARML_API_HOST is not set, even though it is set for the docker container that the agent is running in.
Here's the full autoscaler log where the failure happens if that's helpful.
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default? Or would this only work if they were listening to different queues?
Let's see. The task log? I think this is it.
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
Is there a way we can protect a ClearML deployment with a load balancer or API Gateway that is exposed to the whole world, but is protected by authentication so that only authorized clients can get in?
When you run the docker-compose.yml on an EC2 instance, you can configure user login for the ClearML webserver. But the files API is still open to the world, right? (and same with the backend?)
We could solve this by placing the EC2 instance into a VPN.
One disadvantage to that approach is it becomes annoying to reach the model registry from outside the VPN, like if you have a deployment pipeline based in GitHub Actions. Or if you wanted to trigger a ClearML pipeline from a VPC that isn...
If the load balancer or API Gateway can do the computation and leverage caching, we're much safer against DDoS attacks. In general, I'd prefer not to have our EC2 instance directly exposed to the public Internet.