If the load balancer or API Gateway can do the computation and leverage caching, we're much safer against DDoS attacks. In general, I'd prefer not to have our EC2 instance directly exposed to the public Internet.
Totally worked!
I'm imagining:
- The EC2 instance would be in a private subnet, accessible only on the VPN (read: VPC)
- The API Gateway and Load Balancer would also be on the VPC and therefore have access to the private subnet BUT the API Gateway or Load Balancer themselves would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
This is a low-key open-source project if anyone wanted to contribute. Since the project is early, there are lots of high-impact things, e.g. UI polish, that would be relatively low effort.
Is there some way we could programmatically list all current ClearML sessions? We need a way to do that, maybe with the clearml-session CLI, in order to populate the VS Code extension menu.
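One idea (not verified): clearml-session seems to launch each session as a regular ClearML task, so maybe we could query them through the SDK. Something like this sketch, where the "DevOps" project and "Interactive Session" task name are my guesses at the defaults:

```python
from clearml import Task

# Hedged sketch: clearml-session sessions appear as ordinary ClearML tasks,
# so querying running tasks in the (assumed) default "DevOps" project should
# surface them for the VS Code extension menu.
session_task_ids = Task.query_tasks(
    project_name="DevOps",                     # assumed default project for sessions
    task_name="Interactive Session",           # assumed default task name (pattern match)
    task_filter={"status": ["in_progress"]},   # only sessions that are currently running
)

for task_id in session_task_ids:
    task = Task.get_task(task_id=task_id)
    print(task.id, task.name, task.get_status())
```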
For these functions, Metaflow offers:
- Triggering: integration with AWS EventBridge. It's really easy to use boto3 and AWS access keys to emit events for Metaflow DAGs (see the sketch after this list). It's nice not to have to worry about networking for this.
- Scheduling: the fact that Metaflow uses AWS Step Functions is reassuring.
- Observability: this lovely flame graph where you can view the logs and duration of each step in the DAG; it's easy to view all the DAG runs, including the ones that have failed. Ideally, we w...
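To make the triggering point concrete, here's a rough boto3 sketch of emitting a custom EventBridge event; the bus name, source, and detail-type are placeholders that an EventBridge rule for the Metaflow DAG would have to match:

```python
import json

import boto3

# Hedged sketch: emit a custom event that an EventBridge rule (wired to the
# Metaflow DAG) could match to start a run. The "default" bus, source, and
# detail-type below are placeholders/assumptions.
events = boto3.client("events", region_name="us-west-2")

response = events.put_events(
    Entries=[
        {
            "EventBusName": "default",
            "Source": "my.data.pipeline",
            "DetailType": "batch_ready",
            "Detail": json.dumps({"bucket": "my-bucket", "num_images": 100}),
        }
    ]
)
print(response["FailedEntryCount"])
```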
Oh wow. If this works, that will be insanely cool. Like, I guess what I'm going for is that if I specify "username: test" and "password: test" in that file, that I can specify "api.access_key: test" and "api.secret_key: test" in the clearml.conf used for CI. I'll give it a try tonight!
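For reference, here's roughly the clearml.conf I'm picturing for CI (a sketch; the hosts are placeholders for wherever the server runs, and whether fixed test/test credentials are actually accepted is exactly what I'll be testing):

```
# ~/clearml.conf used by CI (sketch; hosts are placeholders)
api {
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: http://localhost:8081
    credentials {
        "access_key" = "test"
        "secret_key" = "test"
    }
}
```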
Sorry, clarifying: the agent-services entry in the docker-compose file seems to add a single worker to the services queue.
I'll try to describe the scenario I was thinking would cause ClearML to break down:
Assume:
- We've got a queue called streaming
- We've got an S3 bucket with images landing inside
- When the images land, they go into a queue
- When there are 100 images in the queue, we trigger a ClearML pipeline to ingest, transform, run inference on the batch, and then write the results somewhere (see the sketch after this list)
- Let's say we get 1,000,000 images in the Bucket per hour. That might be 1,000,000 / 100 = 10,000 batches. ...
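For illustration, the per-batch trigger I have in mind could be a small script along these lines (a sketch; the template task and parameter names are made up):

```python
from clearml import Task

# Hedged sketch: for each batch of 100 images, clone a template pipeline task
# and push the clone onto the "streaming" queue. Project/task/parameter names
# are placeholders for illustration.
def enqueue_batch(image_keys):
    template = Task.get_task(project_name="ingest", task_name="batch-pipeline-template")
    run = Task.clone(source_task=template, name=f"batch of {len(image_keys)} images")
    run.set_parameter("General/image_keys", ",".join(image_keys))
    Task.enqueue(run, queue_name="streaming")
```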
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
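In other words, something like this on the one instance (a sketch; the queue name, worker count, and the CLEARML_WORKER_ID usage are my assumptions):

```bash
# Hedged sketch: start 4 detached clearml-agent workers on one EC2 instance,
# all listening to the same "streaming" queue. Giving each worker its own id
# via CLEARML_WORKER_ID is my assumption about how to keep them distinct.
for i in 1 2 3 4; do
    CLEARML_WORKER_ID="$(hostname):cpu-$i" clearml-agent daemon --queue streaming --detached
done
```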
It doesn't seem to want to show me stdout
Will do!
This thread should be immortalized. Super stoked to try this out!
Thanks Vasil! Can you elaborate on what you mean by using boto3? Do you mean writing a script using boto3 that pulls the credentials down and writes them to the user's clearml.conf?
Also, I've been seeing references to "credentials vault" in the docs. I can see this is the problem that it solves.
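If it helps, this is the kind of script I'm picturing (a sketch; the SSM parameter names and conf layout are made up for illustration):

```python
import os

import boto3

# Hedged sketch: pull per-user ClearML credentials out of SSM Parameter Store
# and write them into ~/clearml.conf. Parameter names, hosts, and the conf
# layout are assumptions for illustration.
ssm = boto3.client("ssm", region_name="us-west-2")

def get_param(name: str) -> str:
    return ssm.get_parameter(Name=name, WithDecryption=True)["Parameter"]["Value"]

access_key = get_param("/clearml/ci/access_key")
secret_key = get_param("/clearml/ci/secret_key")

conf = f"""
api {{
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: http://localhost:8081
    credentials {{
        "access_key" = "{access_key}"
        "secret_key" = "{secret_key}"
    }}
}}
"""

with open(os.path.expanduser("~/clearml.conf"), "w") as f:
    f.write(conf)
```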
When you run the docker-compose.yml on an EC2 instance, you can configure user login for the ClearML webserver. But the files API is still open to the world, right? (And the same for the backend?)
We could solve this by placing the EC2 instance into a VPN.
One disadvantage to that approach is it becomes annoying to reach the model registry from outside the VPN, like if you have a deployment pipeline based in GitHub Actions. Or if you wanted to trigger a ClearML pipeline from a VPC that isn...
Disclaimer: I'm not familiar enough with the ClearML codebase to vouch for the quality of this PR, although it is short, which is typically good. The feature we're interested in is the ability to specify the subnet_id.
I may be able to prepare a PR that only allows specifying the subnet ID. Can you help me brainstorm scenarios you'd want to see tested? Also, do these need to be automated tests?
Thank you! I think it does. It's just now dawning on me that, because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances and another queue for larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously....
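Concretely, I'm picturing something like this (a sketch based on my reading of the PipelineController docs; project, task, and queue names are placeholders):

```python
from clearml.automation import PipelineController

# Hedged sketch: a pipeline whose steps run on different queues, so CPU-heavy
# preprocessing lands on the small-instance queue and training on the GPU queue.
pipe = PipelineController(name="example-pipeline", project="examples", version="0.0.1")

pipe.add_step(
    name="preprocess",
    base_task_project="examples",
    base_task_name="preprocess-template",
    execution_queue="cpu",          # picked up by the CPU-instance agents
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="examples",
    base_task_name="train-template",
    execution_queue="gpu",          # picked up by the GPU-instance agents
)

pipe.start(queue="services")        # the controller itself runs on the services queue
```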
One idea: is it possible to store usable credentials in advance and place them in a volume that the ClearML containers can access and then use?
Actually, dumb question: how do I set the setup script for a task?
Here's a screenshot of a session where I first try to clone as ssm-user, but it fails, then I change to root and it succeeds.
Well wow, I figured it out. You equipped me with a solid debugging tool, a.k.a. running bash commands within the docker container.
I had to pre-add GitHub and Bitbucket to known hosts by adding keyscan commands (sketched after the config below):
configurations:
extra_clearml_conf: ""
extra_trains_conf: ""
extra_vm_bash_script: |
echo "fetching github key" && (aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa &...
I can't think of any changes we might have made on our side to cause that.
configurations:
extra_clearml_conf: ""
extra_trains_conf: ""
extra_vm_bash_script: |
aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa
source /clearml_agent_venv/bin/activate
hyper_params:
iam_arn: arn:aws:iam::<my account id>:instance-profile/clearml-2-AutoscaledInstanceProfileAutoScaledEC2InstanceProfile56A5348F-90fmf6H5OUBx
I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh, but instead it's -v /root.ssh:/.ssh.
Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue.
Let's see. The screenshots above are me running on the host, not attaching to a running container. So I believe I do want the keys to be mounted into the running containers.
Oh hooray! So docker-compose manages the restarting of crashed containers? I didn't know that, and that is great.
OOooh, excellent. So the file server isn't necessary at all if you're using some other object storage? That's slick!
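So for our setup, I'm assuming that means pointing tasks straight at S3, roughly like this (sketch; the bucket is a placeholder):

```python
from clearml import Task

# Hedged sketch: send model checkpoints and artifacts straight to S3 instead of
# the ClearML fileserver. The bucket/prefix is a placeholder.
task = Task.init(
    project_name="examples",
    task_name="train",
    output_uri="s3://my-bucket/clearml",   # artifacts/models go here, not the fileserver
)
```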
Is there a way I could move the JWT authentication (not authorization) logic into an API Gateway or Load Balancer? For example, if ClearML is following OAuth 2.0, then the load balancer or API Gateway could reach out to its "issuer URL" (probably available on the EC2 instance where ClearML is running) like this example here.
![image](https://clearml-web-assets.s3.amazonaws.c...
So, we've been able to run sudo su and then git clone with our private repos a few times now.