But I actually wish the interface were more like the apiserver.conf file--specifically, the way you can define hard-coded login credentials in that file in advance. Except I wish you could define API keys this way (or some other way):
auth {
    # Fixed users login credentials
    # No other user will be able to login
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "test"
                password: "test"
            ...
One idea: is it possible to store usable credentials in advance and place them in a volume that the ClearML containers can access and then use?
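To make it concrete, here's roughly what I'd hope consuming pre-provisioned credentials would look like from the SDK side (untested sketch; the hosts and key/secret here are placeholders, and it assumes the server would actually accept a pre-defined pair):

import os
from clearml import Task

# Hypothetical pre-provisioned credentials, e.g. mounted into the container
# as environment variables or read from a file on a shared volume
access_key = os.environ.get("CLEARML_API_ACCESS_KEY", "preprovisioned-key")
secret_key = os.environ.get("CLEARML_API_SECRET_KEY", "preprovisioned-secret")

# Point the SDK at the server with those credentials instead of relying on
# a clearml.conf generated from the web UI
Task.set_credentials(
    api_host="http://localhost:8008",
    web_host="http://localhost:8080",
    files_host="http://localhost:8081",
    key=access_key,
    secret=secret_key,
)

task = Task.init(project_name="demo", task_name="credentials-smoke-test")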
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
- Pipeline to prepare for deployment. Trigger a GitHub Actions ...
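For reference, ClearML's TriggerScheduler looks like it could cover the "react to a tag" piece of that; a rough, untested sketch (the callback is just a placeholder and doesn't handle the per-process retry logic):

from clearml import Model
from clearml.automation import TriggerScheduler

def on_release_candidate(model_id):
    # Placeholder callback: kick off the downstream processes here
    # (fairness/bias evaluation task, GitHub Actions dispatch, etc.)
    model = Model(model_id=model_id)
    print(f"Model {model.name} ({model_id}) tagged as a release candidate")

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_model_trigger(
    name="release-candidate-hook",
    schedule_function=on_release_candidate,
    trigger_on_tags=["release candidate"],
)
# Run the scheduler itself as a task on the services queue
scheduler.start_remotely(queue="services")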
It's an Amazon Linux AMI with the AWS CLI pre-installed on it. It uses the AWS CLI to fetch the key from AWS SSM Parameter Store. It's granted read access to that SSM Parameter via the instance role.
So, we've been able to run sudo su and then git clone with our private repos a few times now.
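For reference, this is roughly the boto3 equivalent of what the user-data does (assuming the instance role grants ssm:GetParameter on that parameter):

import os
import boto3

# Fetch the deploy key from SSM Parameter Store using the instance role,
# so no static AWS credentials need to live on the box
ssm = boto3.client("ssm", region_name="us-west-2")
param = ssm.get_parameter(
    Name="/clearml/github_ssh_private_key",
    WithDecryption=True,
)

key_path = os.path.expanduser("~/.ssh/id_rsa")
os.makedirs(os.path.dirname(key_path), exist_ok=True)
with open(key_path, "w") as f:
    f.write(param["Parameter"]["Value"])
os.chmod(key_path, 0o600)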
If the load balancer or API Gateway can do the computation and leverage caching, we’re much safer against DDoS attacks. In general, I’d prefer not to have our EC2 instance directly exposed to the public Internet.
@<1557175205510516736:profile|ShallowSwan53> at this point, I think this question deserves its own thread. I'm curious about it too!
So the problem came back even with this new URL. I discovered that clearing your cookies fixes it.
Trying as a Python subprocess...
Hi friends, I'm just seeing these new messages. I read these links and I agree with @<1557175205510516736:profile|ShallowSwan53> . It's nice that the webapp has these pages, but what is the workflow to actually use this registry?
Also, @<1557175205510516736:profile|ShallowSwan53> , do you have a specific workflow in mind that you're hoping to get from ClearML?
At BEN, we're experimenting with
- BentoML for model serving. It's a Python REST framework a lot like FastAPI, but with some nice...
^^^ For my own notes: this is the web request made by the frontend to create a set of credentials
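I think the same call can be made from Python through the SDK's API client; hedging here since I haven't checked the exact response shape, and it assumes you already have one valid key/secret configured to authenticate with:

from clearml.backend_api.session.client import APIClient

# Uses whatever credentials the local clearml.conf / environment provides
client = APIClient()

# auth.create_credentials is the endpoint behind the web UI's "create credentials" button
response = client.auth.create_credentials()
print(response)  # should contain the new access_key / secret_key pair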
Oh, this is thought-provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me, speaking as an MLOps engineer 😆). And having the power of Metaflow's scheduler (on Step Functions with Event Bridge, since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions about how we could do event-based jobs with alerting built in on ClearML in a different thread later on.
I pasted your points (anonymously) onto the Metaflow slack to le...
cc: @<1565509803839590400:profile|MoodyBear54>
I don't know about this, but could you turn your whole project into a pip-installable package using a setup.py and/or pyproject.toml? I've never tried this, but maybe then you could do pip install -e . locally before executing the task. Then execute. And then maybe the pip freeze that ClearML does would contain the symlink to your directory (so that from my_package import ... statements would work).
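Something like this is what I had in mind for the setup.py (totally untested; my_package is just a stand-in for your top-level package name):

# setup.py -- minimal stub so `pip install -e .` works from the repo root
from setuptools import find_packages, setup

setup(
    name="my_package",         # stand-in; match your import path
    version="0.0.1",
    packages=find_packages(),  # picks up my_package/ and its subpackages
    install_requires=[],       # runtime deps, if any
)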
I did a quick local experiment and observed that credentials created from the UI indeed become invalid if you delete the ClearML volumes.
- starting docker-compose locally
- creating a set of credentials from the UI
- hardcoding those credentials into the docker-compose file
- restarting
- the agent-services container started up and successfully became a registered worker
- I killed the docker-compose and deleted the volume folders
- restarted the docker-compose (with the same hard-coded...
Wow, that is seriously impressive.
Oh, that is cool. I captured all this. Maybe I'll make a user-data.sh script and docker-compose.yml file that brings all these things together. Probably won't have time for a few weeks.
Earlier in the thread they mentioned that the agents are all resilient, so no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have 1 backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
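For what it's worth, the stop-archive-restart cycle I'm picturing looks roughly like this (untested sketch; assumes the default /opt/clearml layout from the install docs):

import datetime
import os
import shutil
import subprocess

COMPOSE_FILE = "/opt/clearml/docker-compose.yml"

# Stop the server so Elasticsearch / MongoDB can flush everything to disk
subprocess.run(["docker-compose", "-f", COMPOSE_FILE, "down"], check=True)

# Archive the data folders (default location is /opt/clearml/data)
os.makedirs("/opt/clearml/backups", exist_ok=True)
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
shutil.make_archive(f"/opt/clearml/backups/clearml-data-{stamp}", "gztar", "/opt/clearml/data")

# Bring the server back up
subprocess.run(["docker-compose", "-f", COMPOSE_FILE, "up", "-d"], check=True)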
@<1523701070390366208:profile|CostlyOstrich36> Oh that’s smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
As opposed to using CRON or something 🤣
I'm imagining:
- The EC2 instance would be in a private subnet, accessible only on the VPN (read: VPC)
- The API Gateway and Load Balancer would also be on the VPC and therefore have access to the private subnet, BUT only the API Gateway or Load Balancer themselves would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
I have the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script().
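For context, the two orderings I tried look roughly like this (run_shell_script is our own helper, stubbed out here):

from clearml import Task

def run_shell_script():
    # stub for our helper that shells out and does the actual work
    ...

task = Task.init(project_name="demo", task_name="shell-script-task")

# Variant A: enqueue first; everything after this line runs only on the worker
task.execute_remotely(queue_name="default", exit_process=True)
run_shell_script()

# Variant B (also tried): call the helper first, then enqueue
# run_shell_script()
# task.execute_remotely(queue_name="default", exit_process=True)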
Sorry, clarifying:
The agent-services entry in the docker-compose file seems to add a single worker to the services queue.
So, we've already got a model registry: MLFlow
And we've got a serving framework we really like: BentoML and FastAPI
The debate is between ClearML and Metaflow for
- training models during the research phase
- re-training models on a schedule or event-based trigger in production
- running batch inference jobs on a schedule or event-based trigger in production
And for the session
clearml-session --queue sessions --docker python:3.9
configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa
    source /clearml_agent_venv/bin/activate
hyper_params:
  iam_arn: arn:aws:iam::<my account id>:instance-profile/clearml-2-AutoscaledInstanceProfileAutoScaledEC2InstanceProfile56A5348F-90fmf6H5OUBx
For these functions, Metaflow offers:
- Triggering: integration with AWS Event Bridge. It's really easy to use Boto3 and AWS access keys to emit events for Metaflow DAGs. It's nice not to have to worry about networking for this.
- Scheduling: the fact that Metaflow uses Step Functions is reassuring.
- Observability: this lovely flame graph where you can view the logs and duration of each step in the DAG. It's easy to view all the DAG runs, including the ones that have failed. Ideally, we w...