
Here's a screenshot of a session where I first try to clone as ssm-user, but it fails; then I change to root and it succeeds.
But I actually wish the interface were more like the apiserver.conf file--specifically, that you can define hard-coded credentials in this file in advance. Except I wish that you could define API keys this way (or some other way):
auth {
    # Fixed users login credentials
    # No other user will be able to login
    fixed_users {
        enabled: true
        pass_hashed: false
        users: [
            {
                username: "test"
                password: "test"
                ...
One idea: is it possible to store usable credentials in advance and place them in a volume that the ClearML containers can access and then use?
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
- Pipeline to prepare for deployment. Trigger a GitHub Actions ...
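For what it's worth, here's a rough sketch of how I imagine wiring that up with ClearML's TriggerScheduler. I haven't actually tried this; the task ID, queue, and project names are placeholders:

from clearml.automation import TriggerScheduler

# Poll the server every few minutes for newly tagged models (sketch only)
trigger = TriggerScheduler(pooling_frequency_minutes=3)

# Hypothetical: when a model gets the "release candidate" tag, enqueue a
# pre-existing fairness/bias evaluation task on a CPU queue
trigger.add_model_trigger(
    name="release-candidate-eval",
    schedule_task_id="<fairness-eval-task-id>",   # placeholder task ID
    schedule_queue="cpu",                         # placeholder queue name
    trigger_project="my-models",                  # placeholder project
    trigger_on_tags=["release candidate"],
)

trigger.start()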
It's an Amazon Linux AMI with the AWS CLI pre-installed on it. It uses the AWS CLI to fetch the key from AWS SSM Parameter Store. It's granted read access to that SSM Parameter via the instance role.
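For my own notes: the fetch is roughly equivalent to this boto3 call, although in practice we do it with the AWS CLI in user-data. The parameter name here is made up:

import boto3

ssm = boto3.client("ssm")  # picks up the instance role's credentials automatically
resp = ssm.get_parameter(
    Name="/clearml/github-deploy-key",  # hypothetical parameter name
    WithDecryption=True,                # assuming it's stored as a SecureString
)
deploy_key = resp["Parameter"]["Value"]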
So, we've been able to run sudo su and then git clone with our private repos a few times now.
Thank you! I think it does. It's just now dawning on me that, because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances and another queue for larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously....
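Just to sanity-check my understanding of the per-step queue assignment, I imagine it looks something like this (function names, queue names, and project are placeholders; I haven't run this):

from clearml import PipelineController

# Placeholder step functions, just for illustration
def prepare_data():
    return "dataset-id"

def train_model():
    return "model-id"

pipe = PipelineController(name="example-pipeline", project="examples", version="0.0.1")

pipe.add_function_step(
    name="prepare_data",
    function=prepare_data,
    execution_queue="cpu",      # queue served by smaller CPU instances
)
pipe.add_function_step(
    name="train_model",
    function=train_model,
    parents=["prepare_data"],
    execution_queue="gpu",      # queue served by larger GPU instances
)

pipe.start(queue="services")    # the controller itself can run on the services queue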
This thread should be immortalized. Super stoked to try this out!
If the load balancer or API Gateway can do the computation and leverage caching, we're much safer against DDoS attacks. In general, I'd prefer not to have our EC2 instance directly exposed to the public Internet.
@<1557175205510516736:profile|ShallowSwan53> at this point, I think this question deserves its own thread. I'm curious about it too!
So the problem came back even with this new URL. I discovered clearing your cookies fixes it.
Trying as a python subprocess...
Hi friends, I'm just seeing these new messages. I read these links and I agree with @<1557175205510516736:profile|ShallowSwan53> . It's nice that the webapp has these pages, but what is the workflow to actually use this registry?
Also, @<1557175205510516736:profile|ShallowSwan53> , do you have a specific workflow in mind that you're hoping to get from ClearML?
At BEN, we're experimenting with
- BentoML for model serving. It's a Python REST framework a lot like FastAPI, but with some nice...
^^^ For my own notes: this is the web request made by the frontend to create a set of credentials
Oh this is thought-provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me, speaking as an MLOps engineer). And having the power of Metaflow's scheduler (on Step Functions with EventBridge, since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions about how we could do event-based jobs with alerting built in on ClearML in a different thread later on.
I pasted your points (anonymously) onto the Metaflow slack to le...
cc: @<1565509803839590400:profile|MoodyBear54>
I don't know about this, but could you turn your whole project into a pip-installable package using a setup.py and/or pyproject.toml? I've never tried this, but maybe then you could do pip install -e . locally before executing the task. Then execute. And then maybe the pip freeze that ClearML does would contain the symlink to your directory (so that from my_package import ... statements would work).
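Something as minimal as this is what I have in mind (the package name and layout are made up, and I haven't verified that the pip freeze ClearML does actually picks up an editable install):

# setup.py -- minimal sketch, assuming the code lives under a my_package/ directory
from setuptools import setup, find_packages

setup(
    name="my_package",          # hypothetical package name
    version="0.1.0",
    packages=find_packages(),
)

Then pip install -e . from the repo root before Task.init / execute_remotely, and see what ends up in the installed-packages list.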
I did a quick local experiment and observed that credentials created from the UI indeed become invalid if you delete the ClearML volumes.
- starting docker-compose locally
- creating a set of credentials from the UI
- hardcoding those credentials into the docker-compose file
- restarting
- the agent-services container started up and successfully became a registered worker
- I killed the docker-compose and deleted the volume folders
- restarted the docker-compose (with the same hard-coded...
Wow, that is seriously impressive.
Oh, that is cool. I captured all this. Maybe I'll make a user-data.sh script and a docker-compose.yml file that bring all these things together. Probably won't have time for a few weeks.
You know, you could probably add some immortal containers to the docker-compose.yml that use images with mongodump and the ES equivalent installed.
The container(s) could have a bash script with a while loop in it that sleeps for 30 minutes and then does a backup. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because docker-compose.yml could make sure that if the backup container ever dies, it would be restarted.
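Sketching that loop in Python rather than bash (same idea, just what I had open); the bucket name, backup path, and mongo hostname are placeholders and I haven't tested this:

import subprocess
import time

BACKUP_INTERVAL_SECONDS = 30 * 60            # every 30 minutes
S3_BUCKET = "s3://my-clearml-backups"        # hypothetical bucket name

while True:
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dump_dir = f"/backups/mongo-{stamp}"

    # Dump MongoDB (assumes the mongo service is reachable as "mongo" on the compose network)
    subprocess.run(["mongodump", "--host", "mongo", "--out", dump_dir], check=True)

    # Ship the dump to S3 with the AWS CLI (assumes the CLI is installed in the container)
    subprocess.run(
        ["aws", "s3", "cp", dump_dir, f"{S3_BUCKET}/mongo-{stamp}", "--recursive"],
        check=True,
    )

    time.sleep(BACKUP_INTERVAL_SECONDS)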
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have 1 backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
@<1523701070390366208:profile|CostlyOstrich36> Oh that's smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
As opposed to using cron or something
I'm imagining:
- The EC2 instance would be in a private subnet, accessible only on the VPN (read: VPC)
- The API Gateway and Load Balancer would also be on the VPC and therefore have access to the private subnet BUT the API Gateway or Load Balancer themselves would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
I have the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script().
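For context, this is roughly the shape of the script (run_shell_script() is our own helper; the queue name and project are placeholders):

from clearml import Task

def run_shell_script():
    # placeholder for our helper that shells out to the actual script
    pass

task = Task.init(project_name="examples", task_name="shell-script-task")

# Stop local execution here and re-enqueue this task to run on an agent.
# The behavior I see is the same whether this call comes before or after run_shell_script().
task.execute_remotely(queue_name="default", exit_process=True)

run_shell_script()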