cc: @<1565509803839590400:profile|MoodyBear54>
Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue 🤔
I took a stab at writing an automated trigger to handle this. The goal is: anytime a pipeline succeeds or fails, let AWS know so that the input records can be placed onto a retry queue (or not)
I'm trying to get a trigger to work in general, and then I'll add the more complex AWS logic. But I seem to be missing a step somewhere:
I wrote a file called set_triggers.py
from clearml.automation.trigger import TriggerScheduler
TRIGGER_SCHEDULER = TriggerScheduler()
from pprint import...
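For context, the shape I'm going for is roughly this (just a sketch, not the actual file; it assumes add_task_trigger() takes trigger_project / trigger_on_status / schedule_function, that the callback receives the triggering task's ID, and that the SQS queue URL and project name below are placeholders):
```python
# Sketch only (untested): fire a callback whenever a watched pipeline task finishes,
# and tell AWS about it via SQS. Assumes add_task_trigger() accepts
# trigger_project / trigger_on_status / schedule_function and that the callback
# receives the triggering task's ID. The queue URL and project name are placeholders.
import json

import boto3
from clearml import Task
from clearml.automation.trigger import TriggerScheduler

RETRY_QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/<account-id>/pipeline-retries"  # placeholder


def notify_aws(task_id: str) -> None:
    """Called by the scheduler when a watched task reaches a terminal status."""
    task = Task.get_task(task_id=task_id)
    status = str(task.get_status())
    if status == "failed":
        # Only failed runs get their input records sent back for retry.
        boto3.client("sqs").send_message(
            QueueUrl=RETRY_QUEUE_URL,
            MessageBody=json.dumps({"task_id": task_id, "status": status}),
        )


scheduler = TriggerScheduler()
scheduler.add_task_trigger(
    name="notify-aws-on-pipeline-finish",
    trigger_project="my-pipelines",  # placeholder: the project the pipeline tasks live in
    trigger_on_status=["completed", "failed"],
    schedule_function=notify_aws,
)
# Run the polling loop as its own task on the services queue (or .start() to run locally).
scheduler.start_remotely(queue="services")
```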
I took a look
- I think the Outerbounds extension (the one in my screenshot) is currently closed source. That makes sense to me. A bit sad, because it's highly similar to what we'd want to build.
- Another example could be the AWS Toolkit extension. But sadly, it's hardly a "minimal example". I was thinking it's relevant because it uses your local ~/.aws/ folder, which is similar to what we'd want to do.
Thank you! I think it does. It's just now dawning on me that, because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances and another queue for larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously....
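Something like this is what I'm picturing (a rough sketch; it assumes add_function_step() accepts an execution_queue argument, and the project, queue names, and step functions are made up):
```python
# Rough sketch (untested): route different pipeline steps to different queues.
# Assumes add_function_step() accepts execution_queue; the queue names, project,
# and step functions below are made-up placeholders.
from clearml import PipelineController


def preprocess(dataset_id: str) -> str:
    # ...CPU-bound feature engineering would go here...
    return dataset_id


def train(dataset_id: str) -> str:
    # ...GPU-bound training would go here...
    return "model-id"


pipe = PipelineController(name="example-pipeline", project="my-pipelines", version="0.0.1")
pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    function_kwargs={"dataset_id": "<dataset-id>"},
    function_return=["dataset_id"],
    execution_queue="cpu-small",  # served by several agents on cheap CPU instances
)
pipe.add_function_step(
    name="train",
    function=train,
    function_kwargs={"dataset_id": "${preprocess.dataset_id}"},
    function_return=["model_id"],
    execution_queue="gpu-large",  # served by agents on GPU instances
    parents=["preprocess"],
)
pipe.start(queue="services")  # the controller itself can sit on the services queue
```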
Trying as a python subprocess...
Or the log of the init script?
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an on-startup.sh script you really like using to set up your instance, and VS Code extensions you want to pass to the --install-extensions ... flag.
Haha, that was a total gotcha for me. Yeah, a lot just wasn't even getting run due to the #!/bin/bash part.
Anyway, wow! I finally got the precious console logs you suggested looking for; here they are:
2023-05-06 00:19:21
User aborted: stopping task (3)
2023-05-06 00:19:21
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7....
The key seems to be placed in the expected location
configurations:
  extra_clearml_conf: ""
  extra_trains_conf: ""
  extra_vm_bash_script: |
    aws ssm get-parameter --region us-west-2 --name /clearml/github_ssh_private_key --with-decryption --query Parameter.Value --output text > ~/.ssh/id_rsa && chmod 600 ~/.ssh/id_rsa
    source /clearml_agent_venv/bin/activate
hyper_params:
  iam_arn: arn:aws:iam::<my account id>:instance-profile/clearml-2-AutoscaledInstanceProfileAutoScaledEC2InstanceProfile56A5348F-90fmf6H5OUBx
As opposed to using CRON or something 🤣
You know, you could probably add some immortal containers to the docker-compose.yml that use images with mongodump and the ES equivalent installed. The container(s) could have a bash script with a while loop in it that sleeps for 30 minutes and then does a backup. If you installed the AWS CLI inside, it could even take care of uploading to S3.
I like this idea, because the docker-compose.yml could make sure that if the backup container ever dies, it would be restarted.
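The loop inside that container could be as simple as this (a sketch in Python rather than bash; it assumes mongodump and AWS credentials are available inside the container, and the bucket/host names are placeholders):
```python
# Sketch of the sidecar's loop (untested): dump MongoDB every 30 minutes and push to S3.
# Assumes mongodump is on PATH inside the container and boto3 can find AWS credentials.
# The bucket name, Mongo host, and paths are placeholders.
import subprocess
import time
from datetime import datetime, timezone

import boto3

BUCKET = "my-clearml-backups"  # placeholder
MONGO_HOST = "mongo"           # placeholder: the docker-compose service name
INTERVAL_SECONDS = 30 * 60

s3 = boto3.client("s3")

while True:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = f"/tmp/mongo-{stamp}.archive.gz"
    # Single gzipped archive of the whole MongoDB instance.
    subprocess.run(
        ["mongodump", f"--host={MONGO_HOST}", f"--archive={archive}", "--gzip"],
        check=True,
    )
    s3.upload_file(archive, BUCKET, f"mongo/{stamp}.archive.gz")
    time.sleep(INTERVAL_SECONDS)
```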
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
@<1523701070390366208:profile|CostlyOstrich36> Oh that's smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
Wow, that is seriously impressive.
Hey! Sorry, I don't think I ever solved this for Elasticsearch.
Earlier in the thread they mentioned that the agents are all resilient. So no ongoing tasks should be lost. I imagine even in a large organization, you could afford 5-10 minutes of downtime at 2AM or something.
That said, you'd only have 1 backup per day, which could be a big deal depending on the experiments you're running. You might want more than that.
Hey @<1523701482157772800:profile|AnxiousSeal95> ! I think ClearML's orchestrator is a great fit for ad-hoc experimentation, but not for (event-triggered) batch inference jobs that need to be relied on in production.
I'd only feel comfortable supporting pipelines that serve end users on a tool that is known for that, e.g. Metaflow, Dagster, or Airflow, mainly because those tools emphasize good monitoring and integration with the wider data ecosystem.
you mean as experiment management / model registry / data? I think this is the bread&butter of clearml
💯. I was wondering if anyone had had experience using ClearML together with one of these others.
I think most of them are alternatives to Metaflow.
Totally.
Like, if you google "dagster and clearml" or "prefect and clearml" or "airflow and clearml" -- I don't find any blogs written by people talking about how they use both of them together.
That's strange to me, becau...
I've also used Airflow and Dagster in prod, but not integrated them with an exp tracker.
Oh, right... the Docker image running on the instance takes care of the library versions. You guys are great!
@<1523701205467926528:profile|AgitatedDove14> you beautiful person, this is terrific! I do believe SageMaker has some nice monitoring/data drift capabilities that seem interesting, but these points you have here will be a fantastic starting point for my team's analysis of the products. I think this will help balance some of the over-enthusiasm towards using the native AWS solution.
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part, which is creating clearml-session in the first place.
This could be a really low-hanging OSS contribution that could make a real impact.
So, we've already got a model registry: MLflow.
And we've got serving frameworks we really like: BentoML and FastAPI.
The debate is between ClearML and Metaflow for
- training models during the research phase
- re-training models on a schedule or event-based trigger in production
- running batch inference jobs on a schedule or event-based trigger in production
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down and allowed ES to gracefully shut down, would ES finish writing everything to disk, thereby guaranteeing that the backups wouldn't get corrupted?
We should put a $100 bounty on a bash script that backs up and restores MongoDB, Redis, ES, etc. to S3 in the most resilient way possible.
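For the restore half, I'm imagining something along these lines (again just a sketch in Python rather than bash; the bucket, prefix, and host are placeholders, and it assumes the server containers are stopped before restoring):
```python
# Sketch of the restore half (untested): grab the newest MongoDB archive from S3
# and restore it. Assumes mongorestore is installed, boto3 can find credentials,
# and the ClearML server is stopped first. Bucket/prefix/host are placeholders.
import subprocess

import boto3

BUCKET = "my-clearml-backups"  # placeholder
PREFIX = "mongo/"              # placeholder

s3 = boto3.client("s3")
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
if not objects:
    raise SystemExit(f"No backups found under s3://{BUCKET}/{PREFIX}")

latest = max(objects, key=lambda obj: obj["LastModified"])  # most recent backup
local_path = "/tmp/mongo-restore.archive.gz"
s3.download_file(BUCKET, latest["Key"], local_path)

subprocess.run(
    ["mongorestore", "--host=mongo", f"--archive={local_path}", "--gzip", "--drop"],
    check=True,
)
```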