
Eureka! The issue went away. I'm still not sure why, but what finally made it work was creating a set of credentials manually in the UI and then setting those in my ~/clearml.conf file.
Do you happen to have a link to a docker-compose.yaml file that has a hardcoded set of credentials?
I want to seed the clearml instance with a set of credentials and ~/clearml.conf to run automated tests.
I SOLVED IT, NO NEED TO READ FURTHER 😄
I'm a chump and didn't read the docs: None
Oh, I think I got overexcited and didn't look at this closely. So this ACCESS/SECRET key pair is on the agent-services container. I can see that agent-services is simply a container running `clearml-agent daemon --queue ser...
Wow, it really does not want to show the output of those print statements in stdout. Here's the output of the task from the console after cloning it. Confirmed that the setup script and all code changes are present:
So here's a snippet from my aws_autoscaler.yaml file.
I took a stab at writing an automated trigger to handle this. The goal is: anytime a pipeline succeeds or fails, let AWS know so that the input records can be placed onto a retry queue (or not)
I'm trying to get a trigger to work in general, and then I'll add the more complex AWS logic. But I seem to be missing a step somewhere:
I wrote a file called set_triggers.py:
from clearml.automation.trigger import TriggerScheduler
TRIGGER_SCHEDULER = TriggerScheduler()
from pprint import...
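For reference, this is roughly the shape I'm aiming for in set_triggers.py (just a sketch, not the actual file: I'm assuming add_task_trigger accepts a schedule_function callback along with trigger_project / trigger_on_status filters, and notify_aws is a made-up stand-in for the real AWS logic):
```python
from clearml.automation.trigger import TriggerScheduler

def notify_aws(task_id):
    # Stand-in for the real logic: look up the finished pipeline task and tell
    # AWS whether its input records should go onto a retry queue (or not).
    print(f"pipeline task {task_id} finished")

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_task_trigger(
    name="pipeline-completion-trigger",
    schedule_function=notify_aws,               # called with the triggering task's ID
    trigger_project="pipelines",                # assumed project holding the pipeline runs
    trigger_on_status=["completed", "failed"],  # fire on success or failure
)
scheduler.start()  # or start_remotely(queue="services") to run the scheduler on an agent
```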
Hmm... these people are recommending restarting docker completely. I may have tried that already, but I'll do it again when I get some time to be sure.
Haha, that was a total gotcha for me. Yeah, a lot just wasn't even getting run due to the #!/bin/bash part.
Anyway, wow! I finally got the precious console logs you were hoping to find. Here they are:
2023-05-06 00:19:21
User aborted: stopping task (3)
2023-05-06 00:19:21
Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7....
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That’d include monitoring and alerting. I’m afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto Step Functions, the scheduling is managed for you, and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that’s the case, then we’d probably only...
That is great! This is all the motivation I needed to decide to do a POC at some point.
@<1523701070390366208:profile|CostlyOstrich36> Oh that’s smart. Is that to make sure no transactions happen during the backup? Would there be a risk of ongoing or pending tasks somehow getting corrupted if you shut the server down?
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLFlow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML 😄
Oh my goodness. Thank you! I'd seen that before, but for some reason it didn't register I could run that with VS Code...
But this config should almost never need to change!
Host clearml-session
    HostName localhost
    User root
    Port 8022
Oh this is thought provoking. Yeah, the idea of using ClearML for R&D is super appealing (to me speaking as an MLOps engineer 😆 ). And having the power of Metaflow's scheduler (on Step Functions with Event Bridge since we'd do the AWS-native deployment) also makes sense to me.
I'll keep asking questions about how we could do event-based jobs with alerting built in on ClearML in a different thread later on.
I pasted your points (anonymously) onto the Metaflow slack to le...
When you run the docker-compose.yml on an EC2 instance, you can configure user login for the ClearML webserver. But the files API is still open to the world, right? (and same with the backend?)
We could solve this by placing the EC2 instance into a VPN.
One disadvantage to that approach is it becomes annoying to reach the model registry from outside the VPN, like if you have a deployment pipeline based in GitHub Actions. Or if you wanted to trigger a ClearML pipeline from a VPC that isn...
One idea: is it possible to store usable credentials in advance and place them in a volume that the ClearML containers can access and then use?
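Something along these lines is what I'm picturing for the test setup (just a sketch, assuming a key pair created once in the UI stays valid, and that a clearml.conf mounted from the volume gets picked up like a normally generated one; the paths and key values are placeholders):
```python
# Test-setup step: write a clearml.conf with a known key pair into a directory
# that docker-compose can mount into the agent container as a volume.
from pathlib import Path
from textwrap import dedent

ACCESS_KEY = "TESTACCESSKEY"  # placeholder; assumed to be a pair created once in the UI
SECRET_KEY = "TESTSECRETKEY"

conf = dedent(f"""
    api {{
        web_server: http://localhost:8080
        api_server: http://localhost:8008
        files_server: http://localhost:8081
        credentials {{
            "access_key" = "{ACCESS_KEY}"
            "secret_key" = "{SECRET_KEY}"
        }}
    }}
""")

seed_dir = Path("seed-volume")  # hypothetical directory mounted into the container
seed_dir.mkdir(exist_ok=True)
(seed_dir / "clearml.conf").write_text(conf)
```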
Yeah, I believe all VS Code Extensions are in TypeScript. My main point was that this is an example of a VS Code extension that executes a Python CLI.
Dang! @<1590514584836378624:profile|AmiableSeaturtle81> awesome answer thank you! You seem like an awesome person to know. Definitely connect if you'd like to talk ops stuff sometime. None
Hey @<1523701482157772800:profile|AnxiousSeal95> ! I think ClearML's orchestrator is a great fit for ad-hoc experimentation, but not for (event-triggered) batch inference jobs that need to be relied on in production.
I'd only feel comfortable supporting pipelines that serve end users on a tool that is known for that, e.g. Metaflow, Dagster, or Airflow--mainly because those tools emphasize good monitoring and integration with the wider data ecosystem.
I've also used Airflow and Dagster in prod, but not integrated them with an exp tracker.
At the time that I run python aws_autoscaler.py --remote, that clearml-services worker is the only worker on the services queue. So it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup because CLEARML_API_HOST is not set, even though it is set for the docker container that the agent is running on.
Here's the full autoscaler log where the failure happens if that's helpful.
I can't think of any changes we might have made on our side to cause that 🤔
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default? Or would this only work if they were listening to different queues?
Here's a docker-compose I've been playing with. It doesn't have the same restart problem you're describing, but I did change the volume mounts: None
Disclaimer: I'm not familiar enough with the ClearML codebase to vouch for the quality of this PR, although it is short, which is typically good. The feature we're interested in is the ability to specify the subnet_id.
Caching can be a reason. Say you do some heavy data loading / processing in step 1. Now you're developing step 2.
It'd be nice not to have to re-run Step 1 every time you want to test a change to step 2.
You could find a way to simply write your output of step1 to disk and do everything in one step, or you could let ClearML handle that caching for you--with the added benefit that others collaborating remotely can also use the outputs of steps you've cached with ClearML
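Here's a rough sketch of what I mean, using the decorator-based pipelines (names and paths are made up; the point is just cache=True on the expensive step):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset_path"], cache=True)
def step_one():
    # Heavy data loading / processing. With cache=True, re-running the pipeline
    # with unchanged code and inputs reuses the previously stored output.
    dataset_path = "/tmp/dataset.parquet"
    return dataset_path

@PipelineDecorator.component(return_values=["model_path"])
def step_two(dataset_path):
    # The step under active development; only this one re-executes on each run.
    model_path = "/tmp/model.pkl"
    return model_path

@PipelineDecorator.pipeline(name="cached example", project="examples", version="0.0.1")
def run_pipeline():
    data = step_one()
    step_two(data)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # swap for an execution queue to run steps on agents
    run_pipeline()
```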
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do 😄
And look! It's working, assuming you start the clearml session up yourself:
I have the same behavior whether I put task.execute_remotely(...) before or after the call to run_shell_script().
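For reference, this is the simplified shape of the script (run_shell_script here is a stub for my real helper; project, queue, and file names are made up):
```python
import subprocess
from clearml import Task

def run_shell_script():
    # stub for my real helper that runs the setup script
    subprocess.run(["bash", "setup.sh"], check=True)

task = Task.init(project_name="examples", task_name="remote shell test")

# Ordering A: enqueue first -- everything after this call runs only on the agent,
# and the local process exits here (exit_process defaults to True).
task.execute_remotely(queue_name="default")

run_shell_script()

# Ordering B is the same script with run_shell_script() called before
# task.execute_remotely(...); either way I see the behavior described above.
```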
Thanks for this!! I may try it and if I do and it works I’ll look into writing a plugin for ZenML and Metaflow that auto initializes the parent task and registers the steps as child tasks. Super helpful thank you!
Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue 🤔
And for the session:
clearml-session --queue sessions --docker python:3.9