
Reputation
Badges 1
129 × Eureka!I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
Yes, it's pretty lame that a clearml-agent
can only process one task at a time if it's not listening to a services
queue 🤔
At the time that I run python aws_autoscaler.py --remote
, that clearml-services
worker is the only worker on the services
queue. So it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup due to the CLEARML_API_HOST
not being set, but it is set for the docker container that the agent is running on.
Here's the full autoscaler log where the failure happens if that's helpful.
possibly cheaper on the cloud (Lambda vs EC2 instance)
Whoa, are you saying there's an autoscaler that doesn't use EC2 instances? I may be misunderstanding, but that would be very cool.
Maybe I should have said: my plan is to use AWS StepFunctions where a single task in the DAG is an entire ClearML pipeline . The non-ClearML steps would orchestrate putting messages into a queue, doing retry logic, and triggering said pipeline.
I think at some point, there has to be some amount of...
This is a low-key open-source project if anyone wanted to contribute. Since the project is early, there are lots of high-impact things, e.g. UI polish, that would be relatively low effort 😄
Oh! System tags! That would definitely have been a better way to do it. We ended up querying for tasks in the "DevOps" project with the name "Interactive Session"
I could imagine other useful automations for reacting to failed tasks that have certain tags, including alerting.
I realize we could move a lot of this logic into ClearML itself: make handler functions that run within the services queue. That would work for logic that is implemented in Python. But I believe it would be harder for our team to detect and respond to failures in the event handler functions if they were placed there because it seems unclear how we could use our existing systems a...
Oh I wasn’t aware of that. I don’t think it’d work for this use case though. We’re trying to test the behavior you can see here in this extension https://share.descript.com/view/g0SLQTN6kAk so basically the examples I said in that earlier message
So the problem came back even with this new URL. I discovered clearing your cookies fixes it.
Interesting . It’s actually just running locally on my laptop. It seemed only to be an issue when pointing the ClearML session CLI at my local version of ClearML. Still thinking about this one.
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example ?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
- Pipeline to prepare for deployment. Trigger a GitHub Actions ...
Thanks for the response @<1523701205467926528:profile|AgitatedDove14> !
What would you consider an event?
I was thinking of the TriggerScheduler
's definition of an event. Pretty much, any thing the TriggerSchedule allows you to react to, it'd be great to be able to publish those events to a queue external to ClearML, e.g. a tag added to a model (or removed), a state in a task changing, etc. We'd want as much metadata about that event as possible. So if the event is due to a task...
@<1557175205510516736:profile|ShallowSwan53> at this point, I think this question deserves it's own thread. I'm curious about it too!
I'm imagining:
- The EC2 instance would be in a private subnet, accessible only on the VPN (read: VPC)
- The API Gateway and Load Balancer would also be on the VPC and therefore have access to the private subnet BUT the API Gateway or Load Balancer themselves would be exposed to the public internet.
That way, to do the JWT authentication, the load balancer or API Gateway could reach out to the EC2 instance on the private network to authenticate any incoming ClearML SDK requests.
I've also used Airflow and Dagster in prod, but not integrated them with an exp tracker.
To do this, I think I need to know:
- Can you trigger a pre-existing Pipeline via the ClearML REST API? I'd want to have a Lambda function trigger the Pipeline for a batch without needing to have all the Pipeline code in the lambda function. Something like
curl -u '<clearml credetials>'
None,...
- [probably a big ask] If the pipeline succeeds/fails, can ClearML emit an event that I can react to? Like mayb...
I took a stab at writing an automated trigger to handle this. The goal is: anytime a pipeline succeeds or fails, let AWS know so that the input records can be placed onto a retry queue (or not)
I'm trying to get a trigger to work in general, and then I'll add the more complex AWS logic. But I seem to be missing a step somewhere:
I wrote a file called set_triggers.py
from clearml.automation.trigger import TriggerScheduler
TRIGGER_SCHEDULER = TriggerScheduler()
from pprint import...
You have no idea what is committed to disk vs what is still contained in memory.
If you ran docker-compose down
and allowed ES to gracefully shut down, would ES finish writing everything to disk, therefore guaranteeing that the backups wouldn't get corrupted?
I could potentially write a selenium script to make a set of keys, but I'd prefer to avoid that 😅
Does this mean that none of the credientials in this file can be used with the clearml SDK when the docker-compose.yaml starts up with a fresh state?
Is there anyway to achieve such a behavior? Or are manual steps simply required to get a working set of keys. I'm trying to prepare a docker-compose file that I can use for automated tests of our VS Code extension.
That's fabulous. This is definitely how my team prefers to structure projects. I hadn't gotten around to trying that out in our POC of ClearML yet, but I'm certain this is how our group will solve this problem
I don't know that you'd have to pre-build credentials into docker. If you could specify a set of credentials as environment variables to the docker run ...
command or something, that would work just fine.
The goal is to be able to run docker-compose up
in CI, which starts a clearml-server. And then make several API calls to the started ClearML server to prove that the VS Code extension code is working.
Examples:
- Assert that the extension can auth with ClearML
- Assert that the ext...
How it works / what we finished:
- We used the SaaS ClearML, started an EC2 instance, and manually installed and ran the
clearml-agent daemon
on it - We ran
clearml-init
on our laptops to generate theclearml.conf
file. - The extension is in TypeScript, so...
- We started trying to write code with the Python SDK to list sessions, but realized calling that from the extension would be hard, so we opted to have the TypeScript code make calls to the ClearML API server directly, e.g. ...
If the load balancer it Gateway can do the computation and leverage caching, we’re much safer against DDOS attacks. In general, I’d prefer not to have our EC2 instance directly exposed to the public Internet.
Here's a docker-compose I've been playing with. It doesn't have the same restart problem you're describing, but I did change the volume mounts: None
Yay! Man, I want to do ClearML with "hard mode" (non-enterprise, self-hosted) first, before trying to sell BENlabs (my work) on it. I could see us paying for enterprise to get the Hyper Datasets and Vault features if our scientists/developers fall in love with it--they probably will if we can get them to adopt it since right now we have a homemade system that isn't nearly as nice as ClearML.
@<1523701087100473344:profile|SuccessfulKoala55> how exactly do you configure ClearML to use the cr...
Is there a way we can protect a ClearML deployment with a load balancer or API Gateway that is exposed to the whole world, but is protected by authentication so that only authorized clients can get in?