This is totally what I was looking for! Yeah, by "good story for offline batch" I meant, "good feature support for ..."
I bookmarked this comment. I think I'll be doing a POC trying to show this functionality within the next month.
And for the session
clearml-session --queue sessions --docker python:3.9
Hmm... these people are recommending restarting docker completely. I may have tried that already, but I'll do it again when I get some time to be sure.
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLFlow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML 😄
I've also tried running a clearml-agent daemon directly on my mac (not in docker) serving the sessions queue for the ClearML server that is running in docker. When I do that, it consistently fails with a different error. Something to do with mounting a volume.
The agent commands are nothing special.
clearml-agent daemon --queue sessions --cpu-only --create-queue true --docker
I think it will work. There's a lot of really useful code in the black extension. I'm recruiting people now to join in on Friday. I'm actually very confident about it after messing around.
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part which is creating clearml-session in the first place.
This could be a really low-hanging OSS contribution that could make a real impact 😄 .
@<1594863216222015488:profile|ConvincingGrasshopper20> throwing this out there... would you want to make this with me at the Hackathon??
I took a look
- I think the Outerbounds extension (the one in my screenshot) is currently closed source. That makes sense to me. A bit sad because it is highly similar.
- Another example could be the AWS ToolKit extension. But sadly, it's hardly a "minimal example". I was thinking it's relevant because it uses your local ~/.aws/ folder, which is similar to what we'd want to do.
Yeah. I'd need to clone this and run it locally to start to understand how it all works. Would be a cool exercise. They advertise that it's really easy to author VS Code extensions. I've seen pretty junior folks do it which makes me think it can't be too bad 😆
Yeah, I believe all VS Code Extensions are in TypeScript. My main point was that this is an example of a VS Code extension that executes a Python CLI.
This is a low-key open-source project if anyone wanted to contribute. Since the project is early, there are lots of high-impact things, e.g. UI polish, that would be relatively low effort 😄
@<1523701205467926528:profile|AgitatedDove14> you beautiful person, this is terrific! I do believe SageMaker has some nice monitoring/data drift capabilities that seem interesting, but these points you have here will be a fantastic starting point for my team's analysis of the products. I think this will help balance some of the over-enthusiasm towards using the native AWS solution.
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an on-startup.sh script you really like using to set up your instance, and VS Code extensions you want to pass to the --install-extensions ... flag
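A rough sketch of what a preset could look like, in Python for brevity (the extension itself is TypeScript). SessionPreset, PRESETS, and to_argv are hypothetical names, not part of clearml-session. Only the --queue and --docker flags from the command earlier in this thread are used; the startup script and extensions options are left out because their exact flag names aren't confirmed here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SessionPreset:
    """Hypothetical preset: a named bundle of clearml-session options."""

    queue: str
    docker: Optional[str] = None

    def to_argv(self) -> List[str]:
        # Build the argument list for the CLI; only flags seen in this thread.
        argv = ["clearml-session", "--queue", self.queue]
        if self.docker:
            argv += ["--docker", self.docker]
        return argv


# Hypothetical preset table the extension could show in a menu.
PRESETS: Dict[str, SessionPreset] = {
    "py39": SessionPreset(queue="sessions", docker="python:3.9"),
    "bare": SessionPreset(queue="sessions"),
}

if __name__ == "__main__":
    for name, preset in PRESETS.items():
        print(name, "->", " ".join(preset.to_argv()))
```

Returning a list instead of a single string means the result could be handed to a process launcher without shell quoting issues.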
At the time that I run python aws_autoscaler.py --remote, that clearml-services worker is the only worker on the services queue. So it will be the worker that picks up the autoscaler task.
But the task seems to be failing on startup because CLEARML_API_HOST is not set, even though it is set for the docker container that the agent is running in.
Here's the full autoscaler log where the failure happens if that's helpful.
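One way to narrow this down might be to print which ClearML connection variables the task's own process can actually see, since a variable set on the agent's container is not necessarily passed on to whatever the agent launches. A minimal sketch; missing_clearml_vars and REQUIRED_VARS are made-up helper names, and the list only covers the usual server and credential variables:

```python
import os
from typing import List, Mapping

# Environment variables ClearML reads for server addresses and credentials.
REQUIRED_VARS = (
    "CLEARML_API_HOST",
    "CLEARML_WEB_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_clearml_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_clearml_vars(os.environ)
    if missing:
        print("Missing in this process:", ", ".join(missing))
    else:
        print("All ClearML connection variables are visible here.")
```

Running it once in the agent container and once at the top of the task script would show whether the values are lost somewhere in between.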
Sorry, clarifying:
The agent-services entry in the docker-compose file seems to add a single worker to the services queue
Oh, that is cool. I captured all this. Maybe I'll make a user-data.sh script and docker-compose.yml file that brings all these things together. Probably won't have time for a few weeks.
Oh awesome @<1523701132025663488:profile|SlimyElephant79> ! If you want to take a look, I made a big list of things to add. I'm working on a docker-compose.yaml file so we can have a good local development environment.
There's a lot of room to improve this from cleaning up the code to adding features on the list.
How it works / what we finished:
- We used the SaaS ClearML, started an EC2 instance, and manually installed and ran the clearml-agent daemon on it
- We ran clearml-init on our laptops to generate the clearml.conf file.
- The extension is in TypeScript, so...
- We started trying to write code with the Python SDK to list sessions, but realized calling that from the extension would be hard, so we opted to have the TypeScript code make calls to the ClearML API server directly, e.g. ...
Is there some way we could programmatically list all current ClearML sessions?
We need a way to do that, maybe with the clearml-session CLI in order to populate the VS Code extension menu.
Oh! System tags! That would definitely have been a better way to do it. We ended up querying for tasks in the "DevOps" project with the name "Interactive Session"
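For reference, the name-based lookup described above could be sketched roughly like this, in Python for brevity. It assumes the API server's tasks.get_all endpoint accepts project (a list of project IDs, so the "DevOps" project would have to be resolved to its ID first), name, status, and only_fields; build_session_query and pick_sessions are hypothetical helpers, and the system-tag variant is left out because the exact tag isn't confirmed in this thread.

```python
import json
from typing import Any, Dict, List


def build_session_query(project_id: str,
                        task_name: str = "Interactive Session") -> Dict[str, Any]:
    """Build a request body for tasks.get_all (field names are assumptions)."""
    return {
        "project": [project_id],             # project IDs, not project names
        "name": task_name,                   # treated as a pattern by the server
        "status": ["in_progress"],           # only sessions that are still running
        "only_fields": ["id", "name", "status"],
    }


def pick_sessions(tasks: List[Dict[str, Any]],
                  task_name: str = "Interactive Session") -> List[Dict[str, Any]]:
    """Client-side safety net: keep only tasks whose name matches exactly."""
    return [t for t in tasks if t.get("name") == task_name]


if __name__ == "__main__":
    # "abc123" is a placeholder project ID, not a real one.
    print(json.dumps(build_session_query("abc123"), indent=2))
```

The exact-match filter is there because a pattern match on the server side could also return tasks whose names merely contain "Interactive Session".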
I may be able to prepare a PR that only allows specifying the subnet ID. Can you help me brainstorm scenarios you’d want to see tested? Also, do these need to be automated tests?
Haha, that was a total gotcha for me. Yeah, a lot just wasn't even getting run due to the #!/bin/bash part.
Anyway, wow! I finally got the precious console logs you thought to find, here they are:
2023-05-06 00:19:21 User aborted: stopping task (3)
2023-05-06 00:19:21 Successfully installed PyYAML-6.0 attrs-22.2.0 certifi-2022.12.7 charset-normalizer-3.1.0 clearml-agent-1.5.2 distlib-0.3.6 filelock-3.12.0 furl-2.1.3 idna-3.4 jsonschema-4.17.3 orderedmultidict-1.0.1 pathlib2-2.3.7....