This is totally what I was looking for! Yeah, by "good story for offline batch" I meant, "good feature support for ..."
I bookmarked this comment. I think I'll be doing a POC trying to show this functionality within the next month.
In a future iteration, it'd be cool if you could configure presets. Like maybe you have an on-startup.sh script you really like using to set up your instance, and VS Code extensions you want to pass to the --install-extensions ... flag
I literally just ran into this minutes ago and was about to file a bug report. A colleague ran into the same problem. It looks like urllib3 upgraded to v2 last week.
I've also used Airflow and Dagster in prod, but not integrated them with an exp tracker.
I don't know that you'd have to pre-build credentials into docker. If you could specify a set of credentials as environment variables to the docker run ... command or something, that would work just fine.
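Something like this is what I have in mind (just a sketch; the CLEARML_* variables are the standard ones the SDK/agent read, and my-agent-image is a made-up name):

# Inject credentials at runtime instead of baking them into the image
docker run \
  -e CLEARML_API_HOST="https://api.clear.ml" \
  -e CLEARML_API_ACCESS_KEY="$CLEARML_API_ACCESS_KEY" \
  -e CLEARML_API_SECRET_KEY="$CLEARML_API_SECRET_KEY" \
  my-agent-image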
The goal is to be able to run docker-compose up in CI, which starts a clearml-server. And then make several API calls to the started ClearML server to prove that the VS Code extension code is working.
Examples:
- Assert that the extension can auth with ClearML
- Assert that the ext...
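For instance, the auth assertion could boil down to something like this (a sketch; assumes the apiserver is published on its default port 8008 by the docker-compose file):

# CI smoke test: fail the job if the test credentials can't authenticate
curl --fail -s -X POST http://localhost:8008/auth.login \
  -u "$CLEARML_API_ACCESS_KEY:$CLEARML_API_SECRET_KEY"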
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default. Or would this only work if they were listening to different queues?
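i.e., something like this on the one instance (just sketching what I mean):

# Multiple agent processes on one EC2 instance, all listening to the same queue
clearml-agent daemon --queue default --detached
clearml-agent daemon --queue default --detached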
Oh, right... the Docker image running on the instance takes care of the library versions. You guys are great!
Ah, but it's probably worth noting that the docker-compose.yml does register the EC2 instance that the server is running on as an agent listening on the services queue, so ongoing tasks in that queue that happen to be placed on the server would get terminated when docker-compose down is run.
I'll search around some more when I get time. I have no idea, but it feels like ClearML has already done the hard part which is creating clearml-session in the first place.
This could be a really low-hanging OSS contribution that could make a real impact 😄 .
I symlinked /opt/clearml to /mnt/xvda/clearml
Genius! I don't think I accounted for making sure the volumes ended up in the EBS volume mount in this CDK example ^^^. And I modified the docker-compose.yml file to point at a different location. Sym-linking is totally the route I should take if I get time to come back and clean up this repo.
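For posterity, the cleanup I have in mind is roughly this (a sketch; assumes the EBS volume is mounted at /mnt/xvda and the stock /opt/clearml data dirs):

# Stop the server, move the data onto the EBS mount, symlink it back
sudo docker-compose down
sudo mv /opt/clearml /mnt/xvda/clearml
sudo ln -s /mnt/xvda/clearml /opt/clearml
sudo docker-compose up -d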
Duh! I bet VS Code's Python extensions like the VS Code Black Extension would be a really good starting place. They are small and are wrappers around a Python CLI tool. I bet there's a lot we could adapt for the ClearML CLI
Here's the repo: I've recorded a few update videos documenting how we learned about authoring VS Code extensions and how we got it to its current state. They're linked in order in the README.
ChatGPT has made working with TypeScript and the VSCode extension framework really nice!
It seems you have a specific workflow in mind, but I'm not sure I follow it. Can you give a specific example?
Absolutely. So, let's say a DS tags a model in ClearML with "release candidate". It'd be great to have that trigger a number of processes, each with their own retry logic:
- A fairness/bias evaluation, potentially as a task in ClearML itself. This would load the model and run some sample datasets through it. The
- Pipeline to prepare for deployment. Trigger a GitHub Actions ...
How it works / what we finished:
- We used the SaaS ClearML, started an EC2 instance, and manually installed and ran the clearml-agent daemon on it
- We ran clearml-init on our laptops to generate the clearml.conf file.
- The extension is in TypeScript, so...
- We started trying to write code with the Python SDK to list sessions, but realized calling that from the extension would be hard, so we opted to have the TypeScript code make calls to the ClearML API server directly, e.g. ...
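To give a flavor of those direct calls, here's roughly the equivalent in curl (a sketch; assumes the SaaS endpoint, and the "name" filter is a guess at how clearml-session names its tasks):

# Get a token with the credentials from clearml.conf, then list session tasks
TOKEN=$(curl -s -u "$ACCESS_KEY:$SECRET_KEY" \
  -X POST https://api.clear.ml/auth.login | jq -r .data.token)
curl -s -X POST https://api.clear.ml/tasks.get_all \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name": "Interactive Session"}'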
If the load balancer / API Gateway can do the computation and leverage caching, we’re much safer against DDoS attacks. In general, I’d prefer not to have our EC2 instance directly exposed to the public Internet.
So the problem came back even with this new URL. I discovered clearing your cookies fixes it.
The agent commands are nothing special.
clearml-agent daemon --queue sessions --cpu-only --create-queue true --docker
Man, I owe you lunch sometime @<1523701205467926528:profile|AgitatedDove14> . Thanks for being so detailed in your answers.
Okay! So the pipeline ID is really just a task ID. So cool!
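Which means (if I understand right) you can hit the regular task endpoints with a pipeline ID, e.g. (a sketch; $TOKEN from auth.login, $PIPELINE_ID copied from the UI):

# A pipeline's ID is just a task ID, so tasks.get_by_id works on it
curl -s -X POST https://api.clear.ml/tasks.get_by_id \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"task\": \"$PIPELINE_ID\"}"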
Not sure I fully understand what you mean here...
Sorry, I'll try again. Here's an illustrated example with AWS Step Functions (pretend this is a ClearML pipeline). If the pipeline fails, I'd want to have a chance to do some logic to react to that. Maybe in a step called "on_pipeline_failed" or someth...
@<1557175205510516736:profile|ShallowSwan53> at this point, I think this question deserves its own thread. I'm curious about it too!
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only one of them?)
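In other words, would something like this be legal? (Just a sketch of what I'm picturing.)

# Two agents on the same machine, both pinned to the single GPU (index 0)
clearml-agent daemon --queue default --gpus 0 --detached
clearml-agent daemon --queue default --gpus 0 --detached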
Sorry, clarifying:
The agent-services entry in the docker-compose file seems to add a single worker to the services queue
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get ...
Thanks for replying Martin! (as always)
Do you think ClearML is a strong option for running event-based training and batch inference jobs in production? That’d include monitoring and alerting. I’m afraid that Metaflow will look far more compelling to our teams for that reason.
Since it deploys onto step functions, the scheduling is managed for you and I believe alerts for failing jobs can be set up without adding custom code to every pipeline.
If that’s the case, then we’d probably only...
Hi. Yes, that totally makes sense. It’s just that we don’t want the logic that does the Jenkins trigger to live in a ClearML handler or task, but rather in a handler that acts as a subscriber in a pub-sub system.
This is because we have a pub-sub architecture that we already use; it can handle retries, etc. Also, we will likely want multiple systems to react to notifications in the pub-sub system. We already have a lot of setup for this.
I guess the conclusion is: I realize it’s possible...
you mean as experiment management / model registry / data? I think this is the bread&butter of clearml
💯. I was wondering if anyone had had experience using ClearML together with one of these others.
I think most of them are alternatives to metaflow
Totally.
Like, if you google "dagster and clearml" or "prefect and clearml" or "airflow and clearml" -- I don't find any blogs written by people talking about how they use both of them together.
That's strange to me, becau...
Oh interesting. Is the hope that doing that would somehow result in being able to use those credentials to make authenticated API calls?
Thank you! For now, it's kind of nice that it just picks up your credentials from your conf file. No extra setup required beyond the onboarding ClearML has you do 😄
And look! It's working, assuming you start the clearml session up yourself:
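(To be clear, "starting the session up yourself" just means running the CLI manually, something like this, assuming the sessions queue from earlier:)

# Manually start an interactive session on the queue the agent is watching
clearml-session --queue sessions --docker python:3.10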