Hey! Sorry, I don't think I ever solved this for elasticsearch.
I've also used Airflow and Dagster in prod, but not integrated them with an exp tracker.
Hey @<1523701482157772800:profile|AnxiousSeal95> ! I think ClearML's orchestrator is a great fit for ad-hoc experimentation, but not for (event-triggered) batch inference jobs that need to be relied on in production.
I'd only feel comfortable supporting pipelines that serve end users on a tool that is known for that, e.g. Metaflow, Dagster, or Airflow--mainly because those tools emphasize good monitoring and integration with the wider data ecosystem.
Dang! @<1590514584836378624:profile|AmiableSeaturtle81> awesome answer, thank you! You seem like an awesome person to know. Definitely connect if you'd like to talk ops stuff sometime.
Hi friends, I'm just seeing these new messages. I read these links and I agree with @<1557175205510516736:profile|ShallowSwan53> . It's nice that the webapp has these pages, but what is the workflow to actually use this registry?
Also, @<1557175205510516736:profile|ShallowSwan53> , do you have a specific workflow in mind that you're hoping to get from ClearML?
At BEN, we're experimenting with:
- BentoML for model serving. It's a Python REST framework a lot like FastAPI, but with some nice...
Oh duh, thanks. What about non-standard entrypoints (as opposed to arguments), like accelerate launch train.py?
Caching can be a reason. Say you do some heavy data loading / processing in step 1. Now you're developing step 2.
It'd be nice not to have to re-run Step 1 every time you want to test a change to step 2.
You could find a way to simply write the output of step 1 to disk and do everything in one step, or you could let ClearML handle that caching for you--with the added benefit that others collaborating remotely can also use the outputs of steps you've cached with ClearML.
Oh, there's parallelization as well. You could have step 1 gather the data, and then fan out to N parallel steps that all do different things with the data, for example hyperparameter tuning.
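Roughly, a sketch of what I mean using ClearML's pipeline decorators (the project/step names are just illustrative, and the exact decorator arguments may vary with your clearml version):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["dataset"], cache=True)
def load_data(n_rows: int = 100_000):
    # Pretend this is the heavy step 1; with cache=True, re-running the pipeline
    # with the same inputs reuses the stored output instead of recomputing it.
    import numpy as np
    import pandas as pd
    return pd.DataFrame({"x": np.random.rand(n_rows)})

@PipelineDecorator.component(return_values=["score"])
def evaluate(dataset, threshold: float):
    # Placeholder "step 2"; each call can be scheduled as its own task.
    return float((dataset["x"] > threshold).mean())

@PipelineDecorator.pipeline(name="caching example", project="examples", version="0.1")
def run_pipeline():
    dataset = load_data()
    # Fan out: these calls don't depend on each other, so the controller can run them in parallel.
    scores = [evaluate(dataset, t) for t in (0.25, 0.5, 0.75)]
    print(scores)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # drop this line to enqueue steps on remote agents
    run_pipeline()
```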
Actually that's wrong: really this is the current volume mount
'-v', '/tmp/clearml_agent.ssh.cbvchse1:/.ssh',
Could changing these values to /root/.ssh work? Do you know which user ClearML uses within the docker image?
I do agree with your earlier observation that the target of that mount seems wrong. I would think that the volume mount should be -v /root/.ssh:/root/.ssh, but instead it's -v /root.ssh:/.ssh
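As a rough workaround sketch while debugging (assuming Task.set_base_docker accepts docker_image/docker_arguments in the installed clearml version; the image and queue names are placeholders), you could try forcing the mount you'd expect from the task side:
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="ssh-mount-check")
task.set_base_docker(
    docker_image="python:3.10",
    # Explicitly mount the host's keys where git inside the container expects them
    docker_arguments="-v /root/.ssh:/root/.ssh:ro",
)
task.execute_remotely(queue_name="default")
```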
It's an Amazon Linux AMI with the AWS CLI pre-installed on it. It uses the AWS CLI to fetch the key from AWS SSM Parameter Store. It's granted read access to that SSM Parameter via the instance role.
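For reference, the same fetch could be done from Python with boto3 instead of the AWS CLI -- a rough sketch (the parameter name and destination path here are made up for illustration):
```python
import os
import boto3

def fetch_deploy_key(parameter_name="/clearml/git_deploy_key", dest="/root/.ssh/id_rsa"):
    # The instance role must allow ssm:GetParameter (and kms:Decrypt for SecureString) on this parameter
    ssm = boto3.client("ssm")
    value = ssm.get_parameter(Name=parameter_name, WithDecryption=True)["Parameter"]["Value"]
    with open(dest, "w") as f:
        f.write(value)
    os.chmod(dest, 0o600)  # ssh refuses private keys with loose permissions
```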
I don't see it as an argument in Task.init or Task.execute_remotely
Hey, thanks for responding!
Does there happen to be ClearML auto-logging... for MLflow? That would make it super easy for us to migrate our existing training/batch inference jobs to ClearML.
I can't think of any changes we might have made on our side to cause that.
Let's see. The task log? I think this is it.
Hi. Yes, that totally makes sense. It's just that we don't want the logic that does the Jenkins trigger to be in a ClearML handler or task, but rather a handler that acts as a subscriber in a pub-sub system.
This is because we already have a pub-sub architecture in use; it can handle retries, etc. Also, we will likely want multiple systems to react to notifications in the pub-sub system. We already have a lot of setup for this.
I guess the conclusion is: I realize it's possible...
That's fabulous. This is definitely how my team prefers to structure projects. I hadn't gotten around to trying that out in our POC of ClearML yet, but I'm certain this is how our group will solve this problem.
@<1523701205467926528:profile|AgitatedDove14> you beautiful person, this is terrific! I do believe SageMaker has some nice monitoring/data drift capabilities that seem interesting, but these points you have here will be a fantastic starting point for my team's analysis of the products. I think this will help balance some of the over-enthusiasm towards using the native AWS solution.
This thread should be immortalized. Super stoked to try this out!
My understanding may be bad. Say I have a single EC2 instance. Is that instance only able to handle one task at a time?
Or can I start multiple instances of the clearml-agent process on it and then have one task per agent?
And if that's the case, can we have multiple agents on the EC2 instance listening to the same queue, e.g. default? Or would this only work if they were listening to different queues?
Yes, it's pretty lame that a clearml-agent can only process one task at a time if it's not listening to a services queue.
Thanks for the response @<1523701205467926528:profile|AgitatedDove14> !
What would you consider an event?
I was thinking of the TriggerScheduler's definition of an event. Pretty much, anything the TriggerScheduler allows you to react to, it'd be great to be able to publish those events to a queue external to ClearML, e.g. a tag added to a model (or removed), a state in a task changing, etc. We'd want as much metadata about that event as possible. So if the event is due to a task...
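As a rough sketch of that idea (assuming ClearML's TriggerScheduler API and, say, an SQS queue as the external pub-sub target; the queue URL, project name, and trigger arguments are placeholders that may need adjusting):
```python
import json
import boto3
from clearml.automation import TriggerScheduler

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ml-events"  # placeholder

def publish_model_event(model_id):
    # Called by the scheduler when the trigger fires (with the id of the model
    # that fired it, if I read the trigger examples right). Forward the event and
    # let downstream subscribers (Jenkins, etc.) decide what to do with it.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"event": "model_tag_added", "model_id": model_id}),
    )

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
scheduler.add_model_trigger(
    schedule_function=publish_model_event,
    trigger_project="examples",
    trigger_on_tags=["deploy"],
)
scheduler.start()
```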
I see. Is it possible for two agents to be utilizing the same GPU? (like if the machine has a terrific GPU, but only the one?)
@<1557175205510516736:profile|ShallowSwan53> at this point, I think this question deserves its own thread. I'm curious about it too!
Actually, dumb question: how do I set the setup script for a task?
Or the log of the init script?
Here's a screenshot of a session where I first try to clone as ssm-user, but it fails; then I change to root and it succeeds.
That's with the key at /root/.ssh/id_rsa