But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
Thank you! I think it does. It’s just now dawning on me that, because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines. Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances and another queue for larger GPU-based instances.
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously. Like maybe four agents. Those agents could be used for more I/O-intensive tasks, such as writing results to our data warehouse. I think that would be a good use case for having a single resource handle multiple tasks concurrently.
Thanks for discussing this so thoroughly with me!
I will be starting with the AWS autoscaler script in the ClearML examples on GitHub. Do you happen to know whether, using that script, there is a straightforward way to provide a user-data.sh script? I imagine that’s how we would do things like fetching secrets from AWS Secrets Manager and starting the concurrent agents.
As an infrastructure engineer, I feel that this is a fairly significant shortcoming of ClearML.
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance) would
- simplify the experience for data scientists
- open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake). As it is, you can make this work, but if you start to get a high volume of events, you'd either be autoscaling until you were broke (one EC2 instance per streaming event :shocked_face_with_exploding_head: ), or your queue would have an impossibly long wait time. Automatic retraining would have similar problems at high volume, though retraining a model is probably a much lower-volume type of work than streaming.
That said, credit where credit is due: it's pretty amazing that ClearML allows you to orchestrate compute in a self-hosted manner without needing to have Kubernetes expertise on your team.
@<1523701205467926528:profile|AgitatedDove14> great! (I'm on the Pro version :) ).
Hi @<1546665634195050496:profile|SolidGoose91> , when configuring a new autoscaler you can click on '+ Add item' under compute resources and this will allow you to have another resource that is listening to another queue.
You need to set up all the resources to listen to the appropriate queues to enable this allocation of jobs according to resources.
Also in general - I wouldn't suggest having multiple autoscalers/resources listen to the same queue. 1 resource per queue. A good way to manage queues is by giving them relevant names - for example: strong cpu, weak cpu, strong gpu, medium gpu, ... etc
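To make the "1 resource per queue" advice concrete, here is a minimal sketch of a check you could run over your own layout. The mapping below is hypothetical (the instance types and queue names are only examples, and this is not the autoscaler's actual configuration schema); it just flags any queue that more than one resource listens to.

```python
from collections import Counter

# Hypothetical layout: resource (instance type) -> queue it listens to.
# Names are illustrative only, not the autoscaler's real config format.
RESOURCE_QUEUES = {
    "c5.xlarge": "weak cpu",
    "c5.9xlarge": "strong cpu",
    "g4dn.xlarge": "medium gpu",
    "p3.8xlarge": "strong gpu",
}


def shared_queues(resource_queues):
    """Return the queues that more than one resource listens to."""
    counts = Counter(resource_queues.values())
    return sorted(queue for queue, n in counts.items() if n > 1)


# An empty list means every queue has exactly one resource behind it.
print(shared_queues(RESOURCE_QUEUES))
```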
I'll try to describe the scenario I was thinking would cause ClearML to break down:
Assume:
- We've got a queue called "streaming"
- We've got an S3 bucket with images landing inside
- When the images land, they go into a queue
- When there are 100 images in the queue, we trigger a ClearML pipeline to ingest, transform, run inference on the batch, and then write the results somewhere
- Let's say we get 1,000,000 images in the bucket per hour. That would be 1,000,000 / 100 = 10,000 batches. So it'd be 10,000 triggers of our pipeline, likely with a lot of those batches being run concurrently, especially if the processing is heavy and takes a long time (like doing a style transfer or something generative on the images).
A pipeline might consist of 4 tasks. So we're asking our autoscaling fleet to fulfill 40,000 tasks in an hour.
Writing the task results to the database is probably a light process that only requires CPU and minimal resources.
By my understanding, a worker (which I assumed was an entire EC2 instance) can only process one task at a time. So in that hour, in order to have a low queue wait time, you may need something like 1,000 EC2 instances spun up to handle all of the incoming tasks.
I was thinking: if a worker (EC2 instance) is capable of fulfilling multiple jobs at the same time, you could do more tasks on one EC2 instance, allowing you to fully utilize the resources on the machine and thereby need fewer machines to get your work done.
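The arithmetic in this scenario can be sketched roughly as follows. The image rate, batch size, and tasks per pipeline come from the numbers above; the average task duration of 1.5 minutes is purely an assumption, picked because it reproduces the "something like 1,000 EC2 instances" estimate, and `agents_per_instance` is a hypothetical knob for packing several agents onto one machine.

```python
import math


def instances_needed(images_per_hour, batch_size, tasks_per_pipeline,
                     avg_task_minutes, agents_per_instance=1):
    """Rough number of instances needed to keep queue wait times low."""
    batches_per_hour = math.ceil(images_per_hour / batch_size)
    tasks_per_hour = batches_per_hour * tasks_per_pipeline
    # Average number of tasks in flight at once: arrival rate x duration.
    concurrent_tasks = tasks_per_hour * avg_task_minutes / 60
    return math.ceil(concurrent_tasks / agents_per_instance)


# One agent per instance vs. four agents per instance
# (the 1.5-minute task duration is an assumed value).
print(instances_needed(1_000_000, 100, 4, 1.5))
print(instances_needed(1_000_000, 100, 4, 1.5, agents_per_instance=4))
```

This treats all four tasks as equally heavy, which is optimistic; in practice only the light, I/O-bound tasks are good candidates for sharing an instance.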
Yep, although I'm quite sure you could build some logic on top of that to manage proper queueing
Can the “multiple agents on a single queue” scenario, combined with the autoscaler, spawn multiple agents on a single EC2 instance, by chance, please? (thinking e.g. 8 agents on an 8xGPU machine)
because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines.
Yes!
Or more specifically, they could run on different queues, and as you said in your other response, we could have a queue for smaller CPU-based instances and another queue for larger GPU-based instances.
Exactly !
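As a rough sketch of what that could look like with `PipelineController`, each step can be given its own `execution_queue`. The queue names, project name, and step names below are hypothetical placeholders, and `step_functions` is assumed to be a dict of your own Python functions keyed by step name.

```python
# Hypothetical step -> queue mapping; queue names are placeholders.
STEP_QUEUES = {
    "ingest": "cpu_small",
    "transform": "cpu_small",
    "inference": "gpu_large",
    "write_results": "cpu_small",
}


def build_pipeline(step_functions):
    """Chain the steps in order, sending each one to its own queue."""
    # Imported lazily so the mapping can be inspected without clearml installed.
    from clearml import PipelineController

    pipe = PipelineController(
        name="image-batch", project="streaming-demo", version="0.0.1"
    )
    previous = None
    for step_name, queue in STEP_QUEUES.items():
        pipe.add_function_step(
            name=step_name,
            function=step_functions[step_name],
            parents=[previous] if previous else None,
            execution_queue=queue,  # the agent queue this step is enqueued to
        )
        previous = step_name
    return pipe
```

Calling `build_pipeline(...).start(queue="services")` would then run the pipeline logic itself on the services queue while each step is picked up by whichever agents listen to its queue.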
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously. Like maybe four agents.
This is supported.
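A minimal sketch of starting several agents on one machine, assuming `clearml-agent` is installed and configured there. The queue name `cpu_io` is a placeholder, and whether several CPU-only agents on one host need distinct worker names is something to verify against the agent documentation.

```python
import subprocess


def agent_commands(queue, num_agents, gpus=False):
    """Build one `clearml-agent daemon` command per agent on this machine."""
    commands = []
    for index in range(num_agents):
        cmd = ["clearml-agent", "daemon", "--queue", queue, "--detached"]
        # GPU agents each get their own device index; CPU agents share the host.
        cmd += ["--gpus", str(index)] if gpus else ["--cpu-only"]
        commands.append(cmd)
    return commands


def start_agents(queue, num_agents, gpus=False):
    """Launch the agents; raises if any command exits with an error."""
    for cmd in agent_commands(queue, num_agents, gpus):
        subprocess.run(cmd, check=True)


# Example (not executed here): four CPU agents listening to one queue.
# start_agents("cpu_io", 4)
```

The same helper with `gpus=True` and `num_agents=8` would give one agent per GPU on an 8-GPU machine.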
Do you happen to know whether, using that script, there is a straightforward way to provide a user-data.sh script?
Yes, this is part of the configuration (there is a CLI based wizard, so should be relatively easy)
This thread should be immortalized. Super stoked to try this out!
Thanks @<1523701070390366208:profile|CostlyOstrich36> !
- I hadn’t found the multiple resources within the same autoscaler. Could you point me to the right place please? Are they all used interchangeably based upon availability, rather than based on job needs?
- We thought of using separate queues (we do that for CPU vs GPU queues), but having ClearML automatically dispatch to the right instance based on a job specification would be more flexible. (For example, we could then think of dispatching dynamically to the right instance type based on the hyperparameters we’re searching.)
- It also becomes much simpler when we start to mix single-GPU and multi-GPU jobs into the same queue.
Hi @<1546665634195050496:profile|SolidGoose91> , you can actually create multiple resources inside the same autoscaler application. I would suggest attaching each resource to different queues as well.
OK, so no way to have an automatic dispatch to different, correctly-sized instances, it’s only achievable by submitting to different queues?
Yes you can 🙂 (though not on the open-source version)
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance)
This is essentially a "queue". Basically a queue is a way to abstract a specific type of resource, so that you can achieve exactly what you described.
open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake).
Yes, that's exactly how ClearML is designed, am I missing something here @<1541954607595393024:profile|BattyCrocodile47> ?
That said, if you are thinking of multi-node load-balancing for streaming request processing, then you should take a look at clearml-serving.
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
Correct
So we could maybe initialize 4 instances of the agent on a single EC2 instance which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
Correct (that said, I do not understand how come a single Task does not utilize the CPU; I was under the impression it is running a model, see details below)
By my understanding, a worker (which I assumed was an entire EC2 instance)
Basically the assumption is that you are able to maximize the CPU/GPU on that instance (the specific DL/ML component); the other components you can run on other instances. The EC2 instances will not be shut down when they are done with a single Task, but only after they are idle for X minutes.
Pipeline logic (as opposed to a pipeline component) runs on the "services" agent machine, which runs multiple pipelines at the same time. The component itself runs on another machine (i.e. the pipeline logic launches it), and the actual compute is done on that machine. The AWS autoscaler can limit the number of concurrent "compute EC2" instances, and these just run the "inference" itself.
Does that make sense ?