Hi team! Is there a way to make ClearML’s AWS autoscaler and queues resource-aware, please? I.e. if we can say, as we enqueue our job, how much RAM or GPU RAM or even GPUs it needs, have the scheduler/autoscaler dispatch the job to instances that are of th

Unanswered

I'll try to describe the scenario I was thinking would cause ClearML to break down:

Assume:

We've got a queue called streaming
We've got an S3 bucket with images landing inside
When the images land, they go into a queue
When there are 100 images in the queue, we trigger a ClearML pipeline to ingest, transform, run inference on the batch, and then write the results somewhere
Let's say we get 1,000,000 images in the bucket per hour. That is 1,000,000 / 100 = 10,000 batches, so 10,000 triggers of our pipeline per hour, likely with many of those batches running concurrently, especially if the processing is heavy and takes a long time (like doing a style transfer or something generative on the images).
A pipeline might consist of 4 tasks. So we're asking our autoscaling fleet to fulfill 40,000 tasks in an hour.
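
To make the arithmetic above easy to re-run with other numbers, here is a minimal sketch. The variable names are only for illustration; the values are the ones assumed in this post.

```python
# Workload figures assumed in the scenario above.
images_per_hour = 1_000_000
batch_size = 100          # images per pipeline trigger
tasks_per_pipeline = 4    # tasks in one pipeline run

# One pipeline run is triggered per full batch.
batches_per_hour = images_per_hour // batch_size
# Every pipeline run enqueues tasks_per_pipeline tasks.
tasks_per_hour = batches_per_hour * tasks_per_pipeline

print(f"{batches_per_hour:,} pipeline runs/hour, {tasks_per_hour:,} tasks/hour")
```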

Writing the task results to the database is probably a light process that only requires CPU and minimal resources.

By my understanding, a worker (which I assumed was an entire EC2 instance) can only process one task at a time. So in that hour, in order to have a low queue wait time, you may need something like 1,000 EC2 instances spun up to handle all of the incoming tasks.

I was thinking: if a worker (EC2 instance) is capable of fulfilling multiple jobs at the same time, you could do more tasks on one EC2 instance, allowing you to fully utilize the resources on the machine and thereby need fewer machines to get your work done.
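
A rough way to see the effect of running several tasks per instance is to estimate how many tasks are in flight at once (arrival rate times mean task duration) and divide by the number of tasks one instance can run concurrently. This is only a back-of-the-envelope sketch: the 90-second mean task duration is an assumption I picked because it reproduces the "something like 1,000 instances" figure, and `instances_needed` is a hypothetical helper, not part of ClearML.

```python
import math

def instances_needed(tasks_per_hour, mean_task_seconds, tasks_per_instance=1):
    """Estimate the fleet size needed to keep queue wait time low.

    Tasks in flight = arrival rate (tasks/second) * mean task duration (seconds).
    """
    concurrent_tasks = tasks_per_hour * mean_task_seconds / 3600
    return math.ceil(concurrent_tasks / tasks_per_instance)

# Assumption: each task takes 90 seconds on average.
one_task_per_instance = instances_needed(40_000, 90)
four_tasks_per_instance = instances_needed(40_000, 90, tasks_per_instance=4)

print(one_task_per_instance, four_tasks_per_instance)
```

Under these assumptions, packing four tasks onto each instance cuts the fleet to a quarter of its size, provided the instance actually has the RAM, GPU RAM and GPUs for four tasks at once, which is why the resource requirements would need to be known at enqueue time.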

  				
Posted 2 years ago

BattyCrocodile47
