Hi ClearML community. I interviewed Nir Bar-Lev on the Practical AI podcast, so I had Allegro/ClearML in the back of my mind. I’m launching a new project at my org now, and I think ClearML might be a good fit. Questions that have come up are:
How well can the ML Ops component handle job queuing on a multi-GPU server (i.e., receiving jobs and scheduling them on the various GPUs or even combinations of GPUs)? We are buying an on-prem GPU server for this project, and this user/job management is something we will definitely need. Interested in the differences between Enterprise and community. Who would I talk with?
Thanks in advance! Excited to dive in a little deeper.

  				
Posted 4 years ago by GleamingGrasshopper63


Hi GleamingGrasshopper63

How well can the ML Ops component handle job queuing on a multi-GPU server

This is fully supported 🙂
You can think of queues as a way to simplify resources for users (you can do more than that, but let's start simple).
Basically, you can create a queue per type of GPU. For example, a list of queues could be: on_prem_1gpu, on_prem_2gpus, ..., ec2_t4, ec2_v100
Then, when you spin up the agents, you attach each agent to the "correct" queue for its type of machine.
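As a rough illustration of the queue-per-GPU-type idea, here is a minimal sketch that only builds the `clearml-agent daemon` command lines you might run on a 4-GPU server. The queue names come from the list above; the GPU split (two single-GPU agents plus one two-GPU agent) and the helper name `agent_commands` are assumptions made up for this example, and the queues are assumed to already exist on the server.

```python
# Hypothetical layout for one 4-GPU server: queue name -> groups of GPU indices.
# Queue names follow the post; the split itself is an assumption.
GPU_LAYOUT = {
    "on_prem_1gpu": [[0], [1]],
    "on_prem_2gpus": [[2, 3]],
}


def agent_commands(layout, docker=True):
    """Build one `clearml-agent daemon` command line per GPU group."""
    commands = []
    for queue, groups in layout.items():
        for gpus in groups:
            args = [
                "clearml-agent", "daemon", "--detached",
                "--queue", queue,
                "--gpus", ",".join(str(g) for g in gpus),
            ]
            if docker:
                # Run each job inside a Docker container on the node.
                args.append("--docker")
            commands.append(" ".join(args))
    return commands


if __name__ == "__main__":
    for cmd in agent_commands(GPU_LAYOUT):
        print(cmd)
```

Each printed line starts one agent that serves a single queue and sees only its own GPUs, so a job pushed to on_prem_2gpus lands on the agent that owns GPUs 2 and 3.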

Interested in the differences between Enterprise and community. Who would I talk with?

Well, I'm not sure I'm the guy for that, but I think the gist is that Enterprise (the paid tier) adds security and permissions, and expands the data management layer (basically adding a query layer to the datasets, just like a DB, only with versioning and links to files). Obviously hosting, support, etc. as well, but I guess that is a given.

  				
Posted 4 years ago by AgitatedDove14

This is great, AgitatedDove14. Thanks for the info on job queuing. That is one of the main things we are trying to enable. Do you work at Allegro, or is there someone else here I could talk with about Enterprise? I’m interested in the user management and permissions side of things, along with the data layer, depending of course on pricing (because we are a non-profit).

  				
Posted 4 years ago by GleamingGrasshopper63

AgitatedDove14 With the GPU job management, do you know if you can limit usage per user? That is, could I limit a certain user to using at most 2 GPUs at a time, or only particular GPUs, or something similar to that?

  				
Posted 4 years ago by GleamingGrasshopper63

Hi there, Evangelist for ClearML here. What you are describing is a conventional provisioning solution such as SLURM. And that works with ClearML as well ☺ Btw, as a survivor of such provisioning schemes, I don't think they are always worth it.

  				
Posted 4 years ago by GrumpyPenguin23

Yeah, I don’t necessarily want a traditional queuing system like in HPC clusters. I just want functional GPU management for users. As long as there is a job queue that works well, I can communicate the rest to the team for now.

  				
Posted 4 years ago by GleamingGrasshopper63

I would ideally just want to have NVIDIA drivers and Docker on the on-prem nodes (along with the clearML agents). Would that allow me to get by with basic job scheduling/queues through clearML?

  				
Posted 4 years ago by GleamingGrasshopper63

The short answer is "definitely yes", but to get maximum usage you will probably want to set up priority queues.

  				
Posted 4 years ago by GrumpyPenguin23

I would ideally just want to have NVIDIA drivers and Docker on the on-prem nodes (along with the clearML agents). Would that allow me to get by with basic job scheduling/queues through clearML?
Yes, this is fully supported and very easy to set up.
Regarding limiting users' usage: this is doable. I think the easiest solution, both for users and for management of the cluster, is introducing priority into the queues. Basically, any user can push jobs into a low-priority queue, and only some users can push into a higher-priority queue. This removes the need for a hardcoded limit and allows flexibility in terms of user resource usage. Obviously, you can always rearrange jobs inside a queue.
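To make the priority idea concrete, here is a small self-contained simulation (it does not call ClearML). It assumes an agent that is handed several queues and always drains the first non-empty one in the listed order, which is, as far as I know, how an agent started with something like `clearml-agent daemon --queue high_priority low_priority` behaves by default. The queue names, job names, and the `pull_next` helper are all made up for illustration.

```python
from collections import deque


def pull_next(queues, priority_order):
    """Return (queue_name, job) from the first non-empty queue, else None."""
    for name in priority_order:
        if queues[name]:
            return name, queues[name].popleft()
    return None


# Hypothetical queues: everyone may push to low_priority, while only
# selected users are allowed to push to high_priority.
queues = {
    "low_priority": deque(["sweep_1", "sweep_2"]),
    "high_priority": deque(["urgent_eval"]),
}
order = ["high_priority", "low_priority"]

# Simulate one agent pulling jobs until every queue is empty.
served = []
while (item := pull_next(queues, order)) is not None:
    served.append(item)

if __name__ == "__main__":
    for queue_name, job in served:
        print(f"{queue_name}: {job}")
```

The high-priority job runs first even though the low-priority jobs were submitted earlier, so no per-user GPU cap is needed: low-priority work simply waits whenever higher-priority work is pending.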

  				
Posted 4 years ago by AgitatedDove14

GrumpyPenguin23 And are priority queues something that exists in ClearML, or would that require some external queuing solution like SLURM?

  				
Posted 4 years ago by GleamingGrasshopper63

Totally within ClearML 🤘 🤘

  				
Posted 4 years ago by GrumpyPenguin23

Awesome. That’s what we want!

  				
Posted 4 years ago by GleamingGrasshopper63

I really want to avoid HPC craziness and keep things as simple as we can.

  				
Posted 4 years ago by GleamingGrasshopper63
