Hi All! I Am Currently Using A Self-Hosted Clearml Server And Was Looking To Integrate The Clearml Agent To Make Better Usage Of Our Hpc Resources With Gpu Autoscaling. I Am Aware That Clearml Already Supports Aws Autoscaler (In The Pro-Tier), But My Tea

Answered

Hi all! I am currently using a self-hosted ClearML server and was looking to integrate the ClearML Agent to make better use of our HPC resources with GPU autoscaling.

I am aware that ClearML already supports the AWS Autoscaler (in the pro tier), but my team doesn't have any AWS resources. Instead, we have been given some GPUs/CPUs hosted on a remote supercomputer. However, we need to allocate resources to ourselves manually, using an srun or sbatch command. I was wondering whether the Agent can interact directly with a SLURM cluster in this way, or whether the following is even possible:

- If no resources are available, e.g. all GPUs/CPUs are being used in interactive sessions by our devs, keep tasks in a buffer/queue. For this situation, I have some resources to keep a clearml-agent daemon permanently running and listening for a queue, as described in the docs. But this same machine is not capable of running ML tasks, so job execution needs to be dynamic.
- In a remote HPC cluster, if there are GPUs available for our group, spawn a node and allocate resources (using SLURM), and run the task.

Essentially, in the same way it works with AWS.

I am unsure of how this should be accomplished. Does ClearML-Agent support script execution as setup before running the task? For example, I could run a script to auto-generate an sbatch job.

Furthermore, would I need to install ClearML-Agent on the node? The environments must be rootless, and it seems like ClearML-Agent has to use Docker. Docker isn't available on our system, and I cannot find documentation for any Podman support within the ClearML docs. Any help or pointers would be appreciated, thanks!

  				
Posted one year ago by HighCoyote66


Answers 5

Hi HighCoyote66

"However, we need to allocate resources to ourselves manually, using an srun command or sbatch"

Long story short, there is a full SLURM integration: you push a job into the ClearML queue and it produces a SLURM job that uses the agent to set up the venv/container and run your Task. Unfortunately, this is only part of the enterprise version 😞
You can, however, do the following (note that this is pseudocode; I probably have a typo in the srun command):

Clone your Task in the UI
Copy the new Task ID
srun clearml-agent execute --id <task-id-here>

This will use SLURM to allocate the job and clearml-agent to actually set up the environment automatically and run your code (with the ability to override arguments from the UI, like you would regularly). The missing part is, of course, the integration with the queue system and the automation (which unfortunately is not part of the open source).
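For anyone who wants to script the manual steps above, here is a minimal Python sketch that wraps the same srun call. The function names and the --gres=gpu:N resource request are my own assumptions (adjust the request to whatever your cluster expects); only the clearml-agent execute --id part comes from the command above.

```python
import shlex
import subprocess


def build_srun_command(task_id: str, gpus: int = 1) -> list[str]:
    """Return an srun invocation that runs one ClearML Task via clearml-agent.

    Hypothetical helper: the name and the default GPU count are illustrative.
    """
    if not task_id:
        raise ValueError("task_id must be a non-empty ClearML Task ID")
    return [
        "srun",
        f"--gres=gpu:{gpus}",  # assumption: GPUs are requested through --gres
        "clearml-agent", "execute", "--id", task_id,
    ]


def run_task_on_slurm(task_id: str, gpus: int = 1) -> int:
    """Run the Task under srun and return the process exit code."""
    cmd = build_srun_command(task_id, gpus)
    print("Running:", shlex.join(cmd))
    # Blocks until SLURM has allocated the resources and the Task has finished.
    return subprocess.run(cmd).returncode
```

You would call run_task_on_slurm("<task-id-here>") from a login node after cloning the Task and copying its ID from the UI.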

  				
Posted one year ago by AgitatedDove14

Hey AgitatedDove14, I think the 'execute' function from the clearml-agent is great. I've been testing/using it for a few days, and, while it's a little more hands-on, it has been a great workaround for us uni students who have no budget 😂. I've been using clearml-agent execute <job_id> to run jobs on an HPC node. However, with this method I am not able to see the console in the web UI. I've been defining this
#!/bin/bash
#SBATCH --job-name=test_worker
#SBATCH --output=./logs/test_worker_%j.out
#SBATCH --error=./logs/test_worker_%j.err
in my sbatch script, and the only way I can see the logs is by manually logging into our HPC and viewing the logs directly using cat, tail, etc. Do you know if there's some way to redirect this output back into the web UI? Is there some API call from the docs I'm overlooking? Once again, thanks for all your help!

  				
Posted one year ago by HighCoyote66

Ohh, try adding --full-monitoring to the clearml-agent execute command.
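As a rough illustration, this Python sketch generates an sbatch script that keeps the #SBATCH headers from the question and appends --full-monitoring to the agent call. The function name and its defaults are hypothetical, and I have not verified on a real cluster how much of the console output ends up in the web UI.

```python
import shlex


def render_sbatch_script(task_id: str,
                         job_name: str = "test_worker",
                         log_dir: str = "./logs",
                         full_monitoring: bool = True) -> str:
    """Return the text of an sbatch script that runs one ClearML Task.

    Hypothetical helper: the name and defaults mirror the script in the question.
    """
    agent_cmd = f"clearml-agent execute --id {shlex.quote(task_id)}"
    if full_monitoring:
        # Flag suggested in this thread for getting task logging/monitoring.
        agent_cmd += " --full-monitoring"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={log_dir}/{job_name}_%j.out",
        f"#SBATCH --error={log_dir}/{job_name}_%j.err",
        "",
        agent_cmd,
    ]
    return "\n".join(lines) + "\n"
```

Write the returned text to a file and submit it with sbatch as usual; the local .out/.err files are still produced, so nothing is lost if the UI logging does not behave as hoped.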

  				
Posted one year ago by AgitatedDove14

"Is the clearml-agent queue not available in the open source?"

It is fully available in the open source. What is missing is the SLURM connection: in the open source, the daemon is installed per machine (node) and spins up containers/venvs on that machine. The enterprise version adds support so that it uses SLURM to provision the node. I hope it helps 🙂

"so do you think it would be possible to spin up another daemon, which listens to this daemon, which then runs a slurm job?"

This is exactly what the enterprise version does. I think there is some built-in assumption that only enterprises use SLURM.

"I want to emphasize that I do not mean to undermine your enterprise tier, but I am just trying to work with the limitations of my university's resources, which means I have to use our HPC resources."

Yep, totally with you, SLURM is very university HPC oriented 🙂 This is why I suggested the srun + clearml-agent execute approach, wdyt?
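To make the "another daemon that submits SLURM jobs" idea concrete, here is a minimal Python sketch of such a bridge loop. It is not how the enterprise integration works and it is not a ClearML API: pop_next_task_id is a hypothetical callable that you would have to implement yourself (it should return the next Task ID waiting in your queue, or None), and all function names are illustrative. The only real commands used are sbatch with --parsable and --wrap, plus clearml-agent execute --id.

```python
import subprocess
import time
from typing import Callable, List, Optional


def submit_with_sbatch(task_id: str) -> str:
    """Submit one Task as a SLURM batch job and return sbatch's parsable output."""
    result = subprocess.run(
        ["sbatch", "--parsable",
         f"--wrap=clearml-agent execute --id {task_id}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def bridge_queue_to_slurm(
    pop_next_task_id: Callable[[], Optional[str]],  # hypothetical, user-supplied
    submit: Callable[[str], object] = submit_with_sbatch,
    poll_seconds: float = 30.0,
    max_polls: Optional[int] = None,  # None means poll forever
) -> List[str]:
    """Poll for queued Task IDs and hand each one to SLURM."""
    submitted: List[str] = []
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        task_id = pop_next_task_id()
        if task_id is None:
            # Nothing queued: wait before asking again.
            time.sleep(poll_seconds)
            continue
        submit(task_id)  # SLURM itself holds the job until resources free up
        submitted.append(task_id)
    return submitted
```

The design choice here is to let SLURM do the waiting: the loop submits every queued Task immediately, and the scheduler keeps the job pending until GPUs are available for the group.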

  				
Posted one year ago by AgitatedDove14

Oh I see, I was confused about what "Agent Orchestration" meant on the website. Is the clearml-agent queue not available in the open source?
I see that you can do clearml-agent daemon --queue, so do you think it would be possible to spin up another daemon, which listens to this daemon, which then runs a slurm job?
I want to emphasize that I do not mean to undermine your enterprise tier, but I am just trying to work with the limitations of my university's resources, which means I have to use our HPC resources.

  				
Posted one year ago by HighCoyote66
