Answered
What Is The Suggested Way Of Running Trains-Agent With Slurm?

What is the suggested way of running trains-agent with SLURM? I was able to do a very naive setup: trains-agent runs a slurm job. It has the disadvantage that this slurm job is blocking a GPU even if the worker is not running any task. Is there an easy way to submit jobs to SLURM via the web UI without unnecessarily blocking GPUs in the cluster?
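For reference, the naive setup described above could be scripted roughly like this; the sbatch options, partition, and queue name are assumptions and not part of the original question:

```python
# Sketch of the "naive" setup: one slurm job per long-lived trains-agent daemon.
# The partition, GPU request, and queue name below are placeholders.
import subprocess
import tempfile

BATCH_SCRIPT = """#!/bin/bash
#SBATCH --job-name=trains-agent
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu

# The agent holds the GPU allocation even while idle: this is the drawback.
trains-agent daemon --queue default
"""

def submit_agent_job() -> str:
    """Write the wrapper script to a temp file, submit it with sbatch,
    and return slurm's confirmation line."""
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(BATCH_SCRIPT)
        script_path = f.name
    result = subprocess.run(
        ["sbatch", script_path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()  # e.g. "Submitted batch job 123456"

if __name__ == "__main__":
    print(submit_agent_job())
```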

  
  
Posted 3 years ago

Answers 30


AgitatedDove14 thanks, that will be helpful!

  
  
Posted 3 years ago

Do you have docker installed on all the slurm agent/worker machines?

Docker support?

  
  
Posted 3 years ago

Okay, what you can do is the following:
Assuming you want to launch task id aabb12, the actual slurm command will be:
trains-agent execute --full-monitoring --id aabb12
You can test it on your local machine as well.
Make sure the trains.conf is available in the slurm job
(use trains-agent --config-file to point to a globally shared one).
What do you think?
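For illustration, here is a rough sketch of submitting exactly that command as a slurm job from Python. Only the trains-agent command line and the --config-file hint come from the answer above; the sbatch options and the /shared/trains.conf path are assumptions:

```python
# Sketch: submit a single-task slurm job that runs
# `trains-agent execute --full-monitoring --id <task_id>`.
# sbatch options and the shared config path are placeholders.
import subprocess

SBATCH_TEMPLATE = """#!/bin/bash
#SBATCH --job-name=trains-{task_id}
#SBATCH --gres=gpu:1

trains-agent --config-file /shared/trains.conf execute --full-monitoring --id {task_id}
"""

def submit_task(task_id: str) -> str:
    """Render the batch script for one task and hand it to sbatch via stdin."""
    script = SBATCH_TEMPLATE.format(task_id=task_id)
    result = subprocess.run(
        ["sbatch"], input=script, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(submit_task("aabb12"))
```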

  
  
Posted 3 years ago

Unfortunately there is no docker; there is only Singularity. This cluster is used by many users, and docker is not secure enough.

  
  
Posted 3 years ago

AgitatedDove14 I need to finish working now, will be back in the evening and have a look.

  
  
Posted 3 years ago

NICE!

  
  
Posted 3 years ago

So far everything works. The only problem I can think of is a race condition (which I will probably ignore) that happens in the following scenario (see the sketch after this list for a possible mitigation):
a) a worker finishes its current run and turns into an idle state,
b) my script scrapes the status of the worker, which is idle,
c) a new task is enqueued and picked up by the worker,
d) the worker is killed after it managed to pull the task from the queue, so the task will be cancelled as well.
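A possible (partial) mitigation is to require a worker to look idle on two checks separated by a grace period before killing it. Below is a rough sketch; is_worker_idle() is a hypothetical helper that would query the trains server, and the worker-to-slurm-job mapping is assumed to be tracked elsewhere:

```python
# Sketch: cancel an agent's slurm job only if it is idle on two checks
# separated by a grace period. This narrows the race window described
# above but does not eliminate it.
import subprocess
import time

GRACE_PERIOD_SEC = 60  # how long a worker must stay idle before we kill it

def is_worker_idle(worker_id: str) -> bool:
    """Hypothetical helper: ask the trains server whether this worker is
    currently running a task. Implementation intentionally left out."""
    raise NotImplementedError

def cancel_if_still_idle(worker_id: str, slurm_job_id: str) -> bool:
    """Return True if the worker's slurm job was cancelled."""
    if not is_worker_idle(worker_id):
        return False
    time.sleep(GRACE_PERIOD_SEC)       # give the worker a chance to pick up a task
    if not is_worker_idle(worker_id):  # re-check: it may have pulled a task meanwhile
        return False
    subprocess.run(["scancel", slurm_job_id], check=True)
    return True
```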

  
  
Posted 3 years ago

Sure thing, and I agree it seems unlikely to be an issue 🙂

  
  
Posted 3 years ago

I implemented the first version and it seems to work.

  
  
Posted 3 years ago

Hi HealthyStarfish45
Funny, just today I had a similar discussion on slurm:
https://allegroai-trains.slack.com/archives/CTK20V944/p1603794531453000

Anyhow, when you say "[scale up agents]" are you referring to a machine constantly running an agent pulling jobs from the queue, where the machine itself (aka the resource) is managed as a slurm job?

  
  
Posted 3 years ago

AgitatedDove14 Is there a way to tell a worker that it should not take new tasks? If there is such a feature, one could avoid the race condition.

  
  
Posted 3 years ago

I hope you can do this without containers.

I think you should be fine; the only caveat is the CUDA drivers, and there is nothing we can do about that ...

  
  
Posted 3 years ago

I will only cancel daemons which are idle.

  
  
Posted 3 years ago

HealthyStarfish45 We are now working on improving the k8s glue (due to be finished next week); after that we can take a stab at slurm, which should be quite straightforward. Will you be able to help with a bit of testing (setting up a slurm cluster is always a bit of a hassle 🙂)?

  
  
Posted 3 years ago

HealthyStarfish45 if I understand correctly, the trains-agent is running as a daemon (i.e. automatically pulling jobs and executing them); the only point is that cancelling a daemon will cause the Task executed by that daemon to be cancelled as well.
Other than that, sounds great!

  
  
Posted 3 years ago

HealthyStarfish45

Is there a way to tell a worker that it should not take new tasks? If there is such a feature, one could avoid the race condition

Still undocumented, but yes, you can tag it as disabled.
Let me check exactly how.

  
  
Posted 3 years ago

AgitatedDove14 I looked at the K8s glue code, having something similar but for SLURM would be great!

  
  
Posted 3 years ago

sure, we can deal with the drivers

  
  
Posted 3 years ago

AgitatedDove14 going back to the slurm subject, I have trains installed locally on the cluster with slurm, so I am ready to test. At the same time I was wondering whether a simple solution would do the job (see the sketch after this list):
a) [scale up agents] monitor the trains queue; if a task has not been started for some amount of time and the number of agents is not yet at the maximum, add an agent,
b) [scale down agents] if all the tasks are running and there are idle agents, kill an idle agent.
Or do you think that your glue code would be simple to adapt?
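A rough sketch of what that monitoring loop could look like; all the helper functions are hypothetical placeholders (they would query the trains server and wrap the sbatch/scancel calls from the sketches above), and the thresholds are assumptions:

```python
# Sketch of the proposed scale-up / scale-down loop. Helpers are hypothetical
# placeholders; thresholds and polling interval are assumptions.
import time

MAX_AGENTS = 8
PENDING_AGE_THRESHOLD_SEC = 300   # how long a task may wait before we add an agent
POLL_INTERVAL_SEC = 60

def oldest_pending_task_age() -> float:
    """Hypothetical: seconds the oldest still-queued task has been waiting (0 if none)."""
    raise NotImplementedError

def running_agent_count() -> int:
    """Hypothetical: number of trains-agent slurm jobs currently running."""
    raise NotImplementedError

def idle_agents() -> list:
    """Hypothetical: [(worker_id, slurm_job_id)] for agents not executing a task."""
    raise NotImplementedError

def scale_loop(spin_up_agent, cancel_agent):
    """spin_up_agent() submits a new agent slurm job; cancel_agent(worker, job) kills one."""
    while True:
        # a) scale up: a task has waited too long and we are below the cap
        if (oldest_pending_task_age() > PENDING_AGE_THRESHOLD_SEC
                and running_agent_count() < MAX_AGENTS):
            spin_up_agent()
        # b) scale down: nothing is waiting, so retire one idle agent per cycle
        elif oldest_pending_task_age() == 0:
            for worker_id, job_id in idle_agents()[:1]:
                cancel_agent(worker_id, job_id)
        time.sleep(POLL_INTERVAL_SEC)
```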

  
  
Posted 3 years ago

that's ok, I think that the race condition will be a non-issue. Thanks for checking!

  
  
Posted 3 years ago

I hope you can do this without containers.

  
  
Posted 3 years ago

But I can't do this from the web UI, can I?

  
  
Posted 3 years ago

Okay, this is more complicated, but possible.
The idea is to write a glue layer (service) that pulls from the (i.e. UI) queue,
sets up the slurm job,
and puts the task in a pending queue (so you know the job is waiting in the slurm scheduler).
There is a template here:
https://github.com/allegroai/trains-agent/blob/master/trains_agent/glue/k8s.py
I would love to help and set up a slurm glue in a similar manner.
What do you think?
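In the same spirit as the k8s glue template linked above, here is a very rough sketch of what a slurm glue loop could look like; the queue names and the two queue helpers are hypothetical placeholders, not actual trains API calls:

```python
# Sketch of a slurm glue service: pull a task from the UI-facing queue,
# submit a slurm job that executes it, and park the task in a "pending"
# queue so it is visible as waiting for the slurm scheduler.
# Queue names and the two queue helpers are hypothetical placeholders.
import subprocess
import time

UI_QUEUE = "slurm"            # queue the web UI enqueues into (assumed name)
PENDING_QUEUE = "slurm_pending"

def pop_next_task(queue_name: str):
    """Hypothetical: dequeue and return the next task id, or None if empty."""
    raise NotImplementedError

def move_to_queue(task_id: str, queue_name: str) -> None:
    """Hypothetical: re-enqueue the task so its state is visible in the UI."""
    raise NotImplementedError

def submit_slurm_execute(task_id: str) -> None:
    """Submit a slurm job that runs the task (see the sbatch sketch above)."""
    script = (
        "#!/bin/bash\n"
        f"trains-agent --config-file /shared/trains.conf execute --full-monitoring --id {task_id}\n"
    )
    subprocess.run(["sbatch"], input=script, text=True, check=True)

def glue_loop(poll_interval_sec: int = 15) -> None:
    while True:
        task_id = pop_next_task(UI_QUEUE)
        if task_id:
            submit_slurm_execute(task_id)
            move_to_queue(task_id, PENDING_QUEUE)
        else:
            time.sleep(poll_interval_sec)
```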

  
  
Posted 3 years ago

but I need to dig deeper into the architecture to understand what exactly we need from the k8s glue.

Once you do, feel free to share. Basically there are two options: use the k8s scheduler with dynamic pods, or spin up the trains-agent as a service pod and let it spin up the jobs.

  
  
Posted 3 years ago

HealthyStarfish45 could you take a look at the code and see if it makes sense to you?
What I'm getting at is: maybe we build a template, and then you could fill in the gaps?

  
  
Posted 3 years ago

that would be great!

  
  
Posted 3 years ago

yes, happy to help! In fact I am also interested in the k8s glue, since in one of our use cases we are using jobs and not pods (to allow for spot instances in the cloud), but I need to dig deeper into the architecture to understand what exactly we need from the k8s glue.

  
  
Posted 3 years ago

HealthyStarfish45 my apologies, they do have it (this ability needs support from both trains-agent and the server), but not in the open-source version ...

  
  
Posted 3 years ago

AgitatedDove14 I meant the following scenario:
trains-agents will be running as slurm jobs (possibly for a very long time); there is a program running on an access node of the cluster (where no computation happens, but from where one can submit jobs to slurm); this program checks whether there are too few or too many agents running and adjusts their number by cancelling some or spinning up new ones.

  
  
Posted 3 years ago

AgitatedDove14 I do not want to push you in any way, but if you could give me an estimate of the slurm glue code, that would be helpful. I should have a local installation of the trains server to experiment with next week.

  
  
Posted 3 years ago