ClearML FAQ | I'm Probably Stupid, But How Do I Specify Worker Name? Use Case - I Want To Create Two Workers Using The Same GPU, And New Worker Just Overwrites The Old One

Answered

I'm probably stupid, but how do I specify a worker name? Use case: I want to create two workers using the same GPU, and the new worker just overwrites the old one.

Posted 5 years ago by DilapidatedParrot58

Answers 25

I think this one is on us, I don't think a search would have led you to the correct answer...
I'll try to make sure they add something regarding the configuration 🙂

Posted 5 years ago by AgitatedDove14

Thanks! I need to read all parts of the documentation really carefully =) For some reason, I couldn't find this section.

Posted 5 years ago by DilapidatedParrot58

https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L25

Posted 5 years ago by AgitatedDove14

Ohhhh, okay, as long as you know they might fail on memory...

Posted 5 years ago by AgitatedDove14

🙂

Posted 5 years ago by AgitatedDove14

perfect!

Posted 5 years ago by DilapidatedParrot58

Not sure what the "right way" is 🙂
But I do:
pkill -f "trains-agent --gpus 0"
This will kill a process that was started as "trains-agent --gpus 0". Notice that it matches the command pattern, so it has to match the way you executed the agent. You can check it with:
ps -Af | grep trains-agent

Posted 5 years ago by AgitatedDove14
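Since pkill -f stops every process whose command line matches, it can help to list the matching PIDs first and then kill only the one you want. Below is a minimal sketch of that idea. The helper name agent_pids is hypothetical, and it assumes the usual ps -Af layout where the PID is the second column.

```shell
#!/bin/sh
# Sketch: print the PIDs of processes whose command line contains a pattern.
# Reads `ps -Af` output on stdin; agent_pids is a hypothetical helper name.
# Assumes the PID is the second whitespace-separated column of `ps -Af`.
agent_pids() {
    # $1 = literal substring to look for, e.g. "trains-agent --gpus 0"
    awk -v pat="$1" 'index($0, pat) > 0 { print $2 }'
}

# Usage (capture the ps output first so the filter never matches itself):
#   ps_out="$(ps -Af)"
#   printf '%s\n' "$ps_out" | agent_pids "trains-agent --gpus 0"
#   kill <PID>    # stops only the agent you picked
```

The pattern still has to match the way the agent was launched, as noted above. Agents that differ only by an environment variable have identical command lines, so they cannot be told apart this way.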

DilapidatedParrot58 no, don't say that, you are wonderful 😉

trains-agent --gpus 0 --queue my_queue -d
should create a worker machine:gpu0.
Then you can do trains-agent --gpus 1 --queue my_queue -d, which will create machine:gpu1.

Posted 5 years ago by AgitatedDove14
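For a machine with several GPUs, the same command can be repeated in a loop, one agent per GPU. This is only a sketch: the flags are copied as written in the post above, while NUM_GPUS, QUEUE, DRY_RUN and the function name are hypothetical knobs added for illustration. With DRY_RUN left at its default the script only prints the commands, so you can check them against trains-agent --help before running anything.

```shell
#!/bin/sh
# Sketch: start one agent per GPU, all serving the same queue.
# Flags are as written in the thread; NUM_GPUS, QUEUE and DRY_RUN are
# hypothetical settings for this example only.
NUM_GPUS="${NUM_GPUS:-4}"
QUEUE="${QUEUE:-my_queue}"

start_agents_per_gpu() {
    gpu=0
    while [ "$gpu" -lt "$NUM_GPUS" ]; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: print the command instead of executing it.
            echo "trains-agent --gpus $gpu --queue $QUEUE -d"
        else
            trains-agent --gpus "$gpu" --queue "$QUEUE" -d
        fi
        gpu=$((gpu + 1))
    done
}

start_agents_per_gpu
```

Set DRY_RUN=0 to actually launch the agents.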

Yes, I mean removing the agent from the server.

Posted 5 years ago by MysteriousBee56

TRAINS_WORKER_NAME=first_agent trains-agent --gpus 0
and
TRAINS_WORKER_NAME=second_agent trains-agent --gpus 0

Posted 5 years ago by AgitatedDove14
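The two commands above can be wrapped in a small helper so that each agent on a shared GPU gets its own worker name. This is a sketch under assumptions: TRAINS_WORKER_NAME and the flags come from this thread, while the function name, the my_queue queue and the DRY_RUN switch are hypothetical. By default it only prints what it would run.

```shell
#!/bin/sh
# Sketch: start two agents on the same GPU under distinct worker names.
# TRAINS_WORKER_NAME and the flags are taken from the thread; the helper
# name, the queue name and DRY_RUN are hypothetical.
start_named_agent() {
    # $1 = worker name, $2 = GPU index, $3 = queue name
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: print the command instead of executing it.
        echo "TRAINS_WORKER_NAME=$1 trains-agent --gpus $2 --queue $3 -d"
    else
        TRAINS_WORKER_NAME="$1" trains-agent --gpus "$2" --queue "$3" -d
    fi
}

start_named_agent first_agent 0 my_queue
start_named_agent second_agent 0 my_queue
```

Both jobs will share the GPU's memory, so it is up to you to make sure they fit together.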

https://allegro.ai/docs/references/trains_ref/#agent-section

Posted 5 years ago by AgitatedDove14

MysteriousBee56, the agent is not running on the "server", it's running on its own machine.
The server just reflects the fact that the agent is up.
To actually take it down, you need to SSH into (or otherwise connect to) that machine and stop the actual trains-agent process.
What exactly is the scenario you had in mind?

Posted 5 years ago by AgitatedDove14

Ohh, now I get it...
Wait a couple of hours, 0.16 is out today with the trains-agent --stop flag 🙂

Posted 5 years ago by AgitatedDove14

Oops, you misunderstood me. I just want to remove a specific agent. For example, I had 3 agents on the same queue with different worker names. If I remove them by applying what you said in this thread, all of them will be removed. However, I just want to remove one of them.

Posted 5 years ago by MysteriousBee56

Our GPUs are 48GB, so it's quite wasteful to only run one job per GPU.
Yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory error, but still.

Posted 5 years ago by DilapidatedParrot58

We should probably have a section on that (i.e. running two agents on the same GPU, then explain how to use it).

Posted 5 years ago by AgitatedDove14

The weird part is that the old job continues running when I recreate the worker and enqueue the new job.

Posted 5 years ago by DilapidatedParrot58

AgitatedDove14 Is it possible to delete a specific worker? I mean, I have 10 workers and I want to delete one of them.

Posted 5 years ago by MysteriousBee56

Another stupid question: what is the proper way to delete a worker? So far I've been using pgrep to find the relevant PID 😃

Posted 5 years ago by DilapidatedParrot58

You mean why you have two processes?

Posted 5 years ago by AgitatedDove14

let me check

Posted 5 years ago by AgitatedDove14

MysteriousBee56 what do you mean by "delete a worker"?
Stop the agent running remotely?

Posted 5 years ago by AgitatedDove14

Is it in the documentation somewhere?

Posted 5 years ago by DilapidatedParrot58

Well okay, it's probably not that weird, considering that the worker just runs the container.

Posted 5 years ago by DilapidatedParrot58

That's right, I have 4 GPUs and 4 workers. But what if I want to run two jobs simultaneously on the same GPU?

Posted 5 years ago by DilapidatedParrot58
