Ohh now I get it...
Wait a couple of hours, 0.16 is out today with trains-agent --stop flag 🙂
We should probably have a section on that (i.e. running two agents on the same GPU, then explain how to use it)
I think this one is on us, I don't think a search would have led you to the correct answer ...
I'll try to make sure they add something regarding the configuration 🙂
AgitatedDove14 Is it possible to delete a specific worker? I mean, I have 10 workers and I want to delete just one of them.
TRAINS_WORKER_NAME=first_agent trains-agent --gpus 0
and TRAINS_WORKER_NAME=second_agent trains-agent --gpus 0
not sure what is the "right way" 🙂
But I do pkill -f "trains-agent --gpus 0"
This will kill any process whose command line matches "trains-agent --gpus 0". Notice that it matches the full command pattern, so it has to match the way you executed the agent. You can check it with ps -Af | grep trains-agent
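Since pkill -f kills everything that matches the pattern, it can help to preview the matches first. A minimal sketch, assuming a Linux machine with the procps-ng versions of pgrep/pkill (where -a prints the full command line); the function names preview_matches and stop_matches are made up for this example:

```shell
#!/usr/bin/env bash
# List PID and full command line of every process that pkill -f would hit.
# Assumes procps-ng pgrep, where -a means "print the full command line".
preview_matches() {
    pgrep -af -- "$1"
}

# Send SIGTERM to every process whose command line matches the pattern.
# Run this only after the preview shows exactly the agents you expect.
stop_matches() {
    pkill -f -- "$1"
}

# Usage (not executed here):
#   preview_matches "trains-agent --gpus 0"
#   stop_matches "trains-agent --gpus 0"
```

The pattern is matched against the whole command line, so agents started with different flags (for example --gpus 1) are left alone, while agents started with the identical command are all matched together.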
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID 😃
Ohhhh, okay, as long as you know they might fail on memory...
thanks! I need to read all parts of documentation really carefully =) for some reason, couldn't find this section
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory error, but still
Oops, you misunderstood me. I just want to remove a specific agent. For example, I had 3 agents on the same queue with different worker names. If I remove them by applying what you said in this thread, all of them will be removed. However, I just want to remove one of them.
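One possible way to single out one agent when several were started with the same command line but different TRAINS_WORKER_NAME values: pkill -f only sees the command line, not the environment, but on Linux the initial environment of a process is readable from /proc/<pid>/environ. A rough sketch under that assumption (Linux only, same user or root); pids_with_env is a hypothetical helper written for this example, not part of trains-agent:

```shell
#!/usr/bin/env bash
# Print the PIDs of all processes whose initial environment contains the
# exact NAME=VALUE entry passed as $1. Linux only: reads /proc/<pid>/environ.
pids_with_env() {
    local entry="$1" envfile pid
    for envfile in /proc/[0-9]*/environ; do
        pid="${envfile#/proc/}"
        pid="${pid%/environ}"
        # environ is NUL-separated, so turn it into one entry per line.
        # Files we cannot read (other users, exited processes) are skipped.
        if { tr '\0' '\n' < "$envfile"; } 2>/dev/null | grep -qxF -- "$entry"; then
            echo "$pid"
        fi
    done
}

# Usage (not executed here):
#   pids_with_env "TRAINS_WORKER_NAME=second_agent"
#   kill <pid printed above>
```

Child processes inherit the environment, so anything the agent launched will usually show up in the list as well; check the PIDs with ps before killing anything.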
DilapidatedDucks58 no don't say that, you are wonderful 😉
trains-agent --gpus 0 --queue my_queue -d
should create a worker machine:gpu0
Then you can do trains-agent --gpus 1 --queue my_queue -d
which will create machine:gpu1
well okay, it's probably not that weird considering that worker just runs the container
MysteriousBee56 , The agent is not running on the "server" it's running on its machine.
The server just reflects the fact that the agent is up.
To actually take it down you need to SSH (or connect to that machine) and stop the actual trains-agent process.
What is exactly the scenario you had in mind?
Yes, I mean removing agent from the server
the weird part is that the old job continues running when I recreate the worker and enqueue the new job
MysteriousBee56 what do you mean "delete a worker"
stop the agent running remotely ?
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?