the weird part is that the old job continues running when I recreate the worker and enqueue the new job
well okay, it's probably not that weird considering that worker just runs the container
DilapidatedDucks58 no don't say that, you are wonderful 😉
trains-agent --gpus 0 --queue my_queue -d
should create a worker machine:gpu0
Then you can do trains-agent --gpus 1 --queue my_queue -d
which will create machine:gpu1
that's right, I have 4 GPUs and 4 workers. but what if I want to run two jobs simultaneously on the same GPU?
Ohhhh, okay, as long as you know they might fail on memory...
our GPUs are 48GB, so it's quite wasteful to only run one job per GPU
yeah, I'm aware of that, I would have to make sure they don't fail with the infamous CUDA out of memory error, but still
TRAINS_WORKER_NAME=first_agent trains-agent --gpus 0
and TRAINS_WORKER_NAME=second_agent trains-agent --gpus 0
We should probably have a section on that (i.e. running two agents on the same GPU, then explain how to use it)
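For reference, a minimal sketch of the two-agents-on-one-GPU setup, mirroring the commands as written in this thread. The helper name agent_cmds is hypothetical, and adding --queue my_queue -d is an assumption carried over from the earlier per-GPU commands. It only prints the launch commands (a dry run), so nothing is started until you run the printed lines yourself:

```shell
# Hypothetical helper: print one launch command per worker name,
# all pinned to the same GPU and queue. Nothing is executed.
agent_cmds() {
  gpu="$1"
  queue="$2"
  shift 2
  for name in "$@"; do
    # Each agent needs a distinct TRAINS_WORKER_NAME so it registers
    # as its own worker instead of colliding with the other one.
    printf 'TRAINS_WORKER_NAME=%s trains-agent --gpus %s --queue %s -d\n' \
      "$name" "$gpu" "$queue"
  done
}

agent_cmds 0 my_queue first_agent second_agent
```

Both agents will compete for the same GPU memory, so the jobs still have to fit together.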
thanks! I need to read all parts of documentation really carefully =) for some reason, couldn't find this section
I think this one is on us, I don't think a search would have led you to the correct answer ...
I'll try to make sure they add something regarding the configuration 🙂
another stupid question - what is the proper way to delete a worker? so far I've been using pgrep to find the relevant PID 😃
not sure what is the "right way" 🙂
But I do pkill -f "trains-agent --gpus 0"
This will kill any process that was started as "trains-agent --gpus 0". Notice it matches the command-line pattern, so it has to match the way you executed the agent. You can check it with ps -Af | grep trains-agent
AgitatedDove14 Is it possible to delete a specific worker? I mean, I have 10 workers and I want to delete one of them?
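A small sketch of checking the pattern before killing anything, assuming pgrep is available on the machine (it was mentioned above). The helper name preview_agents is hypothetical. pgrep -f matches against the full command line the same way pkill -f does, so the PIDs it prints are the ones pkill would hit:

```shell
# Hypothetical helper: list the PIDs a pkill -f pattern would match.
preview_agents() {
  # -f matches the pattern against the full command line, not just the name
  pgrep -f "$1" || echo "no process matches: $1"
}

preview_agents "trains-agent --gpus 0"
# Once the list looks right, kill with the same pattern:
#   pkill -f "trains-agent --gpus 0"
```

If several agents were started with the same command line, the pattern matches all of them, so a more specific pattern or a single PID is needed to stop just one.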
MysteriousBee56 what do you mean "delete a worker"
stop the agent running remotely ?
Yes, I mean removing agent from the server
MysteriousBee56 , The agent is not running on the "server" it's running on its machine.
The server just reflects the fact that the agent is up.
To actually take it down you need to SSH (or connect to that machine) and stop the actual trains-agent process.
What is exactly the scenario you had in mind?
Oops, you misunderstood me. I just want to remove one specific agent. For example, I have 3 agents on the same queue with different worker names. If I remove them the way you described in this thread, all of them will be removed. However, I just want to remove one of them.
Ohh now I get it...
Wait a couple of hours, 0.16 is out today with a trains-agent --stop flag 🙂