AbruptWorm50 - does the issue still occur, or did you manage to resolve it?
Looking at the 2nd image you sent, I see that in addition to the "services" queue, you also have queues called "training" and "training_2" - and the experiment you circled is in the "training" queue. In that image there are no experiments in the "services" queue.
If you press on the "services" queue (like you did in the first image) you can view the experiments in the queue and the workers. Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
CumbersomeCormorant74 As you can see in the attached - there were 2 experiments at the same time, but only one agent pulled the task, even though the second agent was free and listening to the queue.
AbruptWorm50 - the agents poll the queue, so any free agent can pull tasks. From the graph on the right, it looks like experiments were not waiting in the queue (the maximum number of queued experiments is 1, and it was pulled immediately). Can you also check what happens if two experiments are enqueued at the same time?
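To make the polling behaviour described above concrete, here is a minimal toy sketch in plain Python. It does not use the ClearML API: the function `poll_once` and the agent and task names are hypothetical, and it assumes each idle agent pulls at most one task per polling round.

```python
from collections import deque


def poll_once(pending, running):
    """One polling round: every idle agent pulls at most one pending task.

    `pending` is a deque of task names (the queue).
    `running` maps an agent name to its current task, or None if idle.
    """
    for agent, task in running.items():
        if task is None and pending:
            running[agent] = pending.popleft()


# Two tasks enqueued at the same time, two idle agents on the same queue.
pending = deque(["task_a", "task_b"])
running = {"agent_1": None, "agent_2": None}

poll_once(pending, running)
print(running)
print("still pending:", list(pending))
```

Under these assumptions, two tasks enqueued together end up on two different agents, which is the case worth reproducing on the real setup.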
I will elaborate on the situation:
I have 2 agents - training and training_2. They are both listening to the services queue, but only 'training' pulls the tasks. At the beginning I had 2 tasks in the services queue. Then agent 'training' pulled one and is currently executing it, but for some reason it also pulled the 2nd task into its queue, even though this agent is not free and I have another agent that is free: 'training_2'.
AbruptWorm50 - just to make sure there is no misunderstanding - the last image you sent is on the "training" queue and not on the "services" queue. Are there free agents running on that queue?
Can you check if there is a situation where there are tasks pending in that queue while one of the workers is idle?
In which queue? In services there are no pending tasks because they were all pulled by 'training'.
Just to make sure, how do you start the agents? Are you using the --services-mode option?
I used clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
Wait, so you have two agents:
one running in normal mode and monitoring the training queue
another running in services mode, monitoring the services queue?
We have been trying to resolve the issue. I will comment here again if any more problems arise. Thanks!
AbruptWorm50 just to make sure, how do you start the agents? Are you using the --services-mode option?
Hi AbruptWorm50 ,
Should be working - this is basically only up to the agents... Can you perhaps share the agents' logs?
Just to clarify again - when I start the agents I run: clearml-agent daemon --detached --queue training
and then: clearml-agent daemon --detached --services-mode --queue services --docker ubuntu:18.04 --cpu-only
This is why there are 'training' and 'training_2' queues.
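A toy model of the setup started by the two commands above may help when reasoning about which agent can pull what. It is only a sketch in plain Python, not ClearML code: `ToyAgent` and all names are hypothetical, and it assumes (as I understand the clearml-agent documentation) that a normal-mode agent runs one task at a time while a `--services-mode` agent may run several tasks concurrently, and that each agent only polls the queue passed via `--queue`.

```python
from collections import deque


class ToyAgent:
    """Hypothetical stand-in for an agent; not part of ClearML."""

    def __init__(self, name, queue_name, max_concurrent=1):
        self.name = name
        self.queue_name = queue_name          # the only queue this agent polls
        self.max_concurrent = max_concurrent  # None means no limit (assumed services mode)
        self.running = []

    def has_capacity(self):
        return self.max_concurrent is None or len(self.running) < self.max_concurrent

    def poll(self, queues):
        """Pull tasks from this agent's own queue while it has capacity."""
        pending = queues[self.queue_name]
        while pending and self.has_capacity():
            self.running.append(pending.popleft())


# Two tasks waiting in "services", nothing in "training".
queues = {
    "training": deque(),
    "services": deque(["svc_task_1", "svc_task_2"]),
}

normal_agent = ToyAgent("normal_agent", "training")                         # like: --queue training
services_agent = ToyAgent("services_agent", "services", max_concurrent=None)  # like: --services-mode --queue services

for agent in (normal_agent, services_agent):
    agent.poll(queues)

print(normal_agent.name, normal_agent.running)
print(services_agent.name, services_agent.running)
```

In this model both tasks land on the services-mode agent while the normal-mode agent stays idle, because it never looks at the services queue. Whether that matches the real behaviour here is best confirmed from the agents' logs.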