Hi SweetBadger76
Further investigation showed that the worker was created with a dedicated CLEARML_HOST_IP
- so running
clearml-agent daemon --stop
didn't kill it (but it did still appear in clearml-agent list)
But once we added the CLEARML_HOST_IP
CLEARML_HOST_IP=X.X.X.X clearml-agent daemon --stop
it finally killed it.
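For reference, a minimal sketch of that sequence (X.X.X.X is a placeholder for whatever IP the agent was registered with):
# without the host IP, --stop cannot match the worker
clearml-agent daemon --stop
# passing the same CLEARML_HOST_IP the agent was started with lets --stop find and kill it
CLEARML_HOST_IP=X.X.X.X clearml-agent daemon --stop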
The worker name is part of the key, so worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
means the worker name in this case is clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
and the command you're using to run the agent?
Question - if we change the
clearml.conf
do we need to stop and start the daemon?
yes
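for example, a minimal sketch (the queue name is just a placeholder):
# stop the running daemon so it releases the old configuration
clearml-agent daemon --stop
# start it again so it picks up the edited ~/clearml.conf
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached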
agree -
we understand now that the worker is the default worker that is installed after running pip install clearml-agent
is it possible to remove it? since all the tasks that use this worker don't have the correct credentials.
btw can you screenshot your clearml-agent list and UI please?
Sorry - I'm a Helm newbie
when running helm search repo clearml --versions
I can't see version 3.6.2 - the highest is 3.5.0
This is the repo that we used to get the helm chart: helm repo add allegroai
What am I missing?
Sorry -
After updating the repo I can see that the newest chart is 4.1.1
SweetBadger76 should I update to this version?
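For reference, the repo refresh that surfaced 4.1.1 (standard Helm commands, nothing ClearML-specific assumed):
# refresh the local index of the added repo
helm repo update
# list all chart versions now available under the clearml name
helm search repo clearml --versions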
Well, it seems that we have a similar issue: https://github.com/allegroai/clearml-agent/issues/86
we are not able to reference this orphan worker (it does not show up with ps -ef | grep clearml-agent)
but it still appears with clearml-agent list
and we are not able to stop it with clearml-agent daemon --stop clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
getting Could not find a running clearml-agent instance with worker_name=clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0 worker_id=clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
However - if we create a different worker
we are able to use it and clone the repo, e.g. CLEARML_WORKER_NAME=my_worker CLEARML_WORKER_ID=my_worker clearml-agent daemon --detached --queue my_queue
latest version? only the clearml chart?
hey OutrageousSheep60
what about the process? there must be one clearml-agent process that runs somewhere, and that is why it can continue reporting to the server
Here is the screenshot - we deleted all the workers - except for the one that we couldn't
Hi SweetBadger76 -
Am I misunderstanding how this tests worker runs?
i am not sure i get you here.
when pip installing clearml-agent, it doesn't start any agent. the procedure is that after having installed the package, if there isn't any config file, you do clearml-agent init
and you enter the credentials, which are stored in clearml.conf. If there is a conf file, you simply edit it and manually enter the credentials. so i don't understand what you mean by "remove it"
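for reference, a minimal sketch of the api section that clearml-agent init writes into ~/clearml.conf - the server URLs, ports and keys below are placeholders for your own deployment:
api {
    # point these at your clearml-server (the ports shown are the usual self-hosted defaults)
    web_server: http://<your-server>:8080
    api_server: http://<your-server>:8008
    files_server: http://<your-server>:8081
    credentials {
        # per-user key/secret generated from the ClearML UI
        "access_key" = "<ACCESS_KEY>"
        "secret_key" = "<SECRET_KEY>"
    }
}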
Still trying to understand what this default worker is.
I've removed clearml.conf
and reinstalled clearml-agent
then running clearml-agent list
gets the following error:
Using built-in ClearML default key/secret
clearml_agent: ERROR: Could not find host server definition (missing ~/clearml.conf or Environment CLEARML_API_HOST)
To get started with ClearML: setup your own clearml-server, or create a free account at and run clearml-agent init
Then, after restoring the clearml.conf and running clearml-agent list we get:
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: clearml
  id: clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
  ip: 10.124.0.4
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
  last_activity_time: '2022-07-13T09:37:31.718067+00:00'
  last_report_time: '2022-07-13T09:37:31.718067+00:00'
  queues:
  - id: 74794fe91f70452eb7149c34cc39315a
    name: default
    num_tasks: 0
  register_time: '2022-07-01T23:39:00.733133+00:00'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
how was this worker started? BTW - the api credentials in the clearml.conf are of a specific user (and not a user named tests)
not sure i understand
we are running the daemon in detached mode
clearml-agent daemon --queue <execution_queue_to_pull_from> --detached
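A sketch for reference - giving the daemon an explicit name (my_worker and my_queue are just placeholders) makes it possible to stop that specific worker by id later:
# start a named worker in detached mode
CLEARML_WORKER_NAME=my_worker CLEARML_WORKER_ID=my_worker clearml-agent daemon --queue my_queue --detached
# stop it later by passing the same worker id to --stop
clearml-agent daemon --stop my_worker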
is this running from the same linux user on which you checked the git ssh clone on that machine? The only thing that could account for this issue is somehow the agent is not getting the right info from the ~/.ssh folder
OutrageousSheep60 it looks to me this agent is part of the server's deployment
I think I have a lead.
looking at the list of workers from clearml-agent list
e.g. https://clearml.slack.com/archives/CTK20V944/p1657174280006479?thread_ts=1657117193.653579&cid=CTK20V944
is there a way to find the worker_name?
in the above example the worker_id
is clearml-server-agent-group-cpu-agent-5df4476cfc-j54gh:0
but I'm not able to stop this worker using the command
clearml-agent daemon --stop
since this orphan worker has no corresponding clearml.conf
is this running from the same linux user on which you checked the git ssh clone on that machine?
yes
The only thing that could account for this issue is somehow the agent is not getting the right info from the ~/.ssh folder
maybe -
Question - if we change the clearml.conf
do we need to stop and start the daemon?
Yeah, that's what I was looking for 🙂
hi OutrageousSheep60
sounds like the agent is in reality ... dead. It sounds logical, because you cannot see it using ps
however, it would be worth checking if you can still see it in the UI
using the helm charts
https://github.com/allegroai/clearml-helm-charts
can you try again after having upgraded to 3.6.2?
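a rough sketch of the upgrade, assuming the release is named clearml, lives in the clearml namespace, and uses the allegroai/clearml chart (adjust all of these to your deployment):
# refresh the chart index first
helm repo update
# upgrade the existing release to the requested chart version
helm upgrade clearml allegroai/clearml --version 3.6.2 -n clearml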
Hi SweetBadger76,
Well - apparently I was mistaken.
I still have a ghost worker that I'm not able to remove (I had 2 workers on the same queue - that caused my confusion).
I can see it in the UI and when I run clearml-agent list
And although I'm stopping the worker specifically with clearml-agent daemon --stop <worker_id>
I'm getting Could not find a running clearml-agent instance with worker_name=<worker_id> worker_id=<worker_id>
so running the command clearml-agent -d list
returns the output shown in https://clearml.slack.com/archives/CTK20V944/p1657174280006479?thread_ts=1657117193.653579&cid=CTK20V944