Hi SubstantialElk6
I'm not sure what you are asking 🙂
Basically the clearml-agent will pull a Task from an execution queue and execute it (based on the definition on the Task, i.e. git repo, python packages, docker image etc.)
Are you asking regarding the k8s integration?
(This is not a must, you can run the clearml-agent bare-metal on any OS)
I would like to run ClearML agent on kubernetes. So basically I need to run the image on a pod, but there isn't any information on how the agent would communicate with the code, nor how it would spawn more pods to run the task.
SubstantialElk6 Ohh okay I see.
Let's start with background on how the agent works:
When the agent pulls a job (Task), it will clone the code based on the git credentials available on the host itself, or based on the git_user/git_pass configured in ~/clearml.conf
https://github.com/allegroai/clearml-agent/blob/77d6ff6630e97ec9a322e6d265cd874d0ab00c87/docs/clearml.conf#L18
The agent can work in two modes:
1. Virtual environment mode, where it will create a new venv for each experiment based on the "installed packages" section in the Task. This section is fully requirements.txt compatible. If "installed packages" is empty, it will revert to the requirements.txt from the repo itself.
2. Docker mode, where the agent will spin a docker container (see Task Execution Tab, base docker image), then inside the container it will clone the repository and install the packages based on the "installed packages" section (just like in the venv mode).
Make sense?
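A minimal sketch of the package fallback described above, in case it helps. The function name and its arguments are hypothetical and only illustrate the rule (use the Task's "installed packages", otherwise the repo's requirements.txt); this is not the agent's actual code.

```python
from typing import List


def resolve_requirements(installed_packages: str, repo_requirements: str) -> List[str]:
    """Hypothetical helper: choose which package list would be installed."""

    def parse(text: str) -> List[str]:
        # Keep non-empty lines that are not comments, as in a requirements.txt file
        lines = (ln.strip() for ln in text.splitlines())
        return [ln for ln in lines if ln and not ln.startswith("#")]

    task_packages = parse(installed_packages)
    # Fall back to the repository's requirements.txt only when the Task lists nothing
    return task_packages if task_packages else parse(repo_requirements)
```

For example, resolve_requirements("", "torch\n") would pick the repo list, while a non-empty "installed packages" text always wins.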
Hi, so this means if i want to use Kubernetes, i would have to 'manually' install multiple agents on all the worker nodes?
Hi SubstantialElk6
No need for that, you can use the helm chart (or spin them once with kubectl), then they take care of scheduling by themselves.
You can also use the k8s glue (basically spinning kubernetes pods automatically for you, based on the Tasks that you push into the ClearML queue)
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
In short, two possible deployments
1. Static: a k8s pod running the agent (the agent then runs all the experiments inside the pod or as sibling pods).
2. Dynamic: the k8s-glue pulls Tasks from the ClearML queue, creates a k8s job and sends it to k8s (notice the job itself is the clearml-agent running the specific Task for us, including cloning the code, python packages, arguments etc.).
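To make the dynamic option a bit more concrete, here is a rough sketch of the idea: take Task ids off a queue and wrap each one in a k8s job whose container runs the agent for that single Task. Everything here is an assumption for illustration: the in-memory queue stands in for the ClearML queue, the helper names, job naming and image are made up, and the real glue may launch pods through kubectl rather than build a Job object like this.

```python
from collections import deque
from typing import Deque, Dict, List


def build_job_spec(task_id: str, namespace: str, image: str) -> Dict:
    """Hypothetical: a k8s Job whose container runs the agent for one Task."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": f"clearml-task-{task_id}", "namespace": namespace},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "clearml-task",
                        "image": image,  # assumed image, normally taken from the Task
                        # The agent inside the job clones the code and installs packages
                        "command": ["clearml-agent", "execute", "--id", task_id],
                    }],
                }
            }
        },
    }


def drain_queue(queue: Deque[str], namespace: str, image: str) -> List[Dict]:
    """Pop every pending Task id and return one job spec per Task."""
    jobs = []
    while queue:
        task_id = queue.popleft()
        jobs.append(build_job_spec(task_id, namespace, image))
    return jobs
```

A real loop would keep polling the queue and hand each spec to the cluster instead of returning a list.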
Hi, i tried the k8s-glue on my k8s setup and needed some clarifications on some of the arguments.
1. --queue. Does this only refer to default and service? How can i create a new queue that syncs with the ClearML server?
2. --ports-mode. I'm not sure what ports mode does. The doc says "add a label to the pod which can be used as service". Which pod is it referring to in the first place?
3. All args pertaining to --ports-mode (e.g. base-pod-num, gateway-address, etc.)
4. --overrides-yaml. What is the default yaml?
5. --template-yaml. Do you have a sample of this?
The doc also mentioned preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
Would you have any examples of how to do this?
Hi SubstantialElk6
1. Yes, this is the queue the glue will pull jobs from and push into the k8s. You can create a new queue from the UI (go to the workers&queues page, then to the Queue Tab, and press "create new").
2. Ignore it 🙂 this is if you are using config maps and need TCP routing to your pods.
3. As you noted, these are basically all the arguments you need to pass for (2). Ignore them for the time being.
4. This is the k8s overrides to use if launching the k8s job with kubectl (basically --overrides).
5. If passed, instead of calling kubectl run, you provide a k8s template for kubectl apply.
The doc also mentioned
preconfigured services with selectors in the form of
"ai.allegro.agent.serial=pod-<number>" and a targetPort of 10022.
Unless you need TCP routing to the pods you can ignore this part
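If you do end up needing the TCP routing, a preconfigured service of that form could look roughly like the sketch below. The selector label and the targetPort of 10022 come from the doc text quoted above; the service name, the exposed port and the helper function are assumptions for illustration only.

```python
import json
from typing import Dict


def service_for_pod(pod_number: int, port: int = 10022) -> Dict:
    """Hypothetical helper: a Service that selects the pod labeled pod-<number>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"clearml-pod-{pod_number}"},  # assumed naming
        "spec": {
            # Matches the label the glue adds to the pod in ports mode
            "selector": {"ai.allegro.agent.serial": f"pod-{pod_number}"},
            # "port" is an assumed external port; targetPort is from the doc
            "ports": [{"protocol": "TCP", "port": port, "targetPort": 10022}],
        },
    }


if __name__ == "__main__":
    # kubectl also accepts JSON manifests, so this output can be saved to a file
    print(json.dumps(service_for_pod(1), indent=2))
```

You would create one such service per pod number you expect to use.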
So i kept trying, but i'm stuck on this when i run python k8s_glue_example.py
TypeError: __init__() got an unexpected keyword argument 'base_pod_num'
SubstantialElk6 what's the command line you are using?
This is probably the whole script.
kubectl get nodes
pip install clearml-agent
python k8s_glue_example.py
first line to make sure kubectl is connected to k8s.
python k8s_glue_example.py --help
To get all the commands for configurations
You should probably pass a few :)
For example: examples/k8s_glue_example.py --queue k8s_gpu - --namespace pod-clearml-conf ~/trains.conf --template-yaml example/base.yml
TypeError: __init__() got an unexpected keyword argument 'base_pod_num'
Could you post the entire log?
python k8s_glue_example.py --queue gpu --namespace default
Traceback (most recent call last):
  File "k8s_glue_example.py", line 86, in <module>
    main()
  File "k8s_glue_example.py", line 80, in main
    namespace=args.namespace,
  File "/home/administrator/clearml-agent-k8s/venv/lib/python3.6/site-packages/clearml_agent/helper/base.py", line 239, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'base_pod_num'
SubstantialElk6 I just executed it, and everything seems okay on my machine.
Could you pull the latest clearml-agent from GitHub and try again?
EDIT:
just try to run:
git clone
cd clearml-agent
python examples/k8s_glue_example.py
Hi AgitatedDove14, i've got the same error. It would appear that the code references clearml_agent/helper/base.py which i believe is part of clearml-agent v0.17.1. Could that be the issue?
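One way to check the version-mismatch theory is to inspect whether the installed class constructor accepts the keyword the example script passes. The sketch below uses two stand-in classes with made-up signatures (they are not the real clearml-agent classes); on a real setup you would pass the class imported from the installed package instead.

```python
import inspect


class OldIntegration:
    """Stand-in for an older release whose constructor lacks base_pod_num."""

    def __init__(self, namespace=None):
        self.namespace = namespace


class NewIntegration:
    """Stand-in for a newer release that accepts base_pod_num."""

    def __init__(self, namespace=None, base_pod_num=1):
        self.namespace = namespace
        self.base_pod_num = base_pod_num


def accepts_keyword(cls, name: str) -> bool:
    """Return True if cls.__init__ can take `name` as a keyword argument."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs parameter would swallow any keyword
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True
    return name in params


if __name__ == "__main__":
    for cls in (OldIntegration, NewIntegration):
        print(cls.__name__, accepts_keyword(cls, "base_pod_num"))
```

If the installed class reports False for base_pod_num, the example script is newer than the installed package, which would produce exactly this kind of TypeError.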
Can you run the entire thing on your own machine (just making sure it doesn't give this odd error) ?