From your description, it clearly seems that the agents are connecting to a different Trains Server... This was the whole point of using a different queue name, since every server starts out with the default queue 🙂
Great, let us know how it goes.
Have a great weekend!
The question is, assuming they're all using http://apiserver-service:8008 - can you find out which server they're actually connecting to?
ColossalAnt7 if you enqueue an experiment into the queue the 3 agents are monitoring, does one of the agents start running the experiment? (I think changing the setup to 1 agent will be easier to debug)
Just want to make sure the agents are reporting but for some reason you're missing it 🙂
Just making sure, you changed both the agent one and the agent-services one?
Also, I would try using a new queue (i.e. create a new queue from the UI, and configuring the agent to use that queue) - this will show us for certain if the agent is talking to the right server...
That's great, from that I understand that the trains-services worker does appear in the UI, is that correct? Did the task run? Did you change the trainsApiHost under agentservices in the values.yaml?
Or - which api-server the UI is actually connecting to? 🙂
I also tried port-forwarding the apiserver and discovered that something else was binding localhost:8008.
When I browsed there directly I got the same sort of response that I get by port-forwarding the api server to another port
But with a different server ID.
Turns out that the webserver on my k8s tried to access localhost:8008 and reached the trains server that was deployed on my local machine.
No, I was wrong... 😞 - the error is actually returned from the apiserver
itself, so that works...
FriendlySquid61 I didn't notice that the agent service could be configured. Looking inside the values.yaml in the chart source code I see that the default value is http://apiserver-service:8008. I'll try changing the agent's value to it and see what happens.
SuccessfulKoala55 I'll give that a try too and write back what happens
I think I'll have to continue this on Sunday. Thank you so much for your help so far.
Before, I liked the idea behind trains, but now I also adore the dedicated support.
Have a good weekend
Exactly, the trains service does appear and completed an experiment correctly. I changed it to http://apiserver-service.trains.svc.cluster.local:8008 as mentioned in the beginning of the issue. Here is the output when I try to curl this address from inside the agents in the cluster: https://allegroai-trains.slack.com/archives/CTK20V944/p1604595233210500?thread_ts=1604593483.209000&cid=CTK20V944
Ok, thanks a lot. Now everything makes more sense.
I have some follow-up questions but I think they should go to their own threads 😃
I also made a dummy-agent pod by taking the deployment manifest and changing it so that the created pod sleeps instead of running trains-agent in its main process.
I then ran the installations manually and then ran your command
apt-get update ; apt-get install -y curl python3-pip git ; curl -sSL | sh ; python3 -m pip install -U pip ; python3 -m pip install trains-agent ; TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Here is the output (now it's not complaining about an identical worker id running) but the UI still shows nothing
` Current configuration (trains_agent v0.16.1, location: /root/trains.conf):
agent.worker_id =
agent.worker_name = dummy-agent
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.package_manager.torch_nightly = false
agent.venvs_dir = /root/.trains/venvs-builds.1
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.trains/vcs-cache.1
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.trains/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.trains/pip-cache
agent.docker_apt_cache = /root/.trains/apt-cache.1
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-runtime-ubuntu18.04
agent.default_python = 3.8
agent.cuda_version = 111
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server =
sdk.storage.cache.default_base_dir = ~/.trains/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false
Worker "dummy-agent:gpuall" - Listening to queues:
+----------------------------------+---------+-------+
| id | name | tags |
+----------------------------------+---------+-------+
| 3b853e1c7c864789ac9f4cf55312348d | default | |
+----------------------------------+---------+-------+
Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda:10.1-runtime-ubuntu18.04 running python3
Failed creating temporary copy of ~/.ssh for git credential
Running TRAINS-AGENT daemon in background mode, writing stdout/stderr to /tmp/.trains_agent_daemon_outrhomu1ms.txt `
ColossalAnt7 can you try connecting to one of the trains-agent pods and run trains-agent manually using the following command:
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version
Then let us know what happens and if you see the new worker in the UI
And from what I understand, the static webapp also needs to talk to the fileserver, and perhaps other components of the stack like the elasticsearch deployment or mongodb, from my browser?
I can reach the UI from the browser (I'm using kubectl port-forward if that makes any difference).
When I attach to one of the pods and run the command, I get the following output
ColossalAnt7 can you reach the UI from your browser?
When you open the UI, do you see any projects there?
Well, it's not the webserver that has to access the apiserver when you open the UI, it's your browser... The webserver component only serves the single page app to the browser, but it's the browser on which the single page app is running and from there it needs to communicate with the apiserver...
Ohh, I see. I thought that the webserver was a slim full-stack app that passes the requests to the api server and gets the responses back.
So I guess that the api address it looks for is the same as what my browser thinks the webserver's URL is, except with an api. prefix (unless localhost) and a different port?
Yeah, I see all 3 projects that were there by default. I cloned an experiment in one of them to see it run on the services queue
Exactly - either a port if you're using http or prefix if you're using https, but not both
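A tiny sketch of that rule in shell (my own illustration of the explanation above, not the actual webapp code — the function name and the https branch behavior are assumptions):

```shell
# Hypothetical helper illustrating the rule above:
#   http  -> same host, apiserver port 8008 instead of the web port
#   https -> "api." prefix on the host, port unchanged
derive_api_url() {
  case "$1" in
    https://*) echo "$1" | sed -E 's#^https://#https://api.#' ;;
    http://*)  echo "$1" | sed -E 's#^(http://[^:/]+)(:[0-9]+)?#\1:8008#' ;;
  esac
}

derive_api_url "http://localhost:9999"        # -> http://localhost:8008
derive_api_url "https://trains.example.com"   # -> https://api.trains.example.com
```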
Now that I shut down my local trains server, I run two port-forwards:
kubectl port-forward -n trains svc/webserver-service 9999:80
kubectl port-forward -n trains svc/apiserver-service 8008:8008
I can see my agents
But if I turn off the api server port-forward I get an empty web server.
It is really weird to me that the webserver running on k8s needs my localhost connection to the API server even though I gave the api-server URL to the helm chart.
Is there anything I can do to change that?
Interesting. Here is what led me to believe that the server I get in the UI and the server that I connect to inside the agents are the same:
In the agent, I looked at the TRAINS_API_HOST and then ran nslookup on the URL. Same with the TRAINS_WEB_HOST.
` root@trains-agent-584dfcc6cd-fxvkb:/# echo $TRAINS_API_HOST
root@trains-agent-584dfcc6cd-fxvkb:/# nslookup apiserver-service
Server: 10.100.0.10
Address: 10.100.0.10#53
Name: apiserver-service.trains.svc.cluster.local
Address: 10.100.138.234
root@trains-agent-584dfcc6cd-fxvkb:/# echo $TRAINS_WEB_HOST
root@trains-agent-584dfcc6cd-fxvkb:/# nslookup webserver-service
Server: 10.100.0.10
Address: 10.100.0.10#53
Name: webserver-service.trains.svc.cluster.local
Address: 10.100.129.136 `
When I look for all services in the trains namespace, this is what I get:
` ❯ kubectl get svc -n trains
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
apiserver-service NodePort 10.100.138.234 <none> 8008:30008/TCP 138m
elasticsearch-service ClusterIP 10.100.28.19 <none> 9200/TCP 138m
fileserver-service NodePort 10.100.100.119 <none> 8081:30081/TCP 138m
mongo-service ClusterIP 10.100.216.254 <none> 27017/TCP 138m
redis ClusterIP 10.100.70.180 <none> 6379/TCP 138m
webserver-service NodePort 10.100.129.136 <none> 80:30080/TCP 138m `
In both cases, we see that the apiserver is at 10.100.138.234 and the webserver is at 10.100.129.136.
The only way I connected to the UI is by running the following command:
kubectl port-forward -n trains svc/webserver-service 9999:80
and then browsing to http://localhost:9999/
There is one thing that seems weird. When I try to get a credentials config from the UI, here is what it says:
` ~/trains.conf
api {
    web_server:
    api_server:
    credentials { "access_key" = "..." "secret_key" = "..." }
} `
Hey FriendlySquid61 and SuccessfulKoala55. I followed your guidance and am back with the results.
First of all, I changed the host URLs to follow the format of the default agentservices values in the helm chart.
Now they look like this:
` agent:
numberOfTrainsAgents: 1
nvidiaGpusPerAgent: 0
defaultBaseDocker: "nvidia/cuda"
agentVersion: ""
# made the hosts into k8s dns
trainsApiHost: " "
trainsWebHost: " "
trainsFilesHost: " "
trainsGitUser: null
trainsGitPassword: null
trainsAccessKey: null
trainsSecretKey: null
awsAccessKeyId: null
awsSecretAccessKey: null
awsDefaultRegion: null
azureStorageAccount: null
azureStorageKey: null `
Turns out that this does the same thing as the full k8s dns that I wrote, since the agents are in the same trains workspace as the server. So basically I just used the long version before. I also reduced the number of agents in the deployment to 1 and ran my manual dummy-agent so that I can control the trains-agent daemon call.
With this config, the agents still see themselves as connected. When I run trains-agent list from my dummy agent, this is what I get:
` root@dummy-agent:/# trains-agent list
workers:
- company:
id: d1bd92a3b039400cbafc60a7a5b1e52b
name: trains
id: trains-agent-584dfcc6cd-fxvkb:gpuall
ip: 172.31.15.68
key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-584dfcc6cd-fxvkb:gpuall
last_activity_time: '2020-11-08T12:22:25.157024'
last_report_time: '2020-11-08T12:22:25.157024'
queues:
- id: e3f7b34cbc1f4a0199045d5504b85b18
name: default
num_tasks: 0
register_time: '2020-11-08T12:07:49.649695'
register_timeout: 600
tags: []
user:
id: tests
name: tests
- id: e3f7b34cbc1f4a0199045d5504b85b18
- company:
id: d1bd92a3b039400cbafc60a7a5b1e52b
name: trains
id: dummy-agent:gpuall
ip: 172.31.43.220
key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___dummy-agent:gpuall
last_activity_time: '2020-11-08T12:22:37.414504'
last_report_time: '2020-11-08T12:22:37.414504'
queues:
- id: e3f7b34cbc1f4a0199045d5504b85b18
name: default
num_tasks: 0
register_time: '2020-11-08T12:22:34.382837'
register_timeout: 600
tags: []
user:
id: tests
name: tests
- id: e3f7b34cbc1f4a0199045d5504b85b18
- company:
id: d1bd92a3b039400cbafc60a7a5b1e52b
name: trains
id: trains-services
ip: 172.31.0.170
key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-services
last_activity_time: '2020-11-08T12:22:42.412209'
last_report_time: '2020-11-08T12:22:42.412209'
queues:
- id: a0c0ab0fa2f94186abf265cd376f4530
name: services
num_tasks: 0
register_time: '2020-11-08T12:07:36.447078'
register_timeout: 600
tags: []
user:
id: tests
name: tests `
I tried creating a new queue in the UI called oneone. However, when I run the following command I get the following message:
` root@dummy-agent:/# TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker "nvidia/cuda" --force-current-version --queue oneone
- id: a0c0ab0fa2f94186abf265cd376f4530
trains_agent: ERROR: Could not find queue with name/id "oneone" `
It doesn't recognize the queue named oneone. However, if I run the same command with --queue default instead, it runs properly, and another process running trains-agent list can see it connected (this is what I showed you above).
I also tried to enqueue a task to the default queue, since both the agent deployment and my dummy agent are shown in the agent CLI to be listening to the default queue. However, the task I enqueued stays in the pending state.
On a related note, I tried to look at the trains-server API to see how I can get the queue id instead of the name, but that page in your docs seems to be broken:
https://allegro.ai/docs/references/trains_api_ref/trains_api_ref.html
Let me know what you think, and thanks again for all your help.
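In the meantime, the queue ids are already visible in the trains-agent list output above; a small awk helper can pull the id for a given queue name out of that output (my own sketch, not part of trains-agent, assuming the `- id:` line precedes its matching `name:` line as in the listing):

```shell
# Hypothetical helper (not an official tool): extract the queue id for a given
# queue name from `trains-agent list` output, assuming each "- id: <id>" line
# comes before its matching "name: <name>" line, as in the listing above.
queue_id_for() {
  awk -v q="$1" '
    /- id:/ { id = $NF }                       # remember the most recent id
    $1 == "name:" && $2 == q { print id; exit }
  '
}

# Usage (from inside an agent pod):
#   trains-agent list | queue_id_for services
```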