
Hey guys, another question about deploying my own trains server.
I have a trains-server deployed on my k8s cluster using the trains Helm chart (which is awesome).
Now I want to create a deployment running trains-agent, as specified in the trains-helm repo.

However, I do not want to create external DNS records for the agents to point to; I would rather use k8s' internal DNS mechanism. So I upgraded my Helm release with the following values:
```
agent:
  numberOfTrainsAgents: 3
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "nvidia/cuda"
  agentVersion: ""
  # made the hosts into k8s dns
  trainsApiHost: " "
  trainsWebHost: " "
  trainsFilesHost: " "
  trainsGitUser: null
  trainsGitPassword: null
  trainsAccessKey: null
  trainsSecretKey: null
  awsAccessKeyId: null
  awsSecretAccessKey: null
  awsDefaultRegion: null
  azureStorageAccount: null
  azureStorageKey: null
```
I attached to the workers and made sure that the DNS names in these URLs resolve correctly inside my cluster. The agents seem to be running correctly (log attached to the thread).
However, the agents don't show up in the agents & queues section of the web server.
Does anyone know why this is happening or how to get to the bottom of it?

**Edit**:
I ran `trains-agent list` inside one of the agent containers and it seemed to recognize all 4 agents (the agent service and the 3 normal ones).
Here is the output:
```
workers:
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-agent-76b688794-2g7dp:gpuall
  ip: 172.31.8.243
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-76b688794-2g7dp:gpuall
  last_activity_time: '2020-11-05T16:28:12.451448'
  last_report_time: '2020-11-05T16:28:12.451448'
  queues:
  - id: 3b853e1c7c864789ac9f4cf55312348d
    name: default
    num_tasks: 0
  register_time: '2020-11-05T16:11:36.363516'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-agent-76b688794-px7kk:gpuall
  ip: 172.31.15.14
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-76b688794-px7kk:gpuall
  last_activity_time: '2020-11-05T16:28:21.336209'
  last_report_time: '2020-11-05T16:28:21.336209'
  queues:
  - id: 3b853e1c7c864789ac9f4cf55312348d
    name: default
    num_tasks: 0
  register_time: '2020-11-05T16:11:45.289150'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-agent-76b688794-r6n9q:gpuall
  ip: 172.31.1.191
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-76b688794-r6n9q:gpuall
  last_activity_time: '2020-11-05T16:28:12.551249'
  last_report_time: '2020-11-05T16:28:12.551249'
  queues:
  - id: 3b853e1c7c864789ac9f4cf55312348d
    name: default
    num_tasks: 0
  register_time: '2020-11-05T16:11:36.538503'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-services
  ip: 172.31.21.231
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-services
  last_activity_time: '2020-11-05T16:28:07.887185'
  last_report_time: '2020-11-05T16:28:07.887185'
  queues:
  - id: ed3b1b330f404738b10dcba2174f97eb
    name: services
    num_tasks: 0
  register_time: '2020-11-05T16:11:01.842876'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
```

  				
Posted 5 years ago by ColossalAnt7


Answers 29

Just want to make sure the agents are reporting but for some reason you're missing it 🙂

  				
Posted 5 years ago by SuccessfulKoala55

Exactly, the trains service does appear and completed an experiment correctly. I changed it to http://apiserver-service.trains.svc.cluster.local:8008 as mentioned at the beginning of the issue. Here is the output when I try to curl this address from inside the agents in the cluster: https://allegroai-trains.slack.com/archives/CTK20V944/p1604595233210500?thread_ts=1604593483.209000&cid=CTK20V944

  				
Posted 5 years ago by ColossalAnt7

Well, it's not the webserver that has to access the apiserver when you open the UI, it's your browser... The webserver component only serves the single page app to the browser, but it's the browser on which the single page app is running and from there it needs to communicate with the apiserver...

  				
Posted 5 years ago by SuccessfulKoala55

FriendlySquid61 I didn't notice that the agent service could be configured. Looking inside the values.yml in the chart source code, I see that the default value is http://apiserver-service:8008. I'll try changing the agents' value to it and see what happens.
SuccessfulKoala55 I'll give that a try too and write back with what happens.

I think I'll have to continue this on Sunday. Thank you so much for your help so far.
I already liked the idea behind trains, but now I also adore the dedicated support.

Have a good weekend!

  				
Posted 5 years ago by ColossalAnt7

Great, let us know how it goes.
Have a great weekend!

  				
Posted 5 years ago by FriendlySquid61

Now that I have shut down my local trains server and run two port forwards:
`kubectl port-forward -n trains svc/webserver-service 9999:80`
`kubectl port-forward -n trains svc/apiserver-service 8008:8008`
I can see my agents.
But if I turn off the API server port forward, I get an empty web UI.
It is really weird to me that the webserver running on k8s needs my localhost connection to the API server, even though I gave the API server URL to the Helm chart.

Is there anything I can do to change that?

  				
Posted 5 years ago by ColossalAnt7

OK, thanks a lot. Now everything makes more sense.
I have some follow-up questions, but I think they should go in their own threads 😃

  				
Posted 5 years ago by ColossalAnt7

Ohh, I see. I thought that the webserver was a slim full-stack app that passes the requests to the API server and gets the responses back.
So I guess that the API address it looks for is the same as what my browser thinks the webserver's URL is, except with an api. prefix (unless localhost) and a different port?

  				
Posted 5 years ago by ColossalAnt7

Interesting. Here is what led me to believe that the server I get in the UI and the server that I connect to inside the agents are the same:
In the agent, I looked at TRAINS_API_HOST and then ran nslookup on the URL. Same with TRAINS_WEB_HOST.
```
root@trains-agent-584dfcc6cd-fxvkb:/# echo $TRAINS_API_HOST

root@trains-agent-584dfcc6cd-fxvkb:/# nslookup apiserver-service
Server: 10.100.0.10
Address: 10.100.0.10#53

Name: apiserver-service.trains.svc.cluster.local
Address: 10.100.138.234

root@trains-agent-584dfcc6cd-fxvkb:/# echo $TRAINS_WEB_HOST

root@trains-agent-584dfcc6cd-fxvkb:/# nslookup webserver-service
Server: 10.100.0.10
Address: 10.100.0.10#53

Name: webserver-service.trains.svc.cluster.local
Address: 10.100.129.136
```
When I look for all services in the trains namespace, this is what I get:
```
❯ kubectl get svc -n trains
NAME                    TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
apiserver-service       NodePort    10.100.138.234   <none>        8008:30008/TCP   138m
elasticsearch-service   ClusterIP   10.100.28.19     <none>        9200/TCP         138m
fileserver-service      NodePort    10.100.100.119   <none>        8081:30081/TCP   138m
mongo-service           ClusterIP   10.100.216.254   <none>        27017/TCP        138m
redis                   ClusterIP   10.100.70.180    <none>        6379/TCP         138m
webserver-service       NodePort    10.100.129.136   <none>        80:30080/TCP     138m
```
In both cases, we see that the apiserver is at 10.100.138.234 and the webserver is at 10.100.129.136.

The only way I connected to the UI is by running the following command:
`kubectl port-forward -n trains svc/webserver-service 9999:80`
and then browsing to http://localhost:9999/

There is one thing that seems weird. When I try to get a credentials config from the UI, here is what it says:
~/trains.conf
```
api {
  web_server:
  api_server:
  credentials {
    "access_key" = "..."
    "secret_key" = "..."
  }
}
```

  				
Posted 5 years ago by ColossalAnt7

ColossalAnt7 if you enqueue an experiment into the queue the 3 agents are monitoring, does one of the agents start running the experiment? (I think changing the setup to 1 agent will make it easier to debug.)

  				
Posted 5 years ago by SuccessfulKoala55

ColossalAnt7 can you try connecting to one of the trains-agent pods and running trains-agent manually with the following command:
`TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version`
Then let us know what happens and whether you see the new worker in the UI.

  				
Posted 5 years ago by FriendlySquid61

Hey FriendlySquid61 and SuccessfulKoala55. I followed your guidance and am back with the results.
First of all, I changed the host URLs to follow the format of the default agentservices values in the Helm chart.
Now they look like this:
```
agent:
  numberOfTrainsAgents: 1
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "nvidia/cuda"
  agentVersion: ""
  # made the hosts into k8s dns
  trainsApiHost: " "
  trainsWebHost: " "
  trainsFilesHost: " "
  trainsGitUser: null
  trainsGitPassword: null
  trainsAccessKey: null
  trainsSecretKey: null
  awsAccessKeyId: null
  awsSecretAccessKey: null
  awsDefaultRegion: null
  azureStorageAccount: null
  azureStorageKey: null
```
It turns out that this does the same thing as the full k8s DNS names that I wrote before, since the agents are in the same trains namespace as the server. So basically I just used the long version before. I also reduced the number of agents in the deployment to 1 and ran my manual dummy-agent so that I can control the `trains-agent daemon` call.
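As an aside on why the short and the long host names behave the same here: a pod's resolver appends a list of search domains to short names. The following is a minimal Python sketch of that expansion, assuming the default `ClusterFirst` DNS policy, the `cluster.local` cluster domain and `ndots:5`. `expand_service_name` is a hypothetical helper for illustration, not part of trains or Kubernetes, and it ignores any extra search domains inherited from the node.

```python
from typing import List


def expand_service_name(
    name: str,
    namespace: str,
    cluster_domain: str = "cluster.local",  # assumption: default cluster domain
    ndots: int = 5,                          # assumption: default pod resolv.conf option
) -> List[str]:
    """Return the names a pod's resolver would try for `name`, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]

    search = [
        f"{namespace}.svc.{cluster_domain}",
        f"svc.{cluster_domain}",
        cluster_domain,
    ]
    expanded = [f"{name}.{suffix}" for suffix in search]

    if name.count(".") >= ndots:
        # Enough dots: the literal name is tried first, then the search list.
        return [name] + expanded
    # Fewer dots than ndots: the search list is tried first, literal name last.
    return expanded + [name]


if __name__ == "__main__":
    for candidate in expand_service_name("apiserver-service", "trains"):
        print(candidate)
```

Under these assumptions, `apiserver-service` queried from a pod in the `trains` namespace is first tried as `apiserver-service.trains.svc.cluster.local`, which matches the `Name:` line in the nslookup output earlier in the thread.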

With this config, the agents still see themselves as connected. When I run `trains-agent list` from my dummy agent, this is what I get:
```
root@dummy-agent:/# trains-agent list
workers:
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-agent-584dfcc6cd-fxvkb:gpuall
  ip: 172.31.15.68
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-584dfcc6cd-fxvkb:gpuall
  last_activity_time: '2020-11-08T12:22:25.157024'
  last_report_time: '2020-11-08T12:22:25.157024'
  queues:
  - id: e3f7b34cbc1f4a0199045d5504b85b18
    name: default
    num_tasks: 0
  register_time: '2020-11-08T12:07:49.649695'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: dummy-agent:gpuall
  ip: 172.31.43.220
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___dummy-agent:gpuall
  last_activity_time: '2020-11-08T12:22:37.414504'
  last_report_time: '2020-11-08T12:22:37.414504'
  queues:
  - id: e3f7b34cbc1f4a0199045d5504b85b18
    name: default
    num_tasks: 0
  register_time: '2020-11-08T12:22:34.382837'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-services
  ip: 172.31.0.170
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-services
  last_activity_time: '2020-11-08T12:22:42.412209'
  last_report_time: '2020-11-08T12:22:42.412209'
  queues:
  - id: a0c0ab0fa2f94186abf265cd376f4530
    name: services
    num_tasks: 0
  register_time: '2020-11-08T12:07:36.447078'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
```
I tried creating a new queue in the UI called oneone. However, when I run the following command, I get the following message:
```
root@dummy-agent:/# TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --dock "nvidia/cuda" --force-current-version --queue oneone

trains_agent: ERROR: Could not find queue with name/id "oneone"
```
It doesn't recognize the queue named oneone. However, if I run the same command with `--queue default` instead, it runs properly, and another process running `trains-agent list` can see it connected (this is what I showed you above).

I also tried to enqueue a task to the default queue, since the agent CLI shows both the agent deployment and my dummy agent listening to the default queue. However, the task I enqueued stays in the pending state.

On a related note, I tried to look at the trains-server API to see how I can get the queue ID instead of the name, but that page in your docs seems to be broken:
https://allegro.ai/docs/references/trains_api_ref/trains_api_ref.html

Let me know what you think, and thanks again for all your help.

  				
Posted 5 years ago by ColossalAnt7

When you open the UI, do you see any projects there?

  				
Posted 5 years ago by FriendlySquid61

ColossalAnt7 can you reach the UI from your browser?

  				
Posted 5 years ago by SuccessfulKoala55

Exactly - either a port if you're using http or prefix if you're using https, but not both

  				
Posted 5 years ago by SuccessfulKoala55
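The rule discussed in the two posts above (a different port for plain http, an `api.` prefix for https, never both) can be sketched as follows. This is only an illustration of the behaviour as described in this thread, not the web app's actual code: `guess_api_url` is a hypothetical name, port 8008 is taken from the `apiserver-service` port shown earlier, and swapping a leading `app.` label for `api.` is an assumption about how prefixed host names are laid out.

```python
from urllib.parse import urlsplit

API_PORT = 8008  # port of apiserver-service in this thread


def guess_api_url(web_url: str) -> str:
    """Guess which API server address a browser at `web_url` would call.

    Hypothetical sketch of the rule described in the thread:
    - http:  same host, API port, no prefix
    - https: api. prefix on the host, no explicit port
    """
    parts = urlsplit(web_url)
    host = parts.hostname or ""

    if parts.scheme == "https":
        if host.startswith("app."):
            # Assumption: the web UI lives on app.<domain>, the API on api.<domain>.
            api_host = "api." + host[len("app."):]
        else:
            api_host = "api." + host
        return f"https://{api_host}"

    # Plain http (including localhost): keep the host, switch to the API port.
    return f"http://{host}:{API_PORT}"


if __name__ == "__main__":
    # The UI was reached through a port-forward on localhost:9999.
    print(guess_api_url("http://localhost:9999/"))
```

Under this reading, a browser pointed at `http://localhost:9999/` ends up calling `http://localhost:8008`, which is why the UI content depends on whatever is listening on that local port.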

That's great, from that I understand that the trains-services worker does appear in the UI, is it correct? Did the task run? Did you change the trainsApiHost under agentservices in the values.yaml?

  				
Posted 5 years ago by FriendlySquid61

The question is, assuming they're all using http://apiserver-service:8008 - can you find out which server they're actually connecting to?

  				
Posted 5 years ago by SuccessfulKoala55

Or - which api-server the UI is actually connecting to? 🙂

  				
Posted 5 years ago by FriendlySquid61

From your description, it clearly seems that the agents are connecting to a different Trains Server... This was the whole point of using a different queue name, since every server starts out with the default queue 🙂

  				
Posted 5 years ago by SuccessfulKoala55

No, I was wrong... 😞 - the error is actually returned from the apiserver itself, so that works...

  				
Posted 5 years ago by SuccessfulKoala55

And from what I understand, the static web app also needs to talk, from my browser, to the fileserver and perhaps other components of the stack, like the Elasticsearch deployment or MongoDB?

  				
Posted 5 years ago by ColossalAnt7

Apiserver and fileserver, that's it 🙂

  				
Posted 5 years ago by SuccessfulKoala55

Yeah, I see all 3 projects that were there by default. I cloned an experiment on one of them to see it run on the service queue

  				
Posted 5 years ago by ColossalAnt7

I'll keep looking 🙂

  				
Posted 5 years ago by SuccessfulKoala55

I also tried port-forwarding the apiserver and discovered that something else was already bound to localhost:8008.
When I browsed there directly, I got the same sort of response that I get by port-forwarding the API server to another port,
but with a different server ID.

It turns out that the webserver on my k8s cluster tried to access localhost:8008 and reached the trains server that was deployed on my local machine.

  				
Posted 5 years ago by ColossalAnt7
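A quick way to catch the situation described above before starting a port forward is to check whether something is already listening on the local port. The following is a minimal sketch using only the Python standard library; `port_in_use` is a hypothetical helper, and the ports 8008 and 9999 are simply the ones used in this thread.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP listener already accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise,
        # instead of raising for ordinary connection failures.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for local_port in (8008, 9999):
        state = "already in use" if port_in_use(local_port) else "free"
        print(f"localhost:{local_port} is {state}")
```

If a port is reported as in use before any `kubectl port-forward` has been started, some other process (for example a locally deployed trains server) is answering there, and the browser will talk to that process.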

I also made a dummy-agent pod by taking the deployment manifest and changing it so that the pod sleeps instead of running trains-agent as its main process.
I then did the installations manually and ran your command:
`apt-get update ; apt-get install -y curl python3-pip git; curl -sSL` `| sh ; python3 -m pip install -U pip ; python3 -m pip install trains-agent ; TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --force-current-version`
Here is the output (now it's not complaining about an identical worker ID running), but the UI still shows nothing:
` Current configuration (trains_agent v0.16.1, location: /root/trains.conf):

agent.worker_id =
agent.worker_name = dummy-agent
agent.force_git_ssh_protocol = false
agent.python_binary =
agent.package_manager.type = pip
agent.package_manager.pip_version = <20.2
agent.package_manager.system_site_packages = false
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = defaults
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = pytorch
agent.package_manager.torch_nightly = false
agent.venvs_dir = /root/.trains/venvs-builds.1
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /root/.trains/vcs-cache.1
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /root/.trains/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /root/.trains/pip-cache
agent.docker_apt_cache = /root/.trains/apt-cache.1
agent.docker_force_pull = false
agent.default_docker.image = nvidia/cuda:10.1-runtime-ubuntu18.04
agent.default_python = 3.8
agent.cuda_version = 111
agent.cudnn_version = 0
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server =
sdk.storage.cache.default_base_dir = ~/.trains/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = false
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key =
sdk.aws.s3.region =
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true
sdk.development.default_output_uri =
sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Running in Docker mode (v19.03 and above) - using default docker image: nvidia/cuda:10.1-runtime-ubuntu18.04 running python3

Failed creating temporary copy of ~/.ssh for git credential
Running TRAINS-AGENT daemon in background mode, writing stdout/stderr to /tmp/.trains_agent_daemon_outrhomu1ms.txt `

  				
Posted 5 years ago by ColossalAnt7

Also, I would try using a new queue (i.e. create a new queue from the UI, and configuring the agent to use that queue) - this will show us for certain if the agent is talking to the right server...

  				
Posted 5 years ago by SuccessfulKoala55
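A complementary check to the new-queue test suggested above is to compare queue IDs: they are generated per server, so if the ID an agent reports for a queue differs from the ID the UI's server has for the same queue name, the two are not talking to the same server. The following is a minimal sketch that pulls queue names and IDs out of saved `trains-agent list` text. `queues_seen_by_agent` is a hypothetical helper, and the regular expression assumes the `- id:` / `name:` layout shown in the outputs in this thread.

```python
import re
from typing import Dict

# Matches a queue entry of the form:
#   - id: <queue id>
#     name: <queue name>
QUEUE_RE = re.compile(r"-\s+id:\s*(?P<id>\S+)\s*\n\s*name:\s*(?P<name>\S+)")


def queues_seen_by_agent(list_output: str) -> Dict[str, str]:
    """Map queue name -> queue id as reported in `trains-agent list` output."""
    return {m.group("name"): m.group("id") for m in QUEUE_RE.finditer(list_output)}


if __name__ == "__main__":
    # Excerpt copied from the output posted in this thread.
    sample = (
        "queues:\n"
        "- id: e3f7b34cbc1f4a0199045d5504b85b18\n"
        "  name: default\n"
        "  num_tasks: 0\n"
    )
    for name, queue_id in queues_seen_by_agent(sample).items():
        print(f"{name}: {queue_id}")
```

The IDs printed this way can then be compared with whatever ID the server behind the UI reports for the queue of the same name.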

Just making sure, you changed both the agent one and the agent-services one?

  				
Posted 5 years ago by FriendlySquid61

I can reach the UI from the browser (I'm using kubectl port-forward, if that makes any difference).

When I attach to one of the pods and run the command, I get the following output

  				
Posted 5 years ago by ColossalAnt7
