Hey guys, another question about deploying my own Trains server. I have a trains-server deployed on my K8s cluster using the Trains Helm chart (which is awesome). Now I want to create a deployment running trains-agent as specified in the [Trains-Helm Repo

Hey FriendlySquid61 and SuccessfulKoala55. I followed your guidance and am back with the results.
First of all, I changed the host URLs to follow the format of the default agentservices values in the Helm chart.
Now they look like this:
`
agent:
  numberOfTrainsAgents: 1
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "nvidia/cuda"
  agentVersion: ""

  # made the hosts into k8s dns

  trainsApiHost: " "
  trainsWebHost: " "
  trainsFilesHost: " "
  trainsGitUser: null
  trainsGitPassword: null
  trainsAccessKey: null
  trainsSecretKey: null
  awsAccessKeyId: null
  awsSecretAccessKey: null
  awsDefaultRegion: null
  azureStorageAccount: null
  azureStorageKey: null
`

It turns out that this does the same thing as the full k8s DNS names that I wrote before, since the agents are in the same trains namespace as the server. So basically I was just using the long version before. I also reduced the number of agents in the deployment to 1 and ran my manual dummy-agent so that I can control the `trains-agent daemon` call.
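For reference, here is a minimal sketch of why the short and long host forms end up equivalent. It only builds the names Kubernetes DNS normally serves for a Service; the service name `apiserver-service`, the namespace `trains`, the port `8008`, and the `cluster.local` cluster domain are assumptions for illustration, not values taken from the config above.

```python
def service_dns_names(service, namespace, cluster_domain="cluster.local"):
    """Return the usual DNS names for a Kubernetes Service, shortest first.

    The bare service name only resolves from pods in the same namespace;
    the fully qualified form resolves from anywhere in the cluster.
    """
    return [
        service,
        f"{service}.{namespace}",
        f"{service}.{namespace}.svc",
        f"{service}.{namespace}.svc.{cluster_domain}",
    ]


def service_url(service, port, namespace=None, scheme="http"):
    """Build a host URL, using the short name when no namespace is given."""
    host = service if namespace is None else service_dns_names(service, namespace)[-1]
    return f"{scheme}://{host}:{port}"


# Hypothetical names: both URLs point at the same Service when the caller
# runs in the "trains" namespace.
short_url = service_url("apiserver-service", 8008)
long_url = service_url("apiserver-service", 8008, namespace="trains")
print(short_url)
print(long_url)
```

If the agent pods ever move to a different namespace than the server, only the longer forms would keep resolving.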

With this config, the agents still see themselves as connected. When I run `trains-agent list` from my dummy agent, this is what I get:
`
root@dummy-agent:/# trains-agent list
workers:
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-agent-584dfcc6cd-fxvkb:gpuall
  ip: 172.31.15.68
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-584dfcc6cd-fxvkb:gpuall
  last_activity_time: '2020-11-08T12:22:25.157024'
  last_report_time: '2020-11-08T12:22:25.157024'
  queues:
  - id: e3f7b34cbc1f4a0199045d5504b85b18
    name: default
    num_tasks: 0
  register_time: '2020-11-08T12:07:49.649695'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: dummy-agent:gpuall
  ip: 172.31.43.220
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___dummy-agent:gpuall
  last_activity_time: '2020-11-08T12:22:37.414504'
  last_report_time: '2020-11-08T12:22:37.414504'
  queues:
  - id: e3f7b34cbc1f4a0199045d5504b85b18
    name: default
    num_tasks: 0
  register_time: '2020-11-08T12:22:34.382837'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-services
  ip: 172.31.0.170
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-services
  last_activity_time: '2020-11-08T12:22:42.412209'
  last_report_time: '2020-11-08T12:22:42.412209'
  queues:
  - id: a0c0ab0fa2f94186abf265cd376f4530
    name: services
    num_tasks: 0
  register_time: '2020-11-08T12:07:36.447078'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
`

I tried creating a new queue in the UI called oneone. However, when I run the following command, I get the following message:
`
root@dummy-agent:/# TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --dock "nvidia/cuda" --force-current-version --queue oneone

trains_agent: ERROR: Could not find queue with name/id "oneone"
`

It doesn't recognize the queue named oneone. However, if I run the same command with `--queue default` instead, it runs properly, and another process running `trains-agent list` can see it connected (this is what I showed you above).
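To make the listing easier to read, here is a small sketch that groups workers by the queue they listen on. It assumes the `trains-agent list` output has already been parsed into a Python dict (for example with a YAML parser, which is not shown here); the `workers_by_queue` helper is hypothetical, and the sample data is trimmed from the output above.

```python
def workers_by_queue(listing):
    """Map each queue name to its id and the ids of the workers listening on it."""
    result = {}
    for worker in listing.get("workers", []):
        for queue in worker.get("queues", []):
            entry = result.setdefault(
                queue["name"], {"id": queue["id"], "workers": []}
            )
            entry["workers"].append(worker["id"])
    return result


# Trimmed copy of the listing above, keeping only the fields used here.
listing = {
    "workers": [
        {
            "id": "trains-agent-584dfcc6cd-fxvkb:gpuall",
            "queues": [{"id": "e3f7b34cbc1f4a0199045d5504b85b18", "name": "default"}],
        },
        {
            "id": "dummy-agent:gpuall",
            "queues": [{"id": "e3f7b34cbc1f4a0199045d5504b85b18", "name": "default"}],
        },
        {
            "id": "trains-services",
            "queues": [{"id": "a0c0ab0fa2f94186abf265cd376f4530", "name": "services"}],
        },
    ]
}

summary = workers_by_queue(listing)
for name, info in summary.items():
    print(name, info["id"], info["workers"])
```

Note that this only shows queues that at least one worker is attached to, so a missing `oneone` entry here does not by itself say whether the queue exists on the server.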

I also tried to enqueue a task to the default queue, since the agent CLI shows both the agent deployment and my dummy agent listening on the default queue. However, the task I enqueued stays in the pending state.

On a related note, I tried to look at the trains-server API to see how I can get the queue ID instead of the name, but that page in your docs seems to be broken:
https://allegro.ai/docs/references/trains_api_ref/trains_api_ref.html
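In case it helps anyone else looking for the queue ID, below is a rough sketch of how the lookup might be done against the REST API directly. It is based on my understanding that the server exposes an `auth.login` endpoint (HTTP basic auth with the access/secret key, returning a token) and a `queues.get_all` endpoint that accepts a `name` field and answers with `data.queues`; please treat those endpoint names, the request fields, and the response shape as assumptions to verify against the API reference. The host, keys, and token below are placeholders. The code only builds the requests and parses a response body, so nothing is sent until you pass a request to `urllib.request.urlopen`.

```python
import base64
import json
import urllib.request


def login_request(api_host, access_key, secret_key):
    """Build (but do not send) a token request using HTTP basic auth."""
    creds = base64.b64encode(f"{access_key}:{secret_key}".encode()).decode()
    return urllib.request.Request(
        f"{api_host.rstrip('/')}/auth.login",
        data=b"{}",
        headers={"Authorization": f"Basic {creds}", "Content-Type": "application/json"},
        method="POST",
    )


def queues_request(api_host, token, name):
    """Build (but do not send) a queue lookup request filtered by name."""
    return urllib.request.Request(
        f"{api_host.rstrip('/')}/queues.get_all",
        data=json.dumps({"name": name}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def queue_ids(response_body, name):
    """Extract ids of queues whose name matches exactly (assumed response shape)."""
    payload = json.loads(response_body)
    queues = payload.get("data", {}).get("queues", [])
    return [q["id"] for q in queues if q.get("name") == name]


# Placeholder host and token, for illustration only.
req = queues_request("http://apiserver-service:8008", "TOKEN", "oneone")
print(req.get_method(), req.full_url)

# Hand-written response body in the assumed shape, not real server output.
sample = '{"data": {"queues": [{"id": "abc123", "name": "oneone"}]}}'
print(queue_ids(sample, "oneone"))
```

The exact-match filter in `queue_ids` is there because I am not sure whether the server treats `name` as an exact string or as a pattern.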

Let me know what you think, and thanks again for all your help.

  				
Posted 
	4 years ago

					More  		
  Report
		
					ColossalAnt7
				
					0
					 × 1

211 Views

0 Answers

4 years ago

2 years ago