Hey FriendlySquid61 and SuccessfulKoala55. I followed your guidance and am back with the results.
First of all, I changed the host URLs to follow the format of the default agentservices values in the Helm chart.
Now they look like this:
```
agent:
  numberOfTrainsAgents: 1
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "nvidia/cuda"
  agentVersion: ""
  # made the hosts into k8s dns
  trainsApiHost: " "
  trainsWebHost: " "
  trainsFilesHost: " "
  trainsGitUser: null
  trainsGitPassword: null
  trainsAccessKey: null
  trainsSecretKey: null
  awsAccessKeyId: null
  awsSecretAccessKey: null
  awsDefaultRegion: null
  azureStorageAccount: null
  azureStorageKey: null
```
It turns out that this does the same thing as the full k8s DNS names I wrote before, since the agents are in the same trains workspace as the server. So basically I was just using the long version before. I also reduced the number of agents in the deployment to 1 and ran my manual dummy-agent so that I can control the `trains-agent daemon` call.
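To spell out what I mean by the short and the long form of a host, here is a minimal sketch of how I understand Kubernetes service DNS names. The service name `trains-apiserver`, the namespace `trains`, and the port `8008` are placeholders I made up for illustration, not the values from my chart; `cluster.local` is the usual default cluster domain, which I am assuming here.

```python
def service_dns_names(service: str, namespace: str,
                      cluster_domain: str = "cluster.local") -> dict:
    """Return the short and fully qualified DNS names of a Kubernetes Service.

    The short name only resolves from pods in the same namespace as the
    Service; the full name resolves from anywhere in the cluster.
    """
    return {
        "short": service,
        "full": f"{service}.{namespace}.svc.{cluster_domain}",
    }


def host_url(dns_name: str, port: int, scheme: str = "http") -> str:
    """Build a host URL of the kind the agent values expect."""
    return f"{scheme}://{dns_name}:{port}"


# Hypothetical service, namespace, and port, for illustration only.
names = service_dns_names("trains-apiserver", "trains")
short_url = host_url(names["short"], 8008)
full_url = host_url(names["full"], 8008)
```

If the agent pods and the server live in the same namespace, both URLs should reach the same Service, which matches what I saw.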
With this config, the agents still see themselves as connected. When I run `trains-agent list` from my dummy agent, this is what I get:
```
root@dummy-agent:/# trains-agent list
workers:
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-agent-584dfcc6cd-fxvkb:gpuall
  ip: 172.31.15.68
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-agent-584dfcc6cd-fxvkb:gpuall
  last_activity_time: '2020-11-08T12:22:25.157024'
  last_report_time: '2020-11-08T12:22:25.157024'
  queues:
  - id: e3f7b34cbc1f4a0199045d5504b85b18
    name: default
    num_tasks: 0
  register_time: '2020-11-08T12:07:49.649695'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: dummy-agent:gpuall
  ip: 172.31.43.220
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___dummy-agent:gpuall
  last_activity_time: '2020-11-08T12:22:37.414504'
  last_report_time: '2020-11-08T12:22:37.414504'
  queues:
  - id: e3f7b34cbc1f4a0199045d5504b85b18
    name: default
    num_tasks: 0
  register_time: '2020-11-08T12:22:34.382837'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
- company:
    id: d1bd92a3b039400cbafc60a7a5b1e52b
    name: trains
  id: trains-services
  ip: 172.31.0.170
  key: worker_d1bd92a3b039400cbafc60a7a5b1e52b___tests___trains-services
  last_activity_time: '2020-11-08T12:22:42.412209'
  last_report_time: '2020-11-08T12:22:42.412209'
  queues:
  - id: a0c0ab0fa2f94186abf265cd376f4530
    name: services
    num_tasks: 0
  register_time: '2020-11-08T12:07:36.447078'
  register_timeout: 600
  tags: []
  user:
    id: tests
    name: tests
```
I tried creating a new queue in the UI called `oneone`. However, when I run the following command, I get this error:
```
root@dummy-agent:/# TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --dock "nvidia/cuda" --force-current-version --queue oneone
trains_agent: ERROR: Could not find queue with name/id "oneone"
```
It doesn't recognize the queue named `oneone`. However, if I run the same command with `--queue default` instead, it runs properly, and another process running `trains-agent list` can see it connected (this is what I showed you above).
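As a stopgap for finding queue IDs, here is a minimal sketch that pulls them out of the worker listing above. I copied only the `id` and `queues` fields into Python data by hand; the helper names `queue_ids_by_name` and `workers_on_queue` are my own and not part of trains. It only covers queues that at least one worker is listening to, so it cannot tell me anything about `oneone`.

```python
from typing import Dict, List

# Trimmed, hand-copied version of the `trains-agent list` output above.
WORKERS = [
    {
        "id": "trains-agent-584dfcc6cd-fxvkb:gpuall",
        "queues": [{"id": "e3f7b34cbc1f4a0199045d5504b85b18", "name": "default"}],
    },
    {
        "id": "dummy-agent:gpuall",
        "queues": [{"id": "e3f7b34cbc1f4a0199045d5504b85b18", "name": "default"}],
    },
    {
        "id": "trains-services",
        "queues": [{"id": "a0c0ab0fa2f94186abf265cd376f4530", "name": "services"}],
    },
]


def queue_ids_by_name(workers: List[dict]) -> Dict[str, str]:
    """Map each queue name seen in the listing to its queue ID."""
    ids: Dict[str, str] = {}
    for worker in workers:
        for queue in worker.get("queues", []):
            ids[queue["name"]] = queue["id"]
    return ids


def workers_on_queue(workers: List[dict], queue_name: str) -> List[str]:
    """Return the IDs of the workers that listen to the given queue."""
    return [
        worker["id"]
        for worker in workers
        if any(q["name"] == queue_name for q in worker.get("queues", []))
    ]


queue_ids = queue_ids_by_name(WORKERS)
default_listeners = workers_on_queue(WORKERS, "default")
```

According to this, both the deployed agent and my dummy agent are attached to the same `default` queue ID, and `oneone` does not show up anywhere.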
I also tried to enqueue a task to the default queue, since the agent CLI shows both the agent deployment and my dummy agent listening to the default queue. However, the task I enqueued stays in the pending state.
On a related note, I tried to look at the trains-server API reference to see how I can get a queue's ID instead of its name, but that page in your docs seems to be broken:
https://allegro.ai/docs/references/trains_api_ref/trains_api_ref.html
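Without the docs, this is my best guess at how the lookup would go over the REST API, so please correct me if it is wrong. I am assuming that `auth.login` accepts HTTP basic auth with the access key and secret key and returns a token under `data.token`, and that `queues.get_all` returns the queues under `data.queues`. The helper names and the host in the usage comment are hypothetical.

```python
import base64
import json
import urllib.request
from typing import List


def basic_auth(access_key: str, secret_key: str) -> str:
    """Build an HTTP basic auth header value from the API credentials."""
    raw = f"{access_key}:{secret_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def build_request(api_host: str, endpoint: str, payload: dict,
                  auth_header: str) -> urllib.request.Request:
    """Prepare a JSON POST request for an API endpoint (nothing is sent yet)."""
    return urllib.request.Request(
        url=f"{api_host.rstrip('/')}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": auth_header},
        method="POST",
    )


def call(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def queue_ids(response: dict, name: str) -> List[str]:
    """Pick the IDs of queues with an exact name match out of a response.

    Assumes the response looks like {"data": {"queues": [{"id": ..., "name": ...}]}}.
    """
    queues = response.get("data", {}).get("queues", [])
    return [queue["id"] for queue in queues if queue.get("name") == name]


def find_queue_ids(api_host: str, access_key: str, secret_key: str,
                   name: str) -> List[str]:
    """Log in, list all queues, and filter by name on the client side."""
    login = call(build_request(api_host, "auth.login", {},
                               basic_auth(access_key, secret_key)))
    bearer = "Bearer " + login["data"]["token"]  # assumed response field
    listing = call(build_request(api_host, "queues.get_all", {}, bearer))
    return queue_ids(listing, name)


# Hypothetical usage, not run here:
# find_queue_ids("http://<api-host>:8008", "<access key>", "<secret key>", "oneone")
```

If this is roughly right, an empty result for `oneone` would at least tell me whether the queue I created in the UI exists on the same server the agents talk to.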
Let me know what you think, and thanks again for all your help.