Interesting. Here is what led me to believe that the server I get in the UI and the server that I connect to inside the agents are the same:
In the agent, I looked at TRAINS_API_HOST and then ran nslookup on the URL. Same with the WEB_API_HOST.
root@trains-agent-584dfcc6cd-fxvkb:/# echo $TRAINS_API_HOST
`
root@trains-agent-584dfcc6cd-fxvkb:/# nslookup apiserver-service
Server: 10.100.0.10
Address: 10.100.0.10#53
Name: apiserver-service.trains.svc.cluster.local
Address: 10.100.13...
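The same check (read the host variable, then resolve its hostname) can be sketched in Python for pods where nslookup is not installed. This is only an illustration: `host_from_url` and `resolve` are hypothetical helper names, and the only variable read is `TRAINS_API_HOST` as quoted in the message above.

```python
import os
import socket
from typing import List
from urllib.parse import urlparse


def host_from_url(url: str) -> str:
    """Return the hostname part of a URL such as http://apiserver-service:8008."""
    hostname = urlparse(url).hostname
    if hostname is None:
        raise ValueError(f"no hostname found in {url!r}")
    return hostname


def resolve(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated addresses the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    url = os.environ.get("TRAINS_API_HOST")
    if url is None:
        print("TRAINS_API_HOST is not set")
    else:
        host = host_from_url(url)
        print(host, "->", resolve(host))
```

Running it in both the agent pod and any other pod in the cluster, then comparing the printed addresses, is roughly equivalent to comparing the nslookup results by hand.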
Exactly, the trains service does appear and completed an experiment correctly. I changed it to http://apiserver-service.trains.svc.cluster.local:8008 as mentioned in the beginning of the issue. Here is the output when I try to curl this address from inside the agents in the cluster.
https://allegroai-trains.slack.com/archives/CTK20V944/p1604595233210500?thread_ts=1604593483.209000&cid=CTK20V944
Now that I shut down my local trains server and run two port forwards:
kubectl port-forward -n trains svc/webserver-service 9999:80
kubectl port-forward -n trains svc/apiserver-service 8008:8008
I can see my agents
But if I turn off the API server port forward, I get an empty web server.
It is really weird to me that the webserver running on k8s needs my localhost connection to the API server in order to connect, even though I gave the API server URL to the helm chart.
Is there anything i ca...
I also made a dummy-agent pod by taking the deployment manifest and changing it so that the pod created sleeps instead of running trains-agent in its main process.
I then did the installations manually and then ran your command:
apt-get update ; apt-get install -y curl python3-pip git; curl -sSL
` | sh ;
python3 -m pip install -U pip ;
python3 -m pip install trains-agent ;
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --f...
This solved the problem. Thank you very much!
OK, thanks a lot. Now everything makes more sense.
I have some follow up questions but i think they should go to their own threads 😃
AgitatedDove14 SuccessfulKoala55
Yes, this makes a lot more sense now. Thank you.
I'll give it a go. Once I have something that works, I'll make a GitHub issue to see if it's something you would like to add to the repo.
Thank you very much.
I'm using the helm chart, but it might be 0.16.1. I'll try upgrading and get back to you.
Yeah, I see all 3 projects that were there by default. I cloned an experiment on one of them to see it run on the service queue
I also tried port-forwarding the API server and discovered that something else was binding localhost:8008.
When I browsed there directly, I got the same sort of response that I get by port-forwarding the API server to another port, but with a different server ID.
It turns out that the webserver on my k8s tried to access localhost:8008 and reached the trains server that was deployed on my local machine.
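A quick way to notice this kind of clash before starting a port forward is to test whether anything already accepts connections on the local port. The following is a minimal sketch using only the standard library; `port_in_use` is a hypothetical helper name, and it checks 127.0.0.1 only, so a listener bound solely to the IPv6 loopback would be missed.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8008 is the API server port mentioned in the thread
    if port_in_use(8008):
        print("localhost:8008 is already taken by another process")
    else:
        print("localhost:8008 is free")
```

If the port is reported as taken while no port forward is running, another process (here, a locally deployed trains server) is the one answering on it.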
I can reach the UI from the browser (I'm using kubectl port-forward, if that makes any difference).
When I attach to one of the pods and run the command, I get the following output
I feel like I'm missing all of them 😥.
I tried following the Linux, k8s and helm deployment guides, and no matter how much I refresh, my server always looks like this.
FriendlySquid61 I didn't notice that the agent service could be configured. Looking inside the values.yml in the chart source code, I see that the default value is http://apiserver-service:8008 . I'll try changing the agent's value to it and see what happens.
SuccessfulKoala55 I'll give that a try too and write back what happens
I think I'll have to continue this on Sunday. Thank you so much for your help so far.
Before i liked the idea behind trains but now i also adore the dedicated sup...
Do you mean the dev console in the browser? If so, yes
the console output goes like this
` WARNING: The TRAINS_HOST_IP variable is not set. Defaulting to a blank string.
WARNING: The TRAINS_AGENT_GIT_USER variable is not set. Defaulting to a blank string.
WARNING: The TRAINS_AGENT_GIT_PASS variable is not set. Defaulting to a blank string.
Creating network "trains_backend" with driver "bridge"
Creating network "trains_default" with the default driver
Creating trains-fileserver ... done
Creating trains-elastic ... done
Creating trains-mongo ...
Here is the ps command output
` CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d0a38a3fd514 allegroai/trains:latest "/opt/trains/wrapper…" 26 minutes ago Up 26 minutes 8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp trains-webserver
fa86fb49e928 allegroai/trains-agent-services:latest ...
Do you know who i can talk to to have the devs add this use case to their FAQ?
Here is the network tab
Thanks, much appreciated
Yes, this happens both with docker compose on my pc where all ports are open, and on k8s, where the other ports are open.
When i had another container take one of the ports, i got warnings during the deployment or the docker-compose up command
Hey FriendlySquid61 and SuccessfulKoala55 . I followed your guidance and am back with the results.
First of all, I changed the host URLs to follow the format of the default agentservices values in the helm chart.
Now they look like this:
` agent:
  numberOfTrainsAgents: 1
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "nvidia/cuda"
  agentVersion: ""
  # made the hosts into k8s dns
  trainsApiHost: " "
  trainsWebHost: " "
  trainsFilesHost: " "
  trainsGitUser: null
  t...
And thank you AgitatedDove14 TimelyPenguin76 too for your help
Honestly, this bug has been fudging me for a week.
OMG! you are right!
I browsed in incognito and everything looks fine now thank you so much!
SuccessfulKoala55 So far, I only saw how the credentials are passed in the config files. Can you point me to where it looks for env vars for authentication?
AgitatedDove14
I thought about the config maps for the credentials. Having the URLs of each server component (api, web, file) makes sense. The problem with an external load balancer is that I expose the servers outside of the cluster, which I'm trying to avoid. It might be the case that my thinking about this is mistaken altogether and ...
Looking at the API server logs again, I see the same errors. Specifically:
[2020-11-03 17:37:27,433] [8] [WARNING] [trains.service_repo] Returned 400 for users.get_preferences in 2ms, msg=Invalid user id: id=c7e45a3f03d04d8d99151a6210522a5f, company=d1bd92a3b039400cbafc60a7a5b1e52b
[2020-11-03 17:37:27,510] [8] [WARNING] [trains.service_repo] Returned 400 for users.get_current_user in 2ms, msg=Invalid user (failed loading user)
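When there are many such lines, it can help to pull out the status code, endpoint and message programmatically. The sketch below is based only on the two log lines quoted above; the pattern is an assumption about the format, `parse_line` is a hypothetical helper, and the meaning of the second bracketed field (named `proc` here) is a guess.

```python
import re
from typing import Dict, Optional

# Pattern inferred from the two quoted lines, not from any documented log format.
LINE_RE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\] \[(?P<proc>\d+)\] \[(?P<level>\w+)\] "
    r"\[(?P<logger>[^\]]+)\] Returned (?P<status>\d+) for (?P<endpoint>\S+) "
    r"in (?P<ms>\d+)ms, msg=(?P<msg>.*)"
)


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of a 'Returned ...' log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "[2020-11-03 17:37:27,510] [8] [WARNING] [trains.service_repo] "
        "Returned 400 for users.get_current_user in 2ms, "
        "msg=Invalid user (failed loading user)"
    )
    fields = parse_line(sample)
    if fields is not None:
        print(fields["status"], fields["endpoint"], fields["msg"])
```

Filtering the parsed lines on `status == "400"` and grouping by `endpoint` would show whether every failing call is a `users.*` call, as in the two lines above.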
Ohh i see. I thought that the webserver was a slim fullstack app that passes the requests to the api server and gets them back.
So i guess that the api address it looks for is the same as what my browser thinks the webserver's url is except with an api. prefix (unless localhost) and a different port?
And from what i understand, the static webapp also needs to talk to the fileserver and perhaps other components of the stack like the elastic search deployment or mongo db from my browser?
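The guess about the API address can be written down as a small function to make it concrete. This only encodes the rule as guessed in the message above (same host with an api. prefix unless it is localhost, and a different port); it is not taken from the webapp's code or documentation. `guess_api_url` is a hypothetical name, and the default port 8008 is the API server port used elsewhere in the thread.

```python
from urllib.parse import urlparse

# Hosts treated as "local" in this sketch; an assumption, not a documented list.
LOCAL_HOSTS = {"localhost", "127.0.0.1"}


def guess_api_url(web_url: str, api_port: int = 8008) -> str:
    """Apply the guessed rule: keep localhost as-is, otherwise prefix the host with 'api.'."""
    parsed = urlparse(web_url)
    host = parsed.hostname
    if host is None:
        raise ValueError(f"no hostname found in {web_url!r}")
    api_host = host if host in LOCAL_HOSTS else "api." + host
    return f"{parsed.scheme}://{api_host}:{api_port}"


if __name__ == "__main__":
    # A web UI reached through a port forward on localhost:9999
    print(guess_api_url("http://localhost:9999"))
```

Under this rule, a browser that reaches the web UI through a port forward on localhost would look for the API on localhost:8008, which matches the behaviour described earlier in the thread.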
Looking at your helm repo, it seems that the 0.16.2 version has not been pushed yet.
Is this something you can resolve on your end, or should i download the source of the helm chart and change some version flags myself?