ColossalAnt7

Moderator

4 Questions, 29 Answers

Active since 10 January 2023

Last activity 12 months ago

Reputation

Badges 1

29 × Eureka!


0 Votes 6 Answers 1K Views

Thank you for your help so far. I have a question about Trains authentication and privacy when deploying on k8s. I want to integrate building a trains-server into our IaC. Now that I got a server to work with an agent deployment, I'm thinking about authorizati...

mlops

4 years ago

0 Votes 25 Answers 1K Views

Hey guys, does anyone know what it means if I deployed a new trains server and when I access it my tab looks like this?

clearml

4 years ago

0 Votes 29 Answers 1K Views

Hey guys, another question about deploying my own trains server. I have a trains-server deployed on my k8s cluster using the trains helm chart (which is awesome). Now I want to create a deployment running trains-agent as specified in the [trains-helm repo...

mlops

4 years ago

0 Votes 6 Answers 1K Views

Hey guys. I tried running the PyTorch MNIST example on a trains-agent by running it locally, then resetting the experiment, and then enqueueing it to the default queue. All works well, but it seems the environment building process gets stuck on a manual...

mlops pytorch

4 years ago

0 Hey guys, another question about deploying my own trains server. I have a trains-server deployed on my k8s cluster using the trains helm chart (which is awesome). Now I want to create a deployment running trains-agent as specified in the [trains-helm repo...

I can reach the UI from the browser (I'm using kubectl port-forward, if that makes any difference).

When I attach to one of the pods and run the command, I get the following output

4 years ago

Now that I shut down my local trains server and run two port forwards:
kubectl port-forward -n trains svc/webserver-service 9999:80
kubectl port-forward -n trains svc/apiserver-service 8008:8008
I can see my agents.
But if I turn off the API server port forward, I get an empty web server.
It is really weird to me that the webserver running on k8s needs my localhost connection to the API server in order to connect, even though I gave the api-server URL to the helm chart.

Is there anything I ca...

4 years ago

0 Thank you for your help so far. I have a question about Trains authentication and privacy when deploying on k8s. I want to integrate building a trains-server into our IaC. Now that I got a server to work with an agent deployment, I'm thinking about authorizati...

SuccessfulKoala55 So far, I have only seen how the credentials are passed in the config files. Can you point me to where it looks for env vars for authentication?

AgitatedDove14
I thought about the config maps for the credentials. Having the URLs of each server component (api, web, file) makes sense. The problem with an external load balancer is that I expose the servers outside of the cluster, which I'm trying to avoid. It might be the case that my thinking about this is mistaken altogether and ...

4 years ago

Exactly, the trains service does appear and completed an experiment correctly. I changed it to http://apiserver-service.trains.svc.cluster.local:8008 as mentioned in the beginning of the issue. Here is the output when I try to curl this address from inside the agents in the cluster: https://allegroai-trains.slack.com/archives/CTK20V944/p1604595233210500?thread_ts=1604593483.209000&cid=CTK20V944

4 years ago

Hey FriendlySquid61 and SuccessfulKoala55. I followed your guidance and am back with the results.
First of all, I changed the host URLs to follow the format of the default agentservices values in the helm chart.
Now they look like this:
agent:
  numberOfTrainsAgents: 1
  nvidiaGpusPerAgent: 0
  defaultBaseDocker: "nvidia/cuda"
  agentVersion: ""
  # made the hosts into k8s dns
  trainsApiHost: " "
  trainsWebHost: " "
  trainsFilesHost: " "
  trainsGitUser: null
  t...

4 years ago

I also tried port-forwarding the apiserver and discovered that something else was already bound to localhost:8008.
When I browsed there directly, I got the same sort of response that I get by port-forwarding the API server to another port,
but with a different server ID.

It turns out that the webserver on my k8s cluster tried to access localhost:8008 and reached the trains server that was deployed on my local machine.
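
A quick way to spot this kind of clash before starting a port forward is to check whether anything is already listening on the port. The following is only a minimal sketch: `port_in_use` is a hypothetical helper name, and 8008 is simply the API server port mentioned in this thread.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 8008 is the API server port used in this thread
    print("localhost:8008 already taken:", port_in_use(8008))
```

If this reports the port as taken before any `kubectl port-forward` is running, some other local process (here, a local trains server) is answering on it.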

4 years ago

OK, thanks a lot. Now everything makes more sense.
I have some follow-up questions, but I think they should go in their own threads 😃

4 years ago

Interesting. Here is what led me to believe that the server I get in the UI and the server that I connect to inside the agents are the same:
In the agent, I looked at TRAINS_API_HOST and then ran nslookup on the URL. Same with the WEB_API_HOST.
root@trains-agent-584dfcc6cd-fxvkb:/# echo $TRAINS_API_HOST
root@trains-agent-584dfcc6cd-fxvkb:/# nslookup apiserver-service
Server: 10.100.0.10
Address: 10.100.0.10#53

Name: apiserver-service.trains.svc.cluster.local
Address: 10.100.13...

4 years ago

I also made a dummy-agent pod by taking the deployment manifest and changing it so that the pod sleeps instead of running trains-agent as its main process.
I then ran the installations manually and then ran your command
apt-get update ; apt-get install -y curl python3-pip git; curl -sSL ` | sh ;
python3 -m pip install -U pip ;
python3 -m pip install trains-agent ;
TRAINS_DOCKER_SKIP_GPUS_FLAG=1 TRAINS_AGENT_K8S_HOST_MOUNT=/root/.trains:/root/.trains trains-agent daemon --docker --f...

4 years ago

Yeah, I see all 3 projects that were there by default. I cloned an experiment on one of them to see it run on the service queue

4 years ago

FriendlySquid61 I didn't notice that the agent service could be configured. Looking inside the values.yml in the chart source code, I see that the default value is http://apiserver-service:8008 . I'll try changing the agents' value to it and see what happens.
SuccessfulKoala55 I'll give that a try too and write back what happens.

I think I'll have to continue this on Sunday. Thank you so much for your help so far.
Before, I liked the idea behind trains, but now I also adore the dedicated sup...

4 years ago

AgitatedDove14 SuccessfulKoala55
Yes, this makes a lot more sense now. Thank you.
I'll give it a go. Once I have something that works, I'll open a GitHub issue to see if it's something you would like to add to the repo.

Thank you very much.

4 years ago

Ohh, I see. I thought that the webserver was a slim full-stack app that passes the requests to the API server and gets them back.
So I guess that the API address it looks for is the same as what my browser thinks the webserver's URL is, except with an api. prefix (unless localhost) and a different port?
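
To make that guess concrete, here is a minimal sketch of the rule as worded in this post. It is not taken from the trains source and may not match what the web app really does; `guess_api_url` is a hypothetical name, and port 8008 is assumed only because it is the API server port used elsewhere in this thread.

```python
from urllib.parse import urlsplit

API_PORT = 8008  # assumption: the API server port used throughout this thread


def guess_api_url(web_url: str) -> str:
    """Apply the guessed rule: same host for localhost, else an 'api.' prefix."""
    parts = urlsplit(web_url)
    host = parts.hostname or ""
    if host not in ("localhost", "127.0.0.1"):
        # Guessed behaviour for non-local hosts: prepend 'api.' to the hostname
        host = "api." + host
    return f"{parts.scheme}://{host}:{API_PORT}"
```

Under this rule, a browser pointed at http://localhost:9999 would look for the API at http://localhost:8008, which matches what I saw with the port forwards.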

4 years ago

And from what I understand, the static web app also needs to talk to the fileserver, and perhaps to other components of the stack like the Elasticsearch deployment or MongoDB, from my browser?

4 years ago

0 Hey guys. I tried running the PyTorch MNIST example on a trains-agent by running it locally, then resetting the experiment, and then enqueueing it to the default queue. All works well, but it seems the environment building process gets stuck on a manual...

Looking at your helm repo, it seems that the 0.16.2 version has not been pushed yet.
Is this something you can resolve on your end, or should I download the source of the helm chart and change some version flags myself?

4 years ago

This solved the problem. Thank you very much!

4 years ago

I'm using the helm chart, but it might be 0.16.1. I'll try upgrading and get back to you.

4 years ago

0 Hey guys, does anyone know what it means if I deployed a new trains server and when I access it my tab looks like this?

And thank you AgitatedDove14 and TimelyPenguin76 too for your help.
Honestly, this bug has been fudging me for a week.

4 years ago

Here is the ps command output:
CONTAINER ID   IMAGE                                    COMMAND                  CREATED          STATUS          PORTS                                           NAMES
d0a38a3fd514   allegroai/trains:latest                  "/opt/trains/wrapper…"   26 minutes ago   Up 26 minutes   8008/tcp, 8080-8081/tcp, 0.0.0.0:8080->80/tcp   trains-webserver
fa86fb49e928   allegroai/trains-agent-services:latest   ...

4 years ago

OMG! You are right!
I browsed in incognito and everything looks fine now. Thank you so much!

4 years ago

The console output goes like this:
WARNING: The TRAINS_HOST_IP variable is not set. Defaulting to a blank string.
WARNING: The TRAINS_AGENT_GIT_USER variable is not set. Defaulting to a blank string.
WARNING: The TRAINS_AGENT_GIT_PASS variable is not set. Defaulting to a blank string.
Creating network "trains_backend" with driver "bridge"
Creating network "trains_default" with the default driver
Creating trains-fileserver ... done
Creating trains-elastic ... done
Creating trains-mongo ...

4 years ago

Do you mean the dev console in the browser? If so, yes.

4 years ago

Looking at the API server logs again, I see the same errors. Specifically:
[2020-11-03 17:37:27,433] [8] [WARNING] [trains.service_repo] Returned 400 for users.get_preferences in 2ms, msg=Invalid user id: id=c7e45a3f03d04d8d99151a6210522a5f, company=d1bd92a3b039400cbafc60a7a5b1e52b
[2020-11-03 17:37:27,510] [8] [WARNING] [trains.service_repo] Returned 400 for users.get_current_user in 2ms, msg=Invalid user (failed loading user)
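
For longer logs, the failing calls can be pulled out with a small script. This is only a sketch: the regular expression is inferred from the two log lines quoted in this post rather than from any documented log format, and `failed_calls` is a hypothetical helper name.

```python
import re

# Pattern inferred from lines such as:
# [2020-11-03 17:37:27,433] [8] [WARNING] [trains.service_repo] Returned 400 for users.get_preferences in 2ms, msg=...
LOG_LINE = re.compile(
    r"\[(?P<time>[^\]]+)\] \[\d+\] \[(?P<level>\w+)\] \[[^\]]+\] "
    r"Returned (?P<status>\d{3}) for (?P<endpoint>[\w.]+) in \d+ms, msg=(?P<msg>.*)"
)


def failed_calls(log_text: str) -> list:
    """Return (endpoint, status, message) for every response with status >= 400."""
    results = []
    for line in log_text.splitlines():
        match = LOG_LINE.search(line)
        if match and int(match["status"]) >= 400:
            results.append((match["endpoint"], int(match["status"]), match["msg"].strip()))
    return results
```

Run on the two lines quoted here, it would list `users.get_preferences` and `users.get_current_user`, both with status 400, which points at the user lookup rather than at networking.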

4 years ago

I feel like I'm missing all of them 😥.

I tried following the Linux, k8s, and helm deployment guides, and no matter how much I refresh, my server always looks like this.

4 years ago

Yes, this happens both with docker compose on my PC, where all ports are open, and on k8s, where the other ports are open.
When I had another container take one of the ports, I got warnings during the deployment or the docker-compose up command.

4 years ago

Do you know who I can talk to in order to have the devs add this use case to their FAQ?

4 years ago

Here is the network tab

4 years ago

Thanks, much appreciated

4 years ago