Can I assume that if we have two agents spinning up the same experiment, your code will take it from there?
Is this true?
Basically I think I'm asking, is your code multi-node enabled to begin with?
Not for now, I think it can only run on multiple GPUs on one node
Okay, the first step is to make sure your code is multi-node enabled; there is no magic for that 🙂
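For readers following along, a minimal sketch of what "multi-node enabled" usually means for a PyTorch script, using torch.distributed's env:// initialization; the backend choice and everything around it depend on your actual code:

```python
import torch.distributed as dist

def init_distributed():
    # With init_method="env://", torch.distributed reads MASTER_ADDR,
    # MASTER_PORT, RANK and WORLD_SIZE from environment variables,
    # so the same script can be launched on several machines at once.
    dist.init_process_group(backend="nccl", init_method="env://")
    return dist.get_rank(), dist.get_world_size()

if __name__ == "__main__":
    rank, world_size = init_distributed()
    print(f"process {rank} of {world_size} is up")
    # from here: wrap the model in torch.nn.parallel.DistributedDataParallel,
    # use a DistributedSampler for the data loader, etc.
```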
I'll get back to you after I get this done
I've added multi-node support to my code, and I found our lab seems to only have shared user files, because I installed trains on one node but it doesn't appear on the others
but the thing is that I can only use the master to log everything
PompousHawk82 what do you mean by this?
(sure, we can try; conda is sometimes flaky but it is supported)
1. specify conda as the package manager (see the config sketch after these steps): https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press on the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run: trains-agent execute --id <paste_task_id_here>
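For step 1, the change in trains.conf (the file linked above) would look roughly like this, with conda replacing the default pip value:

```
agent {
    package_manager: {
        # options: pip, conda
        type: conda,
    }
}
```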
Let's see how that goes 🙂
And do you need to run your code inside a docker, or is venv enough?
Do you mean there is no shared filesystem among the different machines?
Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following: trains-agent execute --docker --id aabbcc
This means you manually spin up two simultaneous copies of the same experiment; once they are up and running, will your code be able to make the connection between them? (i.e. OpenMPI, torch.distributed, etc.?)
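To make the question concrete, a purely illustrative sketch of how the two copies could find each other with torch.distributed; the address, port and rank values are hypothetical, and how each copy receives them (environment variables, a task parameter you edit per node, etc.) is up to your setup:

```python
import os
import torch.distributed as dist

# Hypothetical rendezvous settings; node 0 acts as the "master" copy.
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # reachable IP of the master node
os.environ.setdefault("MASTER_PORT", "29500")
os.environ.setdefault("WORLD_SIZE", "2")          # two copies of the experiment
os.environ.setdefault("RANK", "0")                # 0 on the first node, 1 on the second

dist.init_process_group(backend="nccl", init_method="env://")
print(f"rank {dist.get_rank()} joined a group of {dist.get_world_size()}")
```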
never done this before, let me do a quick search
I'm just curious about how the trains server and the different nodes communicate about the task queue
Yeah, I'm done with the test, now I can run it as you said
We all use conda, I guess there's no need for docker
It's shared, but only user files, i.e. everything under the ~/ directory
This is assuming you can just run two copies of your code, and they will become aware of one another.
So that means your home folder is always mapped to ~/ on any machine you ssh to?
We start manually: we tell the agent to just execute the task (notice we never enqueued it); if all goes well, we will get to the multi-node part 🙂
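(For the queue-based multi-node part later on, the usual pattern would be to run an agent daemon on each node, e.g. trains-agent daemon --queue default, and then enqueue the cloned task into that queue; the queue name here is just an example.)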
I see, so now we are trying to let the agents spin up the experiment separately and see if they can communicate with each other, right?