Answered
Hi guys, I’m trying to install it on my lab server, but when I try to create credentials, it reports an error and gives more info: Error 301 : Invalid user id: id=f46262bde88b4928997351a657901d8b, company=d1bd92a3b039400cbafc60a7a5b1e52b

Hi guys, I’m trying to install it on my lab server, but when I try to create credentials, it reports an error and gives more info:
Error 301 : Invalid user id: id=f46262bde88b4928997351a657901d8b, company=d1bd92a3b039400cbafc60a7a5b1e52b

  
  
Posted 3 years ago

Answers 30


yes

  
  
Posted 3 years ago

and I found that our lab seems to share only user files, because I installed trains on one node but it doesn’t appear on the others

Do you mean there is no shared filesystem among the different machines?

  
  
Posted 3 years ago

And do you need to run your code inside a Docker container, or is a venv enough?

  
  
Posted 3 years ago

We all use conda, so I guess there is no need for Docker

  
  
Posted 3 years ago

yes

  
  
Posted 3 years ago

I think it can only run on multiple GPUs on one node

Okay, the first step is to make sure your code is multi-node enabled; there is no magic for that 🙂

  
  
Posted 3 years ago

PompousHawk82 what do you mean by this?

but the thing is that I can only use the master to log everything

  
  
Posted 3 years ago

This is assuming you can just run two copies of your code, and they will become aware of one another.

  
  
Posted 3 years ago

I’ll get back to you after I get this done

  
  
Posted 3 years ago

I think that is good enough

  
  
Posted 3 years ago

I’m just curious how the trains server on different nodes communicates about the task queue

  
  
Posted 3 years ago

I’m just curious how the trains server on different nodes communicates about the task queue

We start manually: we tell the agent to just execute the task (notice we never enqueued it). If all goes well, we will get to the multi-node part 🙂

  
  
Posted 3 years ago

Basically, I think I'm asking: is your code multi-node enabled to begin with?

  
  
Posted 3 years ago

I see, so now we are trying to have the agents spin up the experiment separately and see if the copies can communicate with each other, right?

  
  
Posted 3 years ago

I’ve never done this before, let me do a quick search

  
  
Posted 3 years ago

Correct

  
  
Posted 3 years ago

So that means your home folder is always mapped to ~/ on any machine you ssh to?

  
  
Posted 3 years ago

Yes, let's assume we have a task with id aabbcc
On two different machines you can do the following:
trains-agent execute --docker --id aabbcc
This means you manually spin up two simultaneous copies of the same experiment. Once they are up and running, will your code be able to make the connection between them (e.g. OpenMPI, torch.distributed, etc.)?
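To make the "will your code connect the copies" question concrete, here is a minimal sketch of what each copy might do at startup, assuming environment-variable rendezvous in the style of torch.distributed (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE). The helper name `read_rendezvous` is hypothetical, and nothing here is part of trains-agent itself; each node would need these variables set before the task runs.

```python
import os


def read_rendezvous(env=None):
    """Collect the settings one copy needs to find the others (hypothetical helper)."""
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))  # total number of copies
    rank = int(env.get("RANK", "0"))              # index of this copy
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK {rank} is outside WORLD_SIZE {world_size}")
    return {
        "master_addr": env.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(env.get("MASTER_PORT", "29500")),
        "rank": rank,
        "world_size": world_size,
    }


if __name__ == "__main__":
    cfg = read_rendezvous()
    print(cfg)
    # With PyTorch installed, these variables would typically be consumed by:
    # torch.distributed.init_process_group(backend="gloo", init_method="env://")
```

If the code only ever sees a world size of 1, the two copies started by the agents will simply run as two independent experiments.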

  
  
Posted 3 years ago

(sure, we can try, conda is sometimes flaky but is supported)
1. specify conda as the package manager: https://github.com/allegroai/trains-agent/blob/9a3f950ac689c50ba3415c42749a4bd8059e89a7/docs/trains.conf#L49
2. make sure trains-agent is installed on both nodes
3. assuming you already have an experiment in the system, right click on the experiment and clone it. Then press the ID button next to the experiment name, and copy the task ID
4. ssh to each node and run:
trains-agent execute --id <paste_task_id_here>
Let's see how that goes 🙂
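If you would rather not type the command on every node by hand, the last step could be scripted roughly like this. It is only a sketch: the node names are made up, the helper `build_execute_commands` is hypothetical, and it assumes passwordless ssh to each node. The only trains-agent flags used are the ones mentioned in this thread (`execute`, `--id`, `--docker`).

```python
import shlex


def build_execute_commands(task_id, nodes, use_docker=False):
    """Build one ssh command per node that runs the same task (hypothetical helper)."""
    commands = {}
    for node in nodes:
        agent_cmd = ["trains-agent", "execute"]
        if use_docker:
            agent_cmd.append("--docker")
        agent_cmd += ["--id", task_id]
        # ssh takes the remote command as a single string, so quote each part.
        remote = " ".join(shlex.quote(part) for part in agent_cmd)
        commands[node] = ["ssh", node, remote]
    return commands


if __name__ == "__main__":
    # "node-a" and "node-b" are placeholder host names.
    for node, cmd in build_execute_commands("aabbcc", ["node-a", "node-b"]).items():
        print(node, "->", cmd)
        # To actually launch: subprocess.Popen(cmd)
```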

  
  
Posted 3 years ago

Yeah, I’m done with the test, now I can run it the way you said

  
  
Posted 3 years ago

OK, I’ll try to do that ASAP

  
  
Posted 3 years ago

Not for now, I think it can only run on multiple GPUs on one node

  
  
Posted 3 years ago

I’ve added multi-node support to my code, and I found that our lab seems to share only user files, because I installed trains on one node but it doesn’t appear on the others

  
  
Posted 3 years ago

Is this a correct assumption?

  
  
Posted 3 years ago

You mean two agents on two nodes?

  
  
Posted 3 years ago

Can I assume that if we have two agents spinning the same experiment, your code will take it from there?

Is this true?

  
  
Posted 3 years ago

but the thing is that I can only use the master to log everything
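For anyone reading later, "only the master logs" usually looks something like the sketch below. It assumes the rank of each copy is exposed through a RANK environment variable, as torch.distributed launchers commonly do; `is_master` and `report_scalar` are hypothetical helpers, and `print` stands in for whatever logger you actually use.

```python
import os


def is_master(env=None):
    """Return True only for the copy with rank 0 (assumes a RANK env variable)."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def report_scalar(name, value, env=None, sink=print):
    """Send a metric to the sink from the master copy only; other ranks stay silent."""
    if is_master(env):
        sink(f"{name}={value}")
        return True
    return False


if __name__ == "__main__":
    report_scalar("loss", 0.5)
```

With this pattern, the non-master copies still train but report nothing, so the experiment shows a single set of metrics.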

  
  
Posted 3 years ago

yes

  
  
Posted 3 years ago

but at least it’s working now

  
  
Posted 3 years ago

It’s shared, but only user files: everything under the ~/ directory

  
  
Posted 3 years ago