Answered
Hi, I am trying to execute my code on nvidia/cuda Docker, but it keeps running; it neither fails nor aborts

Hi, I am trying to execute my code on the nvidia/cuda docker image, but it keeps running; it neither fails nor aborts. The last log message is "Successfully installed aadict-0.2.3 asset-0.6.13 globre-0.1.5 pyhocon-0.3.55 requirements-parser-0.2.0 trains-agent-0.15.1 virtualenv-16.7.10". After this log message, nothing happens. What should I do to get my code to execute after this message?

  
  
Posted 4 years ago

Answers 30


It gives this error:
chown: cannot access '/root/.cache/pip': No such file or directory

  
  
Posted 4 years ago

ohh right, my bad:
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && pip install trains-agent && echo done"

  
  
Posted 4 years ago

At the end, there is an error about "pip"

  
  
Posted 4 years ago

MysteriousBee56 Okay, let's try this one:
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && echo done"

  
  
Posted 4 years ago

It worked

  
  
Posted 4 years ago

Okay now let's try: EDIT
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && python3 -m trains-agent --help"

  
  
Posted 4 years ago

It gives this error:
/usr/bin/python3: No module named trains-agent
Because of the error, I thought I had run the first command, but I had actually run the edited version. (That's why it took some time 😞)

  
  
Posted 4 years ago

It worked when I changed python3 -m trains-agent --help to trains-agent --help

  
  
Posted 4 years ago

Okay that might explain the issue...
MysteriousBee56 so what you are saying is
python3 -m trains-agent --help does NOT work
but trains-agent --help does work?

  
  
Posted 4 years ago

AgitatedDove14 Yes, that's what I am saying.
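
A possible explanation, offered as an assumption rather than something confirmed in this thread: pip installs a console script named trains-agent, while the importable package appears to be named trains_agent (with an underscore), as in the python3 -u -m trains_agent command quoted further down. Since python3 -m looks up a module by name, the hyphenated name would not be found. The sketch below only illustrates that naming difference; candidate_module_name and is_importable are hypothetical helpers, not part of trains-agent.

```python
import importlib.util


def candidate_module_name(distribution_name: str) -> str:
    """Guess an importable module name from a pip distribution name.

    Replacing hyphens with underscores is a common convention only,
    not a rule that every package follows.
    """
    return distribution_name.replace("-", "_")


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate a top-level module with this name."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# Compare the name typed after `-m` with the guessed module name.
for name in ("trains-agent", candidate_module_name("trains-agent")):
    print(name, "| valid identifier:", name.isidentifier(),
          "| importable here:", is_importable(name))
```

If the underscore name turns out to be importable in your environment, python3 -m trains_agent --help would be the form to try.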

  
  
Posted 4 years ago

MysteriousBee56 that is so weird ... last one, I promise 🙂
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && echo \$(which python3) && echo \$(which trains-agent)"

  
  
Posted 4 years ago

AgitatedDove14 /usr/bin/python3
/usr/local/bin/trains-agent

  
  
Posted 4 years ago

AgitatedDove14 I may have found something that explains the issue, but I am not sure. In the trains-agent worker.py log, the command is written as python3 -u -m trains_agent execute --disable-monitoring --id 9fe6d610a2b946379255b0fc25b5f9fd') so there is an extra ' at the end. When I run python3 -u -m trains_agent execute --disable-monitoring --id 9fe6d610a2b946379255b0fc25b5f9fd in my local environment, it works and runs the code. However, if I run python3 -u -m trains_agent execute --disable-monitoring --id 9fe6d610a2b946379255b0fc25b5f9fd' it waits for more input (a string). I could not find where the ' comes from.
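
To illustrate why a trailing ' makes a command wait for more input, here is a minimal sketch using Python's shlex module, which tokenizes text with POSIX-shell-like quoting rules. It only approximates what an interactive shell does, and has_unbalanced_quote is a hypothetical helper name introduced for this example.

```python
import shlex


def has_unbalanced_quote(command: str) -> bool:
    """Return True if shell-style tokenizing fails on an unclosed quote."""
    try:
        shlex.split(command)
    except ValueError:
        # shlex raises ValueError when a quote is opened but never closed.
        return True
    return False


good = ("python3 -u -m trains_agent execute --disable-monitoring "
        "--id 9fe6d610a2b946379255b0fc25b5f9fd")
bad = good + "'"  # the same command with the stray quote appended

print("clean command unbalanced:", has_unbalanced_quote(good))
print("with trailing quote unbalanced:", has_unbalanced_quote(bad))
```

An interactive shell reacts to the same situation by prompting for the rest of the quoted string instead of raising an error, which matches the "waits for a string" behavior described above.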

  
  
Posted 4 years ago

MysteriousBee56 that is very strange, but it definitely explains it. Kudos on debugging it!

  
  
Posted 4 years ago

I'm checking now to see where the extra ' could come from

  
  
Posted 4 years ago

MysteriousBee56 when you run the trains-agent with --foreground, it prints the full command line before it starts the docker. Could you send it please?
I can't figure out where the extra ' came from...
Also, could you send the trains.conf file?
(feel free to redact any confidential information)

  
  
Posted 4 years ago

AgitatedDove14

  
  
Posted 4 years ago

MysteriousBee56 and please send this one as well: "when you run the trains-agent with --foreground, it prints the full command line before it starts the docker"

  
  
Posted 4 years ago

Hmmm.
could you change the api_server: http://localhost:8008 to your host IP?
for example:
api_server: http://192.168.1.11:8008

  
  
Posted 4 years ago

Btw, we figured out that the ' belongs to the echo. So there is no problem with that one.

  
  
Posted 4 years ago

"BTW, we figured out that the ' belongs to the echo"

Yep, when seeing the full command it is apparent.

  
  
Posted 4 years ago

I suspect it's the localhost - and the trains-agent is trying too hard to access the port, but for some reason does not report an error ...
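
For context: inside a docker container, localhost normally refers to the container itself rather than to the host machine (unless host networking is used), so a server listening on the host's port 8008 is not reachable through that name. As a rough sketch of how to check reachability with a single attempt and an explicit timeout, instead of retrying silently, something like the following could be run from inside the container. can_connect is a hypothetical helper, not a trains-agent function.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try one TCP connection and report failure instead of retrying."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


# Example usage (host and port are the values mentioned in this thread):
#   can_connect("localhost", 8008)
#   can_connect("192.168.1.11", 8008)
```

If the check succeeds for the machine's IP but not for localhost, that would support the suspicion above.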

  
  
Posted 4 years ago

I will try it with a different port

  
  
Posted 4 years ago

MysteriousBee56 not a different port, just not with "localhost" but with your machine's IP

  
  
Posted 4 years ago

SUCCESS!!!

  
  
Posted 4 years ago

I was just able to reproduce with "localhost"

  
  
Posted 4 years ago

How can I change the api_server from localhost to my machine's IP? I couldn't figure it out. Sorry.

  
  
Posted 4 years ago

MysteriousBee56 Edit in your ~/trains.conf:
api_server: http://localhost:8008
to
api_server: http://192.168.1.11:8008
and obviously the same for web & files

I'll make sure we fix the trains-agent to output an error message instead of trying to silently keep accessing the API server

Getting your machine's IP:
Just run:
ifconfig | grep 'inet addr:'
Then you should see a bunch of lines; pick the one that does not start with 127 or 172.
Then, to verify, run:
ping <my_ip_here>
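
The edit described above can also be sketched in Python. This is only an illustration under assumptions: guess_local_ip and replace_localhost are hypothetical helpers, the sample text is not a real trains.conf, and the web and files ports in it are placeholder values. The IP guess uses a UDP socket, which sends no packets and only asks the OS which local address it would use for outgoing traffic.

```python
import re
import socket


def guess_local_ip() -> str:
    """Best-effort guess of this machine's LAN address."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            # Connecting a UDP socket sends nothing; it only selects a route.
            s.connect(("10.255.255.255", 1))
            return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"  # no usable route; fall back to loopback


def replace_localhost(conf_text: str, ip: str) -> str:
    """Swap 'localhost' for an IP in http(s) URLs, keeping the ports."""
    return re.sub(r"(https?://)localhost\b",
                  lambda m: m.group(1) + ip, conf_text)


# Hypothetical sample; only the api_server line appears in this thread.
sample = (
    "api_server: http://localhost:8008\n"
    "web_server: http://localhost:8080\n"
    "files_server: http://localhost:8081\n"
)
print(replace_localhost(sample, "192.168.1.11"))
```

Review the result by hand before saving it to ~/trains.conf, and check the guessed address with ping as described above, since the guess can pick the wrong interface on machines with several networks.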

  
  
Posted 4 years ago

It worked. Thanks a lot 🙂

  
  
Posted 4 years ago

Yey! MysteriousBee56 kudos on persisting!
I'll make sure we report those errors, because this debug process should have been much shorter 🙂

  
  
Posted 4 years ago