Answered
Hey guys, I'm experiencing seemingly random problems with the experiments. There are 4 GPUs and 8 workers (2 workers per GPU), and sometimes experiments randomly fail (or complete) in the middle of the epoch without any additional info in the logs. What

Hey guys, I'm experiencing seemingly random problems with the experiments. There are 4 GPUs and 8 workers (2 workers per GPU), and sometimes experiments randomly fail (or complete) in the middle of an epoch without any additional info in the logs. What would be the best way to find the root cause?

  
  
Posted 4 years ago

Answers 8


Hi DilapidatedDucks58 ,
Just making sure: do all 8 workers have different worker IDs? (You should see all 8 on the workers page in the UI.)
Also, are they running in docker or venv mode?

  
  
Posted 4 years ago

docker mode
different ids

  
  
Posted 4 years ago

Could you verify you have 8 subfolders named 'venv.X' in the cache folder ~/.trains?
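A quick way to run that folder check from Python, as a minimal sketch. It assumes the cache root is `~/.trains` and that the per-worker folders sit directly under it with names starting with `venv.`, as described above; the helper name `list_venv_dirs` is made up for illustration, and the actual layout may differ between versions.

```python
from pathlib import Path


def list_venv_dirs(cache_root):
    """Return sorted names of subfolders matching 'venv.*' directly under cache_root."""
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        # Cache folder missing: nothing to report.
        return []
    # Keep directories only; stray files with a matching name are ignored.
    return sorted(p.name for p in root.glob("venv.*") if p.is_dir())


if __name__ == "__main__":
    # Assumed cache location, taken from the question above.
    names = list_venv_dirs("~/.trains")
    print(f"found {len(names)} venv folders: {names}")
```

With 8 workers on the machine, the expectation from the answer above is 8 entries in the printed list.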

  
  
Posted 4 years ago

image

  
  
Posted 4 years ago

example of the failed experiment

  
  
Posted 4 years ago

It might be that there is not enough space on our SSD; experiments cache a lot of preprocessed data during the first epoch...

  
  
Posted 4 years ago

If that's the case, check the free space in the experiment's monitoring; you will find the free space logged there in GB.
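If you also want to cross-check on the machine itself, here is a minimal sketch using only Python's standard library. It is not the experiment monitoring mentioned above, just a local check; `free_space_gb` and `has_room` are hypothetical helper names, and the path and threshold are placeholders you would adapt to wherever the cache actually lives.

```python
import shutil


def free_space_gb(path="."):
    """Return free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / 1024 ** 3


def has_room(path, required_gb):
    """True if the filesystem containing `path` has at least `required_gb` GiB free."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    # Placeholder path and threshold; point this at the cache SSD.
    print(f"free: {free_space_gb('.'):.1f} GiB, enough for 50 GiB: {has_room('.', 50)}")
```

Calling `has_room` before the first epoch starts caching would at least turn a silent mid-epoch failure into an explicit message.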

  
  
Posted 4 years ago

nice idea, thanks

  
  
Posted 4 years ago