Hi guys,
Last night one of our agents (0.16.1) was disconnected from our trains-server while executing an experiment. I saw that because the experiment it was running had the status Aborted and I could not see the agent in the list of available workers. Hence I re-established the connection, and the agent sent the logs to the server but killed the task.
I can see in the logs, after reconnection of the agent to the server:
2020-11-12 09:00:33 User aborted: stopping task (3)
2020-11-12 09:00:33 020-11-12 09:00:11,203 - trains.Task - ERROR - Action failed <400/110: models.update_for_task/v1.0 (Invalid task status (model can only be updated for tasks in the ['created', 'in_progress'] states): id=..., company=..)> (task=..)

Shouldn't the trains-agent be able to detect that the server is not available, stack the logs locally and, as soon as the server is reachable again, send the logs of the running experiment to the server and continue the experiment instead of killing it?

Posted 2 years ago

Hi JitteryCoyote63 ,
This behavior is actually the result of a cleanup service running inside the Trains Server, called the non-responsive tasks watchdog. This service is meant to clean up any dangling tasks/experiments that were left in an invalid or running state and did not report for a long time (for example, when you run development code and simply abort it in your debugger).
The non-responsive timeout (after which such experiments are deemed non-responsive) is currently set to 2 hours and can easily be changed in the server's configuration. The setting is services.tasks.non_responsive_tasks_watchdog.threshold_sec, so you can add a services.conf configuration file and set the non_responsive_tasks_watchdog.threshold_sec value to any number you wish.
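For illustration, a services.conf along these lines might raise the timeout. This is only a sketch under some assumptions: that the file is written in the same HOCON-style syntax as the server's other configuration files, that the leading "services." part of the setting path comes from the file name (so the file itself starts at "tasks"), and that 21600 (6 hours) is just an example value. Where the file has to be placed and whether the server needs a restart depend on your deployment.

```
# services.conf (assumed layout, adjust to your deployment)
tasks {
    non_responsive_tasks_watchdog {
        # Seconds without a report before a task is treated as non-responsive.
        # 7200 would match the 2-hour default mentioned above; 21600 is an example.
        threshold_sec: 21600
    }
}
```

A larger threshold gives a disconnected agent more time to reconnect before its task is aborted, at the cost of genuinely dead tasks staying in a running state for longer.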

Posted 2 years ago

very cool, good to know, thanks SuccessfulKoala55 🙂

Posted 2 years ago