Hi all, we have clearml-server running on a Kube pod, and a GPU server running the clearml-agent which we use to queue jobs. For some reason, our Kube pod restarted (we're looking into why), but in the process all jobs on the worker…


Hi @SuccessfulKoala55, thanks for the reply! The output above is from grep -i network /var/log/syslog on the machine running the agent. That's good to hear that ClearML is pretty resilient to network outages 🙂. Do you have any suggestions on how we can start tracking down the cause of this?

This is the only clue that was logged to the console in the ClearML server: "2024-11-21 06:57:13 Process terminated by user". The first errors in the agent logs appeared at 06:56:01.
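
In case it helps anyone digging into something similar, this is roughly the kind of time-window search we're running next on the agent machine (a sketch; journal retention may not reach back that far, and the syslog path is whatever your distro uses):

    # Everything the journal captured in the minutes around the failure
    journalctl --since "2024-11-21 06:55:00" --until "2024-11-21 07:00:00"

    # Same window in syslog, not just network-related lines
    grep 'Nov 21 06:5[5-9]' /var/log/syslog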

I asked our HPC folks and they were not able to see any obvious network dropouts on other servers in the same location. Our DevOps eng also didn't see anything happen in Kube at that time. Looking at the uptime of the server and the agent pods/machines, neither has rebooted in the time since this issue.
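
For reference, these are the kinds of checks we ran on the Kube and machine side (a sketch; the clearml namespace here is a placeholder for wherever the server pods actually live):

    # Restart counts and ages for the server pods
    kubectl get pods -n clearml -o wide

    # Cluster events sorted by time, most recent last
    kubectl get events -n clearml --sort-by='.lastTimestamp'

    # Reboot history on the agent machine
    last reboot | head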

This might be a tricky one to track down since we've only seen it a handful of times...
