
Hey 👋

TensorBoard Logs Overwhelming Elasticsearch

I am running a ClearML server; however, when running experiments with TensorBoard logging I am seeing the Elasticsearch indexing time increase drastically, and in some cases I have also seen timeout errors on the tasks trying to upload metrics.

2023-12-05 23:39:35,585 - clearml.Metrics - ERROR - Action failed <500/100: events.add_batch/v1.0 (General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch', port='9200'): Read timed out. (read timeout=60))))>

In one case I was able to resolve this by turning off TensorBoard logging, setting auto_connect_frameworks={"tensorboard": False} on Task.init. Ideally, though, I would like to keep the TB logs on.
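
For reference, this is roughly what that workaround looks like (a minimal sketch; the project and task names are placeholders):

from clearml import Task

# Disable TensorBoard auto-logging for this task (placeholder project/task names)
task = Task.init(
    project_name="my-project",                      # placeholder
    task_name="train-without-tb-capture",           # placeholder
    auto_connect_frameworks={"tensorboard": False},  # skip TB event capture
)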

Has anyone else had this issue?
Thanks for any help 🙂

  
  
Posted 5 months ago

Answers 6


Currently running it on a t3.xlarge, which has 4 vCPUs, 16 GB RAM, and a 300 GB SSD.

  
  
Posted 5 months ago

... training script was set to upload every epoch. Seems like this resulted in a torrent of metrics being uploaded.

Oh, that makes sense. So basically you were bombarding the server with requests and ending up with a kind of denial of service.

  
  
Posted 5 months ago

For an update 🙂
I think we identified the issue: when moving from the training dataset to a fine-tuning dataset (which was 1/1000th the size), our training script was still set to upload metrics every epoch. It seems this resulted in a torrent of metrics being uploaded.

Since modifying this to report less frequently, we have seen the index latency drop dramatically.
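
For anyone hitting the same thing, a minimal sketch of the kind of change we made (assuming a PyTorch-style loop with TensorBoard's SummaryWriter; the interval, epoch count, and loss are placeholders):

from torch.utils.tensorboard import SummaryWriter

REPORT_EVERY = 10                # assumed reporting interval, tune to taste
NUM_EPOCHS = 100                 # placeholder

writer = SummaryWriter(log_dir="runs/finetune")
for epoch in range(NUM_EPOCHS):
    train_loss = 1.0 / (epoch + 1)        # placeholder for the real epoch loss
    if epoch % REPORT_EVERY == 0:         # report every N epochs instead of every epoch
        writer.add_scalar("loss/train", train_loss, epoch)
writer.close()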

  
  
Posted 5 months ago

Yep, almost a self-DDoS.

  
  
Posted 5 months ago

Hi @<1590152178218045440:profile|HarebrainedToad56>
Yes, you are correct: all TB logs are stored in Elasticsearch in the ClearML backend. This usually scales well and rarely has issues, as long, of course, as the clearml-server is running on a strong enough machine. How much RAM / HD do you have on the clearml-server?

  
  
Posted 5 months ago

How much free RAM / disk do you have there now? How's the CPU utilization? How many Tasks are working with this machine at the same time?

  
  
Posted 5 months ago