Hi GiganticMole91, what version of ClearML server are you using?
Also, can you take a look inside the Elastic container to see if there are any errors there?
Googling the ES error es_rejected_execution_exception, it seems it's caused by excessive load on ES. Apparently the hardware cannot keep up with the pace at which you're sending event batches. I would recommend working with smaller batches and checking whether the error goes away.
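For example, if your training loop reports scalars every iteration, you can thin out the reporting on the client side so fewer event batches reach ES. This is just a rough sketch, not your actual code; the loop, metric names and REPORT_EVERY value are assumptions:

from clearml import Task

task = Task.init(project_name="examples", task_name="throttled-reporting")
logger = task.get_logger()

REPORT_EVERY = 50  # hypothetical: report 1 in every 50 iterations instead of all of them

for iteration in range(10_000):
    loss = train_step()  # placeholder for your real training step
    if iteration % REPORT_EVERY == 0:
        # fewer report_scalar calls -> fewer/smaller event batches hitting the server and ES
        logger.report_scalar(title="loss", series="train", value=loss, iteration=iteration)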
No, not at all. I reckon we started seeing errors around mid-last week. We are using default settings for everything except some password stuff on the server.
CostlyOstrich36 any thoughts on how we can further debug this? It's making ClearML practically useless for us.
GiganticMole91 how many experiments are you running concurrently? Are you reporting a lot of metrics/logs in each experiment?
Yeah, for this server (which is at or close to the minimum requirements), this is really a very heavy load
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run docker logs clearml-elastic
I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node
.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7s
YM-S9Q/GeoLite2-City.mmdb]", "cluster.uuid": "vTBu4rY0QhWZV01xnswjDg", "node.id": "cX7aMqJ4SwCxqM7sYM-S9Q"}
But only INFO logs
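If the container log only shows INFO lines, the rejections may still be visible in ES's own write thread-pool counters, which is where es_rejected_execution_exception comes from when the write queue overflows. A minimal sketch, assuming you can reach ES on port 9200 (e.g. by exposing the port or exec-ing into the clearml-elastic container; the host/port here are assumptions):

import requests

# ES _cat API: active/queued/rejected counts for the write thread pool.
# A growing "rejected" column matches the es_rejected_execution_exception errors.
resp = requests.get(
    "http://localhost:9200/_cat/thread_pool/write",
    params={"v": "true", "h": "node_name,name,active,queue,rejected"},
)
print(resp.text)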
SuccessfulKoala55 At peak we’ve been running ~50 experiments simultaneously that have been somewhat generous in reported metrics, although not extreme. Our ClearML server is hosted on an Azure D2S_v3 VM (2 vCPU, 8 GB RAM, 3200 IOPS). Looks like we should probably upgrade, especially the disk specs. (Taking another look at our VM metrics, we reached 100% OS disk IOPS consumed a couple of times.)
Were there any changes to your Elastic or your server in the past few days?