Hi GiganticMole91, what version of ClearML server are you using?
Also, can you take a look inside the elastic container to see if there are any errors there?
We are running the latest version (WebApp: 1.7.0-232 • Server: 1.7.0-232 • API: 2.21).
When I run docker logs clearml-elastic
I get lots of logs like this one:
{"type": "server", "timestamp": "2022-10-24T08:51:35,003Z", "level": "INFO", "component": "o.e.i.g.DatabaseNodeService", "cluster.name": "clearml", "node.name": "clearml", "message": "successfully reloaded changed geoip database file [/tmp/elasticsearch-3596639242536548410/geoip-databases/cX7aMqJ4SwCxqM7sYM-S9Q/GeoLite2-City.mmdb]", "cluster.uuid": "vTBu4rY0QhWZV01xnswjDg", "node.id": "cX7aMqJ4SwCxqM7sYM-S9Q"}
But only INFO logs
Were there any changes to your Elastic or your server in the past few days?
No, not at all. I reckon we started seeing errors around the middle of last week. We are using default settings for everything except some password settings on the server.
CostlyOstrich36 any thoughts on how we can debug this further? It's making ClearML practically useless for us.
I googled the ES error es_rejected_execution_exception, and it seems to be caused by excessive load on ES. Apparently the hardware cannot keep up with the pace at which you're sending event batches. I would recommend working with smaller batches and checking whether the error goes away.
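One way to confirm that rejections come from load is to look at the rejected counter of Elasticsearch's write thread pool through the _cat/thread_pool API. The snippet below is a minimal sketch, not part of ClearML: it assumes Elasticsearch is reachable at http://localhost:9200 (in a default ClearML docker-compose setup the port may not be published to the host, in which case the request would have to be made from inside the container network), and the function names are made up for illustration.

```python
import json
import urllib.request


def fetch_write_pool(base_url="http://localhost:9200"):
    """Fetch write thread-pool stats from the _cat API as a list of JSON rows.

    base_url is an assumption; adjust it to wherever Elasticsearch is reachable.
    """
    url = (
        f"{base_url}/_cat/thread_pool/write"
        "?format=json&h=node_name,name,active,queue,rejected"
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def total_rejected(rows):
    """Sum the 'rejected' counters across nodes.

    The _cat API returns numeric columns as strings, so convert them first.
    """
    return sum(int(row.get("rejected") or 0) for row in rows)


# Example usage (requires a reachable Elasticsearch instance):
#   rows = fetch_write_pool()
#   print("rejected write tasks:", total_rejected(rows))
```

A rejected count that keeps growing while experiments are running would be consistent with the write queue filling up faster than the node can drain it.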
GiganticMole91 how many experiments are you running concurrently? Are you reporting a lot of metrics/logs in each experiment?
SuccessfulKoala55 At peak we’ve been running ~50 experiments simultaneously that have been somewhat generous in reported metrics, although not extreme. Our CML server is hosted on an Azure D2S_v3 VM (2 vCPU, 8 GB RAM, 3200 IOPS). It looks like we should probably upgrade, especially the disk specs. (Taking another look at our VM metrics, we reached 100% OS disk IOPS consumed a couple of times.)
Yeah, for this server (which is at or close to the minimum requirements), this is a very heavy load.
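Until the server is upgraded, one option on the client side is to report metrics less often. The following is a rough sketch of that idea only: ThrottledReporter and its every parameter are hypothetical names, and the report callback stands in for whatever function actually sends the metric (for example, a small wrapper around ClearML's Logger.report_scalar).

```python
class ThrottledReporter:
    """Forward only every N-th iteration to an underlying report callback."""

    def __init__(self, report, every=10):
        if every < 1:
            raise ValueError("every must be >= 1")
        self._report = report  # hypothetical callback: report(iteration, value)
        self._every = every

    def __call__(self, iteration, value):
        """Report if the iteration is a multiple of `every`; return True if sent."""
        if iteration % self._every == 0:
            self._report(iteration, value)
            return True
        return False


# Example usage with a stand-in callback that just collects what would be sent.
sent = []
reporter = ThrottledReporter(lambda i, v: sent.append((i, v)), every=10)
for step in range(100):
    reporter(step, float(step))
```

With every=10, the loop above forwards 10 of the 100 values, which cuts the number of reported events per experiment by roughly a factor of ten.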