Sample logs from the pod running training:
2023-07-27 17:38:21,854 - clearml.Metrics - ERROR - Action failed <504/0: events.add_batch (<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
)>
Log from the clearml-apiserver pod:
[2023-07-27 17:57:22,033] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 60021ms, msg=General data error: err=('13 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '1f59ef315c90304e14f12b72fa6dd2aa', 'status': 503,..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [13] requests and a refresh]
You have an issue with Elasticsearch: the primary shard for that index is not active. I assume this is related to the previous issue you reported with the ES pods.
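To confirm, you can query Elasticsearch's cluster health and allocation APIs directly. Below is a minimal sketch in Python, assuming the ClearML Elasticsearch service is reachable on localhost:9200 (e.g. via a kubectl port-forward; the exact service name depends on your deployment). The index name is taken from the error message above; everything else is the standard Elasticsearch REST API.

```python
# Minimal sketch, assuming ES is reachable on localhost:9200, e.g. after:
#   kubectl port-forward svc/<your-elasticsearch-service> 9200:9200
# (the service name is an assumption; adjust to your deployment)
import json
import urllib.request

ES = "http://localhost:9200"

def get(path: str) -> str:
    with urllib.request.urlopen(f"{ES}{path}") as resp:
        return resp.read().decode()

# Cluster health: "red" means at least one primary shard is unassigned.
health = json.loads(get("/_cluster/health"))
print("status:", health["status"],
      "| unassigned shards:", health["unassigned_shards"])

# List shards that are not STARTED to locate the affected index.
shards = get("/_cat/shards?h=index,shard,prirep,state,unassigned.reason")
for line in shards.splitlines():
    if "STARTED" not in line:
        print(line)

# Ask Elasticsearch why the primary shard from the error is unassigned.
body = json.dumps({
    "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
    "shard": 0,
    "primary": True,
}).encode()
req = urllib.request.Request(
    f"{ES}/_cluster/allocation/explain",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.load(resp), indent=2))
```

If the cluster status is red with an unassigned primary, the allocation-explain output usually points at the cause, e.g. a node that left the cluster or a disk-watermark block, which would line up with the earlier problems on the ES pods.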