Sample logs from the pod running training:
2023-07-27 17:38:21,854 - clearml.Metrics - ERROR - Action failed <504/0: events.add_batch (<html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
)>
Log from the clearml-apiserver pod:
[2023-07-27 17:57:22,033] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 60021ms, msg=General data error: err=('13 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': '1f59ef315c90304e14f12b72fa6dd2aa', 'status': 503,..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [13] requests and a refresh]
You have an issue with Elasticsearch: the primary shard for that index is not active. I assume this is related to the previous issue you reported with the ES pods.
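To confirm, you can query Elasticsearch's cluster health and allocation APIs directly. Below is a minimal sketch in Python, assuming the ClearML Elasticsearch service is reachable on localhost:9200 (e.g. via a kubectl port-forward; the exact service name depends on your deployment). The index name is taken from the error message above; everything else is the standard Elasticsearch REST API.

```python
# Minimal sketch, assuming ES is reachable on localhost:9200, e.g. after:
#   kubectl port-forward svc/<your-elasticsearch-service> 9200:9200
# (the service name is an assumption; adjust to your deployment)
import json
import urllib.request

ES = "http://localhost:9200"

def get(path: str) -> str:
    with urllib.request.urlopen(f"{ES}{path}") as resp:
        return resp.read().decode()

# Cluster health: "red" means at least one primary shard is unassigned.
health = json.loads(get("/_cluster/health"))
print("status:", health["status"],
      "| unassigned shards:", health["unassigned_shards"])

# List shards that are not STARTED to locate the affected index.
shards = get("/_cat/shards?h=index,shard,prirep,state,unassigned.reason")
for line in shards.splitlines():
    if "STARTED" not in line:
        print(line)

# Ask Elasticsearch why the primary shard from the error is unassigned.
body = json.dumps({
    "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
    "shard": 0,
    "primary": True,
}).encode()
req = urllib.request.Request(
    f"{ES}/_cluster/allocation/explain",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.dumps(json.load(resp), indent=2))
```

If the cluster status is red with an unassigned primary, the allocation-explain output usually points at the cause, e.g. a node that left the cluster or a disk-watermark block, which would line up with the earlier problems on the ES pods.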