Answered
Hi everyone, has anyone ever had issues with an Elasticsearch index being corrupted? We are unable to load the "Scalars" tab on any experiment without getting the error

Hi everyone, has anyone ever had issues with an Elasticsearch index being corrupted? We are unable to load the "scalars" tab on any experiment without getting the error: Error 100: General data error (ApiError(503, 'search_phase_execution_exception', None)). Diving into this a bit more and running curl None gives us the following:

{
  "note" : "No shard was specified in the explain API request, so this response explains a randomly chosen unassigned shard. There may be other unassigned shards in this cluster which cannot be assigned for different reasons. It may not be possible to assign this shard until one of the other shards is assigned correctly. To explain the allocation of other shards (whether assigned or unassigned) you must specify the target shard in the request to this API.",
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2024-08-28T09:57:26.523Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "85c1ZE3gTrqvov4AY2LXnQ",
      "node_name" : "clearml",
      "transport_address" : "172.19.0.4:9300",
      "node_attributes" : {
        "ml.machine_memory" : "67360030720",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "33285996544"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "0mE00e0yQSyTtJGQSPeJeQ",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [merge failed]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [merge failed])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "checksum failed (hardware problem?) : expected=f0199c51 actual=508854e (resource=BufferedChecksumIndexInput(MMapIndexInput(path=\"/usr/share/elasticsearch/data/nodes/0/indices/DIrYFcq5SW6yCFUBVwV-SQ/0/index/_lvu1b.cfs\") [slice=_lvu1b.fdt]))"
            }
          }
        }
      }
    }
  ]
}
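
(In case it's useful: the redacted curl above is just the standard cluster allocation explain API, the same API the "note" field in the response refers to, targeted at this specific shard. Roughly, assuming ES is exposed on the default localhost:9200, it looks like this:)

$ # explain allocation for the corrupted primary shard (index name taken from the response above)
$ curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' \
    -d '{
          "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
          "shard": 0,
          "primary": true
        }'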

I can also see the corrupted file with

$ ls /opt/clearml/data/elastic_7/nodes/0/indices/DIrYFcq5SW6yCFUBVwV-SQ/0/index/
corrupted_A6n_6MHlRcyDZ68HdB7B6w ...

Does anyone know why this might have happened, or if there is any way to recover the index and avoid data loss? Many thanks 🙂
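
(To make the question more concrete: would something like the cluster reroute sketch below, using allocate_stale_primary with accept_data_loss, even be a sensible thing to try here, or will it refuse because the only copy is corrupt rather than just stale? Index and node name are taken from the explain output above; the host/port is assumed.)

$ # force-allocate the primary from the existing shard copy - explicitly accepts data loss
$ curl -s -X POST "localhost:9200/_cluster/reroute?pretty" \
    -H 'Content-Type: application/json' \
    -d '{
          "commands": [
            {
              "allocate_stale_primary": {
                "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
                "shard": 0,
                "node": "clearml",
                "accept_data_loss": true
              }
            }
          ]
        }'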

Posted 19 days ago

Answers 2


Hi @<1625666182751195136:profile|MysteriousParrot48>, I'm afraid this looks like a pure Elasticsearch issue; I'd suggest checking the ES forums for help with this.

Posted 19 days ago

Thanks @<1523701070390366208:profile|CostlyOstrich36>, I thought it might be! I'll have a look over there.

Posted 19 days ago