Hi All, I Have An Elasticsearch Problem On My Clearml Server. The Error Message I Get On The Clearml Webapp Is

Answered

Hi all, I have an Elasticsearch problem on my ClearML server. The error message I get in the ClearML webapp is General data error (TransportError(503, 'search_phase_execution_exception')), which appears on any operation that uses Elasticsearch. I looked into the Elasticsearch Docker container and there is an index with status red. The issue started after the ClearML server ran out of disk space last night. Logs are in the thread.
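To find which index is red, the `_cat/indices` endpoint can be queried with `format=json`. The following is a minimal sketch, assuming Elasticsearch is reachable at `localhost:9200` as in the compose file posted later in this thread; the function names are illustrative only.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: port 9200 is published on the host


def fetch_indices(base_url=ES_URL):
    """Return the rows of GET /_cat/indices as a list of dicts."""
    with urllib.request.urlopen(f"{base_url}/_cat/indices?format=json") as resp:
        return json.load(resp)


def indices_with_health(rows, health):
    """Pick the names of the indices whose 'health' column equals `health`."""
    return sorted(row["index"] for row in rows if row.get("health") == health)


# Usage against a live server (not executed here):
#   print(indices_with_health(fetch_indices(), "red"))
```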

  				
Posted 3 years ago
ClumsyElephant70


Answers 30

Can you send a more comprehensive log - perhaps there are other related messages

Which logs do you want?

  				
Posted 3 years ago
ClumsyElephant70

root@ubuntu:/opt/clearml# sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-11-09T12:49:13,403Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (//some_ip/clearml-server-data)]], net usable_space [3.4tb], net total_space [6.9tb], types [cifs]" }
{"type": "server", "timestamp": "2021-11-09T12:49:13,407Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "heap size [7.9gb], compressed ordinary object pointers [true]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,529Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "node name [clearml], node ID [CldaHbiyQWaNcpWtVab35w], cluster name [clearml]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,529Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "version[7.6.2], pid[1], build[default/docker/ef48eb35cf30adf4db14086e8aabd07ef6fb113f/2020-03-26T06:34:37.794943Z], OS[Linux/5.4.0-89-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.2/13.0.2+8]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,530Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,530Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-8140206772120400095, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xms8g, -Xmx8g, -XX:MaxDirectMemorySize=4294967296, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }

  				
Posted 3 years ago
ClumsyElephant70

elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
    bootstrap.memory_lock: "true"
    cluster.name: clearml
    cluster.routing.allocation.node_initial_primaries_recoveries: "500"
    cluster.routing.allocation.disk.watermark.low: 500mb
    cluster.routing.allocation.disk.watermark.high: 500mb
    cluster.routing.allocation.disk.watermark.flood_stage: 500mb
    discovery.zen.minimum_master_nodes: "1"
    discovery.type: "single-node"
    http.compression_level: "7"
    node.ingest: "true"
    node.name: clearml
    reindex.remote.whitelist: '*.*'
    xpack.monitoring.enabled: "false"
    xpack.security.enabled: "false"
  ulimits:
    memlock:
      soft: -1
      hard: -1
    nofile:
      soft: 65536
      hard: 65536
  image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
  restart: unless-stopped
  ports:
    - "9200:9200"
  volumes:
    - /storage/data/elastic_7:/usr/share/elasticsearch/data
    - /usr/share/elasticsearch/logs
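All three disk watermarks in this compose file are set to the absolute value 500mb. Given as a byte value, a watermark refers to the free space that must remain, so these safeguards only kick in when the disk is nearly full. Below is a minimal sketch of an external free-space check that warns earlier. The helper names and the `margin` factor are hypothetical, and the data path is the host path from the compose file.

```python
import shutil

# Elasticsearch byte-size units are powers of 1024.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(text):
    """Convert a size such as '500mb' into a number of bytes."""
    text = text.strip().lower()
    # Check the two-letter suffixes first, since every unit ends in 'b'.
    for suffix in ("kb", "mb", "gb", "tb", "b"):
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * _UNITS[suffix])
    raise ValueError(f"unrecognised size: {text!r}")


def free_space_ok(path, watermark, margin=2.0):
    """True while free space on `path` is at least `margin` times the watermark."""
    free = shutil.disk_usage(path).free
    return free >= margin * parse_size(watermark)


# Usage on the host (path taken from the compose file above, not executed here):
#   if not free_space_ok("/storage/data/elastic_7", "500mb"):
#       print("free space is getting close to the Elasticsearch watermark")
```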

  				
Posted 3 years ago
ClumsyElephant70

Yes, this happened when the disk got filled up to 100%

  				
Posted 3 years ago
ClumsyElephant70

Cool! 🙂

  				
Posted 3 years ago
SuccessfulKoala55

curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
This command turned all my indices green again, except the broken one, which is still red. It comes from https://stackoverflow.com/questions/63403972/elasticsearch-index-in-red-health/63405623#63405623 .
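On a single-node cluster, replica shards have no second node to be allocated to, so indices configured with replicas stay yellow until the replica count is set to 0. The same request can be sketched in Python; this is an illustration that assumes the server listens on `localhost:9200`, and `replicas_request` is a hypothetical helper name.

```python
import json
import urllib.request


def replicas_request(replicas, base_url="http://localhost:9200", index=None):
    """Build a PUT _settings request that sets number_of_replicas.

    With index=None the setting is applied to all indices, like the curl call.
    """
    path = f"/{index}/_settings" if index else "/_settings"
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# To send it to a live server (not executed here):
#   urllib.request.urlopen(replicas_request(0))
```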

  				
Posted 3 years ago
ClumsyElephant70

Did you wait for all the other indices to reach yellow status?

yes I waited until everything was yellow
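Waiting for a status can also be done through the cluster health API, which accepts `wait_for_status` and `timeout` parameters. This is a minimal sketch assuming the server is at `localhost:9200`; the function names are illustrative, and the handling of a timed-out wait is an assumption noted in the code.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def health_url(base_url="http://localhost:9200", wait_for="yellow", timeout="60s"):
    """Build the cluster health URL that blocks until the status is reached."""
    query = urllib.parse.urlencode({"wait_for_status": wait_for, "timeout": timeout})
    return f"{base_url}/_cluster/health?{query}"


def wait_for_health(url):
    """Return (status, timed_out) from the cluster health response."""
    try:
        with urllib.request.urlopen(url) as resp:
            health = json.load(resp)
    except urllib.error.HTTPError as err:
        # Assumption: a wait that times out is answered with HTTP 408 and the
        # usual JSON body; any other HTTP error is re-raised.
        if err.code != 408:
            raise
        health = json.load(err)
    return health["status"], health["timed_out"]


# Usage against a live server (not executed here):
#   print(wait_for_health(health_url()))
```

The cluster-wide status is the worst status of any index, so while one index remains red this call would be expected to time out rather than report yellow.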

  				
Posted 3 years ago
ClumsyElephant70

Try to restart ES and see if it helps

docker-compose down / up does not help

  				
Posted 3 years ago
ClumsyElephant70

What version of ClearML is your server?

The docker-compose file uses clearml:latest

  				
Posted 3 years ago
ClumsyElephant70

So is this a corrupt storage issue?

  				
Posted 3 years ago
SuccessfulKoala55

I already increased the memory to 8GB after reading similar issues here on the Slack

Just making sure, how exactly did you do that?

  				
Posted 3 years ago
SuccessfulKoala55

That's it? no apparent error?

  				
Posted 3 years ago
SuccessfulKoala55

docker-compose down / up does not help

Did you wait for all the other indices to reach yellow status?

  				
Posted 3 years ago
SuccessfulKoala55

so you say deleting other old indices that I don't need could help?

This did not help, I still have the same issue

  				
Posted 3 years ago
ClumsyElephant70

Solving the replica issue now allowed me to get better insights into why the one index is red.
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "CldaHbiyQWaNcpWtVab35w",
      "node_name" : "clearml",
      "transport_address" : "172.28.0.5:9300",
      "node_attributes" : {
        "ml.machine_memory" : "34244124672",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "emdrRuHVQ8asg5LU_HVkGw",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [refresh failed source[refresh_flag_index]]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [refresh failed source[refresh_flag_index]])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "codec footer mismatch (file truncated?): actual footer=0 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/usr/share/elasticsearch/data/nodes/0/indices/T5e15fRWTvW69oI3Cm2BeQ/0/index/_e1il.fdt\")))"
            }
          }
        }
      }
    }
  ]
}
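The useful part of this output is buried in the nested `caused_by` chain: the innermost entry names the truncated file. Below is a minimal sketch for flattening that chain, assuming the response of the `_cluster/allocation/explain` API has already been parsed into a dict; the function names are illustrative.

```python
def exception_chain(exc):
    """Return (type, reason) pairs from the outermost exception to the root cause."""
    chain = []
    while exc:
        chain.append((exc.get("type"), exc.get("reason")))
        exc = exc.get("caused_by")
    return chain


def store_exceptions(explain):
    """Map node name to exception chain for every node reporting a store_exception."""
    result = {}
    for decision in explain.get("node_allocation_decisions", []):
        exc = decision.get("store", {}).get("store_exception")
        if exc:
            result[decision.get("node_name")] = exception_chain(exc)
    return result


# Usage with an already parsed response (not executed here):
#   for node, chain in store_exceptions(explain).items():
#       print(node, "root cause:", chain[-1])
```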

  				
Posted 3 years ago
ClumsyElephant70

I'm not sure, but it's possible you can't recover it - 100% disk usage is always a major problem

  				
Posted 3 years ago
SuccessfulKoala55

1. ssh into the elasticsearch container
2. identify the id of the index that seems to be broken
3. run /usr/share/elasticsearch/jdk/bin/java -cp lucene-core*.jar -ea:org.apache.lucene… org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/your-id/0/index/ -verbose -exorcise
   This can be dangerous but is the only option if you assume that the data is lost anyway.
4. Either running 3. repairs broken segments, or it shows, as in my case, "No problems were detected with this index."
5. If it shows "no problems detected", just go to the index folder and remove any file starting with corrupted_*
6. restart elasticsearch
7. the whole cluster turns green
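The step of removing the `corrupted_*` marker files can be done more cautiously by listing them first. Below is a minimal sketch with a dry-run default. The helper names are hypothetical, the index directory has to be filled in by hand, and deleting these files remains as risky as described in the steps above.

```python
from pathlib import Path


def corruption_markers(index_dir):
    """List the files in index_dir whose names start with 'corrupted_'."""
    return sorted(p for p in Path(index_dir).glob("corrupted_*") if p.is_file())


def remove_markers(index_dir, dry_run=True):
    """Return the marker file names; delete them only when dry_run is False."""
    markers = corruption_markers(index_dir)
    if not dry_run:
        for marker in markers:
            marker.unlink()
    return [marker.name for marker in markers]


# Usage inside the container (not executed here); index_dir is the same
# .../indices/your-id/0/index/ directory that was passed to CheckIndex:
#   print(remove_markers(index_dir))           # dry run, lists the files only
#   remove_markers(index_dir, dry_run=False)   # actually deletes them
```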

  				
Posted 3 years ago
ClumsyElephant70

Using top inside the elasticsearch container shows that the 8g are reserved:
elastic+ 20 0 17.0g 8.7g 187584 S 2.3 27.2 1:09.18 java
So setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.

  				
Posted 3 years ago
ClumsyElephant70

I'm not entirely sure, but it may help

  				
Posted 3 years ago
SuccessfulKoala55

Since it is a single node, I guess it will not be possible to recover or partially recover the index, right?

  				
Posted 3 years ago
ClumsyElephant70

The output seen above indicates that the index is corrupt and probably lost, but that is not necessarily the case

  				
Posted 3 years ago
ClumsyElephant70

SuccessfulKoala55 so you say deleting other old indices that I don't need could help?

  				
Posted 3 years ago
ClumsyElephant70

I already increased the memory to 8GB after reading similar issues here on the Slack

Just making sure, how exactly did you do that?

docker-compose down
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
docker-compose up -d

  				
Posted 3 years ago
ClumsyElephant70

I usually use different partitions. The replicas are always a good idea, but they do require more memory and disk space, so this is not in the default configuration

  				
Posted 3 years ago
SuccessfulKoala55

Also,

how much memory is allocated for ES? (it's in the docker-compose file)

I already increased the memory to 8GB after reading similar issues here on the Slack

  				
Posted 3 years ago
ClumsyElephant70

I will try to recover it, but anyway the lesson is to fully separate the fileserver and any output location from mongo, redis and elastic. Also, maybe it makes sense to improve the ES setup to have replicas
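One way to check whether that separation actually holds is to compare the device IDs of the directories involved. This is a minimal sketch assuming it runs on the host; `group_by_device` is a hypothetical helper and the paths in the usage comment are placeholders, not paths confirmed by this thread.

```python
import os
from collections import defaultdict


def group_by_device(paths):
    """Return the groups of paths that live on the same filesystem (same st_dev)."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.stat(path).st_dev].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Usage on the host with placeholder paths (not executed here):
#   shared = group_by_device(["/path/to/elastic", "/path/to/mongo", "/path/to/fileserver"])
#   if shared:
#       print("these directories can fill each other's disk:", shared)
```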

  				
Posted 3 years ago
ClumsyElephant70

Very good news!

  				
Posted 3 years ago
ClumsyElephant70

That's it? no apparent error?

After the logs at the top there were only "info"-level logs from PluginsService

  				
Posted 3 years ago
ClumsyElephant70

Well, I just took a look at the log, and it looks like the configuration is for 1GB only (see -Xms1g, -Xmx1g ) - perhaps that's the reason?
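The JVM argument list in the log contains both -Xms1g, -Xmx1g and, further along, -Xms8g, -Xmx8g. When a heap flag is repeated, the HotSpot JVM generally honours the last occurrence, which is consistent with the "heap size [7.9gb]" line in the same log. Below is a minimal sketch that applies this last-one-wins rule to such an argument list; `effective_heap` is an illustrative name.

```python
import re


def effective_heap(jvm_args):
    """Return the last -Xms and -Xmx values seen in a list of JVM arguments."""
    result = {"Xms": None, "Xmx": None}
    for arg in jvm_args:
        match = re.fullmatch(r"-(Xms|Xmx)(\d+[kKmMgG]?)", arg.strip())
        if match:
            # A later flag overwrites an earlier one.
            result[match.group(1)] = match.group(2)
    return result


# Usage with the comma-separated list from the log line (not executed here):
#   effective_heap(jvm_arguments_string.split(","))
```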

  				
Posted 3 years ago
SuccessfulKoala55

I meant sudo docker logs clearml-elastic

  				
Posted 3 years ago
SuccessfulKoala55
