That's it? No apparent error?
After the logs at the top there were only "info"-level logs from PluginsService
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
    bootstrap.memory_lock: "true"
    cluster.name: clearml
    cluster.routing.allocation.node_initial_primaries_recoveries: "500"
    cluster.routing.allocation.disk.watermark.low: 500mb
    cluster.routing.allocation.disk.watermark.high: 500mb
    cluster.routing.allocation.disk.watermark.flood_stage: 500mb
    discovery.zen.minimum_master_nodes: "1"
    discovery.type: "single-node"
    http.compression_level: "7"
    node.ingest: "true"
    node.name: clearml
    reindex.remote.whitelist: '*.*'
    xpack.monitoring.enabled: "false"
    xpack.security.enabled: "false"
  ulimits:
    memlock:
      soft: -1
      hard: -1
    nofile:
      soft: 65536
      hard: 65536
  image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2
  restart: unless-stopped
  ports:
    - "9200:9200"
  volumes:
    - /storage/data/elastic_7:/usr/share/elasticsearch/data
    - /usr/share/elasticsearch/logs
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
This command made all my indices turn green again, except the broken one, which is still red. It comes from https://stackoverflow.com/questions/63403972/elasticsearch-index-in-red-health/63405623#63405623 .
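To see which indices are still red after changing the replica count, one option is to filter the JSON form of the index listing. This is a minimal sketch, assuming the rows come from localhost:9200/_cat/indices?format=json; the function name red_indices and the index name events-log-example are made up for illustration.

```python
import json

def red_indices(cat_rows):
    """Return the names of indices whose health is 'red'."""
    return [row["index"] for row in cat_rows if row.get("health") == "red"]

# Hypothetical sample in the shape returned by _cat/indices?format=json
sample = json.loads("""
[
  {"health": "green", "status": "open", "index": "events-log-example"},
  {"health": "red", "status": "open",
   "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"}
]
""")

print(red_indices(sample))
```

On a live server the rows would have to be fetched first; the sample is hard-coded here so the snippet runs without a cluster.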
So you're saying that deleting other old indices that I don't need could help?
This did not help, I still have the same issue
Solving the replica issue now allowed me to get better insights into why the one index is red.
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [
    {
      "node_id" : "CldaHbiyQWaNcpWtVab35w",
      "node_name" : "clearml",
      "transport_address" : "172.28.0.5:9300",
      "node_attributes" : {
        "ml.machine_memory" : "34244124672",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "emdrRuHVQ8asg5LU_HVkGw",
        "store_exception" : {
          "type" : "corrupt_index_exception",
          "reason" : "failed engine (reason: [refresh failed source[refresh_flag_index]]) (resource=preexisting_corruption)",
          "caused_by" : {
            "type" : "i_o_exception",
            "reason" : "failed engine (reason: [refresh failed source[refresh_flag_index]])",
            "caused_by" : {
              "type" : "corrupt_index_exception",
              "reason" : "codec footer mismatch (file truncated?): actual footer=0 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(NIOFSIndexInput(path=\"/usr/share/elasticsearch/data/nodes/0/indices/T5e15fRWTvW69oI3Cm2BeQ/0/index/_e1il.fdt\")))"
            }
          }
        }
      }
    }
  ]
}
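The interesting part of that output is the innermost caused_by entry, which names the truncated file. A small sketch of how the nested exception could be unwrapped, assuming the response has already been parsed into a dict; root_cause is a hypothetical helper name and the reasons are shortened copies of the ones above.

```python
def root_cause(exception):
    """Follow nested 'caused_by' entries and return the innermost one."""
    while "caused_by" in exception:
        exception = exception["caused_by"]
    return exception

# Trimmed copy of the store_exception shown above (reasons shortened)
store_exception = {
    "type": "corrupt_index_exception",
    "reason": "failed engine (...) (resource=preexisting_corruption)",
    "caused_by": {
        "type": "i_o_exception",
        "reason": "failed engine (...)",
        "caused_by": {
            "type": "corrupt_index_exception",
            "reason": "codec footer mismatch (file truncated?)",
        },
    },
}

cause = root_cause(store_exception)
print(cause["type"], "-", cause["reason"])
```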
1. SSH into the elasticsearch container.
2. Identify the ID of the index that seems to be broken.
3. Run /usr/share/elasticsearch/jdk/bin/java -cp lucene-core*.jar -ea:org.apache.lucene… org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/your-id/0/index/ -verbose -exorcise
This can be dangerous, but it is the only option if you assume that the data is lost anyway. Running step 3 either repairs the broken segments or, as in my case, reports "No problems were detected with this index."
4. If it shows "no problems detected", just go to the index folder and remove any file starting with corrupted_*
5. Restart elasticsearch. The whole cluster turns green.
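The marker-file cleanup can be previewed before anything is deleted. This is a rough sketch, not a tested recovery tool: find_corruption_markers and remove_markers are hypothetical helper names, and the demo runs on a temporary directory with a made-up marker name rather than on real Elasticsearch data.

```python
import tempfile
from pathlib import Path

def find_corruption_markers(index_dir):
    """Return the corrupted_* files in a shard's index directory, sorted by name."""
    return sorted(p for p in Path(index_dir).glob("corrupted_*") if p.is_file())

def remove_markers(index_dir, dry_run=True):
    """Delete the marker files; with dry_run=True only report them."""
    markers = find_corruption_markers(index_dir)
    for marker in markers:
        if not dry_run:
            marker.unlink()
    return markers

# Demo on a throwaway directory instead of real Elasticsearch data
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "corrupted_example").touch()  # hypothetical marker name
    Path(tmp, "_e1il.fdt").touch()          # stands in for a segment file
    listed = [p.name for p in remove_markers(tmp)]  # dry run, nothing deleted
    still_there = Path(tmp, "corrupted_example").exists()
    remove_markers(tmp, dry_run=False)
    gone = not Path(tmp, "corrupted_example").exists()
    data_kept = Path(tmp, "_e1il.fdt").exists()

print(listed, still_there, gone, data_kept)
```

Inside the container the directory would be the index path from the CheckIndex command, with your-id replaced by the actual index ID.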
Also, how much memory is allocated for ES? (it's in the docker-compose file)
I already increased the memory to 8GB after reading similar issues here on the Slack
Well, I just took a look at the log, and it looks like the configuration is for 1GB only (see -Xms1g, -Xmx1g) - perhaps that's the reason?
What version of clearml is your server?
The docker-compose uses clearml:latest
Yes, this happened when the disk got filled up to 100%
I already increased the memory to 8GB after reading similar issues here on the Slack
Just making sure, how exactly did you do that?
docker-compose down
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
docker-compose up -d
The output seen above indicates that the index is corrupt and probably lost, but that is not necessarily the case
Did you wait for all the other indices to reach yellow status?
Yes, I waited until everything was yellow
Since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
root@ubuntu:/opt/clearml# sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-11-09T12:49:13,403Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (//some_ip/clearml-server-data)]], net usable_space [3.4tb], net total_space [6.9tb], types [cifs]" }
{"type": "server", "timestamp": "2021-11-09T12:49:13,407Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "heap size [7.9gb], compressed ordinary object pointers [true]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,529Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "node name [clearml], node ID [CldaHbiyQWaNcpWtVab35w], cluster name [clearml]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,529Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "version[7.6.2], pid[1], build[default/docker/ef48eb35cf30adf4db14086e8aabd07ef6fb113f/2020-03-26T06:34:37.794943Z], OS[Linux/5.4.0-89-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/13.0.2/13.0.2+8]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,530Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM home [/usr/share/elasticsearch/jdk]" }
{"type": "server", "timestamp": "2021-11-09T12:49:14,530Z", "level": "INFO", "component": "o.e.n.Node", "cluster.name": "clearml", "node.name": "clearml", "message": "JVM arguments [-Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=COMPAT, -Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -Djava.io.tmpdir=/tmp/elasticsearch-8140206772120400095, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=data, -XX:ErrorFile=logs/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -Des.cgroups.hierarchy.override=/, -Xms8g, -Xmx8g, -XX:MaxDirectMemorySize=4294967296, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config, -Des.distribution.flavor=default, -Des.distribution.type=docker, -Des.bundled_jdk=true]" }
I'm not sure, but it's possible you can't recover it - 100% disk usage is always a major problem
SuccessfulKoala55 so you say deleting other old indices that I don't need could help?
Try to restart ES and see if it helps
docker-compose down / up does not help
I'm not entirely sure, but it may help
I will try to recover it, but anyway the lesson is to fully separate the fileserver and any output location from mongo, redis and elastic. Also, maybe it makes sense to improve the ES setup to have replicas
I usually use different partitions. The replicas are always a good idea, but they do require more memory and disk space, so this is not in the default configuration
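Since the corruption followed a full disk, a periodic free-space check on the partition holding the Elasticsearch data may be worth having. A minimal sketch using only the standard library; free_space_ok is a hypothetical helper, the 500mb threshold mirrors the watermark values in the compose file above (assumed here to mean 500 * 1024 * 1024 bytes), and "." stands in for the real data path.

```python
import shutil

# 500mb, mirroring the disk watermark settings in the compose file
# (assumption: interpreted as 500 * 1024 * 1024 bytes)
WATERMARK_BYTES = 500 * 1024 * 1024

def free_space_ok(path, min_free_bytes=WATERMARK_BYTES):
    """Return True if the filesystem holding `path` has at least min_free_bytes free."""
    return shutil.disk_usage(path).free >= min_free_bytes

# "." is a placeholder; on the host this would be the mounted data directory
if not free_space_ok("."):
    print("WARNING: less than 500mb free on this partition")
```

A threshold well above the watermark would give more time to react, since other services writing to the same partition can fill it quickly.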
I meant sudo docker logs clearml-elastic
Using top inside the elasticsearch container shows
elastic+ 20  0  17.0g  8.7g 187584 S  2.3 27.2  1:09.18 java
that the 8g are reserved. So setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.
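The JVM arguments in the log contain both -Xms1g/-Xmx1g and, later, -Xms8g/-Xmx8g. The sketch below picks the last occurrence of each flag, assuming the usual JVM behaviour that a repeated option given later overrides the earlier one; the "heap size [7.9gb]" log line is consistent with that. effective_heap is a hypothetical helper name.

```python
import re

def effective_heap(jvm_args):
    """Return the last -Xms and -Xmx values found in a list of JVM arguments."""
    result = {}
    for arg in jvm_args:
        match = re.fullmatch(r"-(Xms|Xmx)(\d+[kKmMgG]?)", arg)
        if match:
            # Later flags overwrite earlier ones (assumed last-wins behaviour)
            result[match.group(1)] = match.group(2)
    return result

# Subset of the "JVM arguments" log line from the thread
args = ["-Xss1m", "-Xms1g", "-Xmx1g", "-XX:+UseConcMarkSweepGC", "-Xms8g", "-Xmx8g"]
print(effective_heap(args))
```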
Can you send a more comprehensive log? Perhaps there are other messages that are related
Which logs would you like?