(which is very rare - we've been running lots of instances of the clearml server for a very long time and I've never encountered this issue 😞 )
Ok, I guess I will have to kill the whole thing and refresh it.
Can I somehow perform an export or backup?
Well, Elastic is used to store and index all experiment and worker metrics
Thanks that did solve the problem, the tasks are running again.
Yes, previously run experiments. I will just kill the clearml-elastic container if that may solve the problem.
I'm not familiar with elastic. What role does elastic play in ClearML?
SubstantialElk6 Both indices that are red are not critical for ClearML to function and can be deleted like this:
curl -XDELETE ' '
curl -XDELETE ' '
To analyze the possible causes, can you please collect the full ES logs into a file and send it here?
sudo docker logs clearml-elastic > log.txt 2>&1
Can you do: curl from inside the elastic container?
I'm starting to suspect Docker Desktop 🙂
If you're running your own server - no, there is no limitation
docker exec clearml-elastic curl
zsh: no matches found:
(trying to get as much info as possible first)
And yes, there is stuff in there. In fact, it's been running for a few weeks with no issues. This appears to have happened after I added new workers, though I can't be sure this is the cause. Is there a limit to the number of workers that I can add for the community edition?
Hi SubstantialElk6, another thing that can be checked is the health of the particular ES indices. Can you please run the command below in the clearml-elastic container and post the results here?
curl -XGET
[root@2c7498711bef elasticsearch]# curl -XGET
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b CP9YJDjMQLOJ3KNJ9ydobA 1 1 16 0 24.7kb 24.7kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b FCwT4QMkR1G1aaqxvEzEnQ 1 1 3 0 12.7kb 12.7kb
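For anyone reading along: a minimal Python sketch of how a listing like the one above could be scanned for red indices. It assumes the output uses the default column order (health, status, index name, ...) as seen in this thread. The function name `red_indices` is made up for illustration and is not part of ClearML or Elasticsearch.

```python
def red_indices(cat_output):
    """Return the names of indices whose health column reads 'red'."""
    names = []
    for line in cat_output.splitlines():
        fields = line.split()
        # Assumed column order: health, status, index name, uuid, ...
        if len(fields) >= 3 and fields[0] == "red":
            names.append(fields[2])
    return names


# Three rows copied from the listing posted in this thread.
sample = """\
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
"""

for name in red_indices(sample):
    print(name)
```

Red rows carry no document counts or sizes, so the sketch only relies on the first three columns.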
docker exec clearml-elastic curl
zsh: no matches found:
No no, do:
docker exec -it clearml-elastic /bin/bash
And then, from the bash inside the container, do:
curl
It seems both the queue_metrics and worker_stats indices are in red status
[root@2c7498711bef elasticsearch]# curl
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-05-22T11:33:38.932Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "yHcROdoNRWGfm0Ry062NGQ",
      "node_name" : "clearml",
      "transport_address" : "172.18.0.3:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8349540352",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0], node[yHcROdoNRWGfm0Ry062NGQ], [P], s[STARTED], a[id=J-nHYEXBSymWrgpo7J2bLw]]"
        }
      ]
    }
  ]
}
As for backup - this is usually done by backing up the mounted clearml/data/ folder, ideally while the server is down (otherwise the data may not be backed up correctly)
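A rough Python sketch of that kind of folder backup, for illustration only. It assumes the server containers have already been stopped and that the destination lies outside the data folder. The function name `backup_data_dir` and the archive naming are made up here, not something ClearML provides.

```python
import tarfile
import time
from pathlib import Path


def backup_data_dir(data_dir, dest_dir):
    """Archive data_dir into a timestamped .tar.gz under dest_dir.

    Assumes the server is stopped, so files are not changing mid-copy,
    and that dest_dir is not located inside data_dir.
    """
    data_dir = Path(data_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest_dir / f"clearml-data-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the folder name, e.g. data/mongo/...
        tar.add(str(data_dir), arcname=data_dir.name)
    return archive


# Hypothetical usage; adjust both paths to your own setup:
# backup_data_dir("/Users/jax/clearml/data", "/Users/jax/clearml-backups")
```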
Well, assuming your server data is properly mounted outside of the Docker containers, restarting the server will be just fine. Can you verify the externally mapped folders actually contain data? (in /Users/jax/clearml/data/mongo, /Users/jax/clearml/data/elastic_7, etc.)
Also, I doubt the reason was that you added workers. More likely something happened in Elastic, be it some disk issue or something else
[root@2c7498711bef elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 8,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 33.33333333333333
}
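As a quick sanity check on those numbers, here is a small Python sketch that recomputes the active-shard percentage from the counts in the response. It assumes the percentage is active shards divided by active plus initializing plus unassigned shards, which matches the figures posted here but is not confirmed anywhere in this thread.

```python
import json

# Subset of the health response posted above.
health = json.loads("""
{
  "status": "red",
  "active_shards": 4,
  "initializing_shards": 0,
  "unassigned_shards": 8,
  "active_shards_percent_as_number": 33.33333333333333
}
""")

# Assumed formula: active / (active + initializing + unassigned).
total = (
    health["active_shards"]
    + health["initializing_shards"]
    + health["unassigned_shards"]
)
percent = 100.0 * health["active_shards"] / total

print(f"{health['active_shards']} of {total} shards active ({percent:.2f}%)")
```

So 4 of 12 shards are active and the other 8 are unassigned, which lines up with the reported value of roughly 33.3%.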
BTW - spotted some issues with your docker-compose, namely:
` apiserver:
    ...
    volumes:
      ...
      - /Users/jax/clearml/config:/Users/jax/clearml/config
      ...
    environment:
      ...
      CLEARML__apiserver__pre_populate__zip_files: "/Users/jax/clearml/db-pre-populate" `
Specifically, the internal (container-side) mounted directories can't be made identical to your external host directories. The server looks for the configuration (and other stuff) in fixed internal directories, so your changes, among others, effectively make the server ignore any additional configuration files you might add (this is repeated in the ` fileserver ` section as well)