Thanks, that did solve the problem; the tasks are running again.
SubstantialElk6 Both of the red indices are not critical for ClearML to function and can be deleted like this: curl -XDELETE '
' curl -XDELETE '
'
To analyze what might have led to this, can you please collect the full ES logs into a file and send it here? sudo docker logs clearml-elastic > log.txt 2>&1
Well, Elastic is used to store and index all experiment and worker metrics
I'm not familiar with Elastic. What role does Elastic play in ClearML?
It seems both the queue_metrics and worker_stats indices are in red status
[root@2c7498711bef elasticsearch]# curl -XGET
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b CP9YJDjMQLOJ3KNJ9ydobA 1 1 16 0 24.7kb 24.7kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b FCwT4QMkR1G1aaqxvEzEnQ 1 1 3 0 12.7kb 12.7kb
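For anyone who wants to pick the red indices out of a listing like this without reading it by eye, here is a minimal sketch. It assumes the output has one index per line with the health in the first column and the index name in the third, as in the listing above. The function name `red_indices` is made up for illustration, and the sample is a shortened copy of the same output.

```python
def red_indices(cat_output: str) -> list[str]:
    """Return the names of indices whose health column reads 'red'."""
    names = []
    for line in cat_output.splitlines():
        cols = line.split()
        # Assumed column order: health, status, index name, uuid, pri, rep, ...
        if len(cols) >= 3 and cols[0] == "red":
            names.append(cols[2])
    return names


# Shortened copy of the listing posted in this thread.
sample = """\
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
"""

for name in red_indices(sample):
    print(name)
```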
Hi SubstantialElk6, another thing that can be checked is the health of the individual ES indices. Can you please run the command below in the clearml-elastic container and post the results here? curl -XGET
BTW - spotted some issues with your docker-compose, namely:
` apiserver:
    ...
    volumes:
      ...
      - /Users/jax/clearml/config:/Users/jax/clearml/config
      ...
    environment:
      ...
      CLEARML__apiserver__pre_populate__zip_files: "/Users/jax/clearml/db-pre-populate" `
Specifically, the directories mounted inside the docker image can't be identical to your external host directories. The server looks for its configuration (and other stuff) in fixed internal directories, so your changes, among others, effectively make the server ignore any additional configuration files you might add (this is repeated in the ` fileserver ` section as well)
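A rough way to spot this pattern is to look for volume entries where the host side and the container side are the same path. This is only a sketch for this particular setup: identical paths are not an error in Docker in general, the helper name `identical_mounts` is made up, and the second volume entry in the example is hypothetical.

```python
def identical_mounts(volumes: list[str]) -> list[str]:
    """Return volume specs of the form host:container[:mode] where host == container."""
    flagged = []
    for spec in volumes:
        # Naive split; does not handle Windows drive letters such as C:\path.
        parts = spec.split(":")
        if len(parts) >= 2 and parts[0] == parts[1]:
            flagged.append(spec)
    return flagged


volumes = [
    "/Users/jax/clearml/config:/Users/jax/clearml/config",  # from the compose file above
    "/host/some/dir:/container/some/dir",                   # hypothetical entry
]

for spec in identical_mounts(volumes):
    print("host and container paths are identical:", spec)
```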
As for backup - this is usually done by backing up the mounted clearml/data/ folder, but it should be done while the server is down (otherwise the data may not be backed up correctly)
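As a sketch of the folder backup described above, the data directory can be packed into a compressed archive once the server has been stopped. The function name `backup_data` and the archive name are made up for illustration, and the code does not check that the server is actually down, so that part is up to you.

```python
import shutil
from pathlib import Path


def backup_data(data_dir: str, dest_base: str) -> str:
    """Pack data_dir into <dest_base>.tar.gz and return the archive path.

    Stop the server before calling this; nothing here verifies that.
    """
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"data directory not found: {src}")
    # Archive the folder itself (not only its contents) relative to its parent.
    return shutil.make_archive(
        dest_base, "gztar", root_dir=str(src.parent), base_dir=src.name
    )
```

For the setup in this thread the call might look like `backup_data("/Users/jax/clearml/data", "/Users/jax/clearml-backup")`, where the destination name is only an example.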
[root@2c7498711bef elasticsearch]# curl
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-05-22T11:33:38.932Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "yHcROdoNRWGfm0Ry062NGQ",
      "node_name" : "clearml",
      "transport_address" : "172.18.0.3:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8349540352",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0], node[yHcROdoNRWGfm0Ry062NGQ], [P], s[STARTED], a[id=J-nHYEXBSymWrgpo7J2bLw]]"
        }
      ]
    }
  ]
}
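Note that this particular explanation is about a replica shard ("primary" : false) being blocked by the same_shard decider. On a single-node cluster a replica has no other node to go to, which is what leaves an index yellow rather than red. A small sketch for pulling the blocking deciders out of such a response follows; the function name `blocking_deciders` is made up, and the sample is a trimmed copy of the output above.

```python
import json


def blocking_deciders(explain: dict) -> list[tuple[str, str]]:
    """Return (node_name, decider) pairs for every decider that answered NO."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append((node.get("node_name", "?"), decider.get("decider", "?")))
    return blocked


# Trimmed copy of the allocation explanation posted in this thread.
sample = json.loads("""
{
  "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "node_allocation_decisions": [
    {
      "node_name": "clearml",
      "node_decision": "no",
      "deciders": [
        {"decider": "same_shard", "decision": "NO"}
      ]
    }
  ]
}
""")

kind = "primary" if sample["primary"] else "replica"
for node_name, decider in blocking_deciders(sample):
    print(f"{kind} shard {sample['shard']} blocked on node {node_name} by {decider}")
```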
(trying to get as much info as possible first)
Can you do: curl
from inside the elastic container?
Can I somehow perform an export or backup?
OK, I guess I will have to kill the whole thing and refresh it.
I'm starting to suspect Docker Desktop 🙂
(which is very rare - we're running lots of instances of the clearml server for a very long time and I've never encountered this issue 😞 )
Also, I doubt the reason was that you added workers. More likely something happened in Elastic, be it some disk issue or something else
If you're running your own server - no, there is no limitation
And yes, there is stuff in there. In fact, it's been running for a few weeks with no issues. This appears to have happened after I added new workers, though I can't be sure this is the cause. Is there a limit to the number of workers that I can add for the community edition?
[root@2c7498711bef elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 8,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 33.33333333333333
}
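As a quick sanity check on these numbers: the reported percentage is consistent with 4 active shards out of 12 in total (4 active plus 8 unassigned). The sketch below recomputes it from the fields in the output above. The formula is an assumption that happens to match this response, not something taken from the Elasticsearch source, and the function name `active_percent` is made up.

```python
def active_percent(health: dict) -> float:
    """Share of shards that are active, as a percentage of all shards."""
    active = health["active_shards"]
    total = active + health["initializing_shards"] + health["unassigned_shards"]
    return 100.0 * active / total if total else 100.0


# Fields copied from the cluster health output in this thread.
health = {
    "status": "red",
    "active_shards": 4,
    "initializing_shards": 0,
    "unassigned_shards": 8,
    "active_shards_percent_as_number": 33.33333333333333,
}

print(health["status"], round(active_percent(health), 2))
```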
docker exec clearml-elastic curl
zsh: no matches found:
No no, do: docker exec -it clearml-elastic /bin/bash
And then from the bash inside the container, do: curl
Well, assuming your server data is properly mounted outside of the dockers, restarting the server will be just fine. Can you verify the externally mapped folders actually contain data? (in /Users/jax/clearml/data/mongo, /Users/jax/clearml/data/elastic_7 etc...)
docker exec clearml-elastic curl
zsh: no matches found:
Yes, previously run experiments. I will just kill the clearml-elastic container if that may solve the problem.