As for backup - this is usually done by backing up the mounted clearml/data/
folder, but it is usually done while the server is down (otherwise the data may not be backed up correctly)
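A minimal sketch of that backup step, assuming the server containers have already been stopped and that all server data sits under one host folder. The function name `backup_clearml_data` and the archive naming are hypothetical, not part of ClearML.

```shell
#!/bin/sh
# Hypothetical helper: archive a ClearML data folder into a timestamped tarball.
# Assumes the server is already stopped, so the files are not changing underneath tar.
backup_clearml_data() {
    src_dir="$1"    # the mounted data folder, e.g. .../clearml/data
    dest_dir="$2"   # folder where the archive should be written
    if [ ! -d "$src_dir" ]; then
        echo "source folder not found: $src_dir" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    archive="$dest_dir/clearml-data-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C makes the paths inside the archive relative to the parent of the data folder.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")" || return 1
    # Print the archive path so the caller can log or copy it.
    echo "$archive"
}
```

Something like `backup_clearml_data /Users/jax/clearml/data /Users/jax/backups` would then produce one archive per run (the backups folder here is only an example).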
Can you do: curl
from inside the Elastic container?
Can I somehow perform an export or backup?
OK, I guess I will have to kill the whole thing and refresh it.
I'm not familiar with elastic. What role does elastic play in ClearML?
It seems both the queue_metrics
and worker_stats
indices are in red status
docker exec clearml-elastic curl
zsh: no matches found:
Thanks that did solve the problem, the tasks are running again.
[root@2c7498711bef elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 4,
  "active_shards" : 4,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 8,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 33.33333333333333
}
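For quick checks it can help to pull just the `status` field out of a health response like this one. The following is a rough sketch that uses only `tr` and `sed`, on the assumption that tools such as `jq` may not be available inside the container; the function name `cluster_status` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read cluster-health JSON on stdin and print the "status" value.
# Joins the input into one line first, so it works on compact and pretty-printed JSON.
cluster_status() {
    tr -d '\n' | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}
```

Piping the health output into `cluster_status` should print `red`, `yellow` or `green`, which is easier to use in a script than the full JSON.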
Also, I doubt the reason was that you added workers; more likely something happened in Elastic, be it a disk issue or something else
Hi SubstantialElk6, another thing that can be checked is the health of the particular ES indices. Can you please run the command below in the clearml-elastic container and post the results here? curl -XGET
I'm starting to suspect Docker Desktop 🙂
Well, assuming your server data is properly mounted outside of the Docker containers, restarting the server will be just fine. Can you verify the externally mapped folders actually contain data? (in /Users/jax/clearml/data/mongo, /Users/jax/clearml/data/elastic_7, etc.)
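One way to do that verification is sketched below: a small loop that reports whether each mapped folder exists and contains anything. The function name `check_data_dirs` is hypothetical, and it only checks for the presence of files, not whether the data is intact.

```shell
#!/bin/sh
# Hypothetical helper: report whether each given folder exists and holds any files.
# Returns non-zero if at least one folder is missing or empty.
check_data_dirs() {
    status=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "MISSING $dir"
            status=1
        elif [ -z "$(ls -A "$dir" 2>/dev/null)" ]; then
            echo "EMPTY   $dir"
            status=1
        else
            echo "OK      $dir"
        fi
    done
    return "$status"
}
```

For the folders mentioned here that would be `check_data_dirs /Users/jax/clearml/data/mongo /Users/jax/clearml/data/elastic_7`.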
docker exec clearml-elastic curl
zsh: no matches found:
No no, do: docker exec -it clearml-elastic /bin/bash
And then, from the bash inside the container, do: curl
(which is very rare - we're running lots of instances of the clearml server for a very long time and I've never encountered this issue 😞 )
SubstantialElk6 Both indices that are red are not critical for ClearML to function and can be deleted like this: curl -XDELETE '
' curl -XDELETE '
'
To analyze what might have led to this, can you please collect the full ES logs into a file and send it here? sudo docker logs clearml-elastic > log.txt 2>&1
[root@2c7498711bef elasticsearch]# curl
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-05-22T11:33:38.932Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "yHcROdoNRWGfm0Ry062NGQ",
      "node_name" : "clearml",
      "transport_address" : "172.18.0.3:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8349540352",
        "xpack.installed" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0], node[yHcROdoNRWGfm0Ry062NGQ], [P], s[STARTED], a[id=J-nHYEXBSymWrgpo7J2bLw]]"
        }
      ]
    }
  ]
}
Yes, previously run experiments. I will just kill the clearml-elastic container if that might solve the problem.
Well, Elastic is used to store and index all experiment and worker metrics
[root@2c7498711bef elasticsearch]# curl -XGET
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 4hAFNtGkRr-CHNGnUYfbTA 1 1 4724 271 660.9kb 660.9kb
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b M3qgFy1HRU2PibDOr1YOdw 1 1 1221 20 1013.6kb 1013.6kb
red open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 EQK8mnlhRxCrrKK3clcUFA 1 1
red open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 LZBWjeupRiuDPM50EB-0Ow 1 1
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b CP9YJDjMQLOJ3KNJ9ydobA 1 1 16 0 24.7kb 24.7kb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b FCwT4QMkR1G1aaqxvEzEnQ 1 1 3 0 12.7kb 12.7kb
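To pick the problem indices out of a listing like this without reading it by eye, a small filter can help. This is only a sketch: it assumes the listing arrives one index per line, with the health in the first column and the index name in the third, as in the output here. The function name `red_indices` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the names of indices whose health column is "red".
# Expects one index per line: health in column 1, state in column 2, name in column 3.
red_indices() {
    awk '$1 == "red" { print $3 }'
}
```

Piping the index listing into `red_indices` would print only the `worker_stats_...` and `queue_metrics_...` names in this case.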
BTW - spotted some issues with your docker-compose, namely:
` apiserver:
    ...
    volumes:
      ...
      - /Users/jax/clearml/config:/Users/jax/clearml/config
    ...
    environment:
      ...
      CLEARML__apiserver__pre_populate__zip_files: "/Users/jax/clearml/db-pre-populate" `
Specifically, the directories mounted inside the Docker image can't simply be changed to match your external host directories - the server looks for the configuration (and other stuff) in fixed internal directories, so your changes, among other things, effectively make the server ignore any additional configuration files you might add (this is repeated in the ` fileserver ` section as well)
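A rough way to spot mounts like this is to scan the compose file for volume entries whose container-side path falls outside the directory the server image expects. The sketch below takes that expected prefix as a parameter; `/opt/clearml` in the usage line is an assumption about the stock compose file, so check it against the official one before relying on it. The function name `find_odd_mounts` is hypothetical, and the check only covers `volumes` entries, not environment variables.

```shell
#!/bin/sh
# Hypothetical helper: list "host:container" volume entries in a compose file
# whose container-side path does not start with the expected prefix.
# The prefix is an assumption; pass whatever your server image actually expects.
find_odd_mounts() {
    compose_file="$1"
    expected_prefix="$2"
    awk -v prefix="$expected_prefix" '
        # Match list items of the form "- /host/path:/container/path[:mode]".
        /^[[:space:]]*-[[:space:]]*\/[^:]*:\// {
            entry = $0
            sub(/^[[:space:]]*-[[:space:]]*/, "", entry)
            split(entry, parts, ":")
            # parts[2] is the container-side path; flag it if the prefix is not at position 1.
            if (index(parts[2], prefix) != 1) print entry
        }
    ' "$compose_file"
}
```

Usage would look like `find_odd_mounts docker-compose.yml /opt/clearml`; any line it prints is a mount worth a second look.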
If you're running your own server - no, there is no limitation
And yes, there is stuff in there. In fact it's been running for a few weeks with no issues. This appears to have happened after I added new workers, though I can't be sure this is the cause. Is there a limit to the number of workers that I can add for the community edition?
(trying to get as much info as possible first)