OK, we'll take a look and get back to you 🙂
I restarted it after I got the errors, because as everyone knows, turning it off and on usually works 😄
Hi ReassuredTiger98 ,
I think the first thing to do is to disable the cleanup service until we figure this out 🙂
Btw, can you bash into the elastic container and get some info?
ReassuredTiger98 I see now that you're probably using an older version of the cleanup service
ReassuredTiger98 would it be possible to receive the entire output of the cleanup script? It's stored as the log for the cleanup task
It is server version 1.0 and everything that came with it.
However, deleting tasks gives me errors.
ReassuredTiger98 it's strange - in the log I can see messages such as: DEBUG Deleting Task id=<some-id> data folder <some-folder>
But I can't find the source of these messages in the ClearML examples/services/cleanup/cleanup_service.py file - are you using an older version?
Also, the current script should display messages such as Deleting <num> tasks - which I also don't see in the log...
[root@dc01deffca35 elasticsearch]# curl
{
  "cluster_name" : "clearml",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 10,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 50.0
}
[root@dc01deffca35 elasticsearch]# curl
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b                   hVmpOK3jSTu70P2iq73gQg 1 1   3895575   1987186   2.7gb   2.7gb
yellow open events-plot-                                                  RGsBmP0ATm-eAcjmO7g07w 1 1       173         0 444.9kb 444.9kb
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 9KyNOaWDQrSEGur5EHmYng 1 1 379634665 123815996  69.9gb  69.9gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05         UdUSFjRbTUm3ipUR7BYNrw 1 1   3996966         0 276.5mb 276.5mb
yellow open events-training_debug_image-                                  yC84lTIcSeGuWDp1tTjCRw 1 1       189         0  78.2kb  78.2kb
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b  izfS1NQSSQ-6unvT5crncA 1 1     21786      8358   8.5mb   8.5mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05        KqoCxx9uQpmkyxFThq3-RQ 1 1   1560657         0  83.9mb  83.9mb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b                  Zg3yMULaQVCn7XXuGZnJHA 1 1       250      9026 125.2mb 125.2mb
yellow open events-log-                                                   1rAf70nETguPJUQuk6NJsg 1 1      2215         0 602.9kb 602.9kb
yellow open events-training_stats_scalar-                                 ZORZKCR_ROuzm_-LC7-IXw 1 1      7174         0 979.7kb 979.7kb
[root@dc01deffca35 elasticsearch]# curl
{
  "error" : {
    "root_cause" : [
      {
        "type" : "circuit_breaking_exception",
        "reason" : "[parent] Data too large, data for [<http_request>] would be [7944925456/7.3gb], which is larger than the limit of [7888427417/7.3gb], real usage: [7944925456/7.3gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]",
        "bytes_wanted" : 7944925456,
        "bytes_limit" : 7888427417,
        "durability" : "PERMANENT"
      }
    ],
    "type" : "circuit_breaking_exception",
    "reason" : "[parent] Data too large, data for [<http_request>] would be [7944925456/7.3gb], which is larger than the limit of [7888427417/7.3gb], real usage: [7944925456/7.3gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]",
    "bytes_wanted" : 7944925456,
    "bytes_limit" : 7888427417,
    "durability" : "PERMANENT"
  },
  "status" : 429
}
[root@dc01deffca35 elasticsearch]# curl
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8293445776/7.7gb], which is larger than the limit of [7888427417/7.3gb], real usage: [8293445776/7.7gb], new bytes reserved: [0/0b], usages [request=32880/32.1kb, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]","bytes_wanted":8293445776,"bytes_limit":7888427417,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8293445776/7.7gb], which is larger than the limit of [7888427417/7.3gb], real usage: [8293445776/7.7gb], new bytes reserved: [0/0b], usages [request=32880/32.1kb, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]","bytes_wanted":8293445776,"bytes_limit":7888427417,"durability":"PERMANENT"},"status":429}
[root@dc01deffca35 elasticsearch]#
Yea, the one script that is preinstalled.
It could be that the clearml-server misbehaves either while the cleanup is ongoing or even after it finishes.
Use: docker exec -it clearml-elastic /bin/bash
and once inside, copy the output of each of the following commands: curl
curl
curl
curl
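(The actual URLs were stripped when this conversation was exported. Judging from the output pasted above, they were most likely the standard Elasticsearch diagnostic endpoints; the list below is an illustrative assumption, not the original commands.)
    curl -s 'http://localhost:9200/_cluster/health?pretty'   # overall cluster and shard status
    curl -s 'http://localhost:9200/_cat/indices?v'            # per-index document counts and sizes
    curl -s 'http://localhost:9200/_cat/shards?v'             # shard-level allocation state
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'   # JVM heap usage (relevant to the circuit breaker)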
Restart did not fix it, but somehow looking at tasks works again.
SuccessfulKoala55 So what happens is that every time the cleanup_service runs (and for a while afterwards), clearml throws these kinds of errors.
It also seems like the deletion operation will slow down the server substantially.
Well, that depends on the amount of data registered - it might take Elastic time to reindex...
BTW, by cleanup service do you mean the cleanup code running in the agent-services?
It seems that for some reason not all shards (the pieces that make up the indices, which are where the data is stored) are up, but I have no idea why
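For what it's worth, the cluster health output above (1 node, 10 active primary shards, 10 unassigned shards, every index configured with 1 replica) suggests the unassigned shards are just the replica copies that have no second node to live on, which makes the cluster "yellow" but does not take any data offline. A quick way to verify this, as a sketch run inside the clearml-elastic container (standard Elasticsearch endpoints):
    curl -s 'http://localhost:9200/_cat/shards?v'                        # lists every shard and whether it is STARTED or UNASSIGNED
    curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty'   # explains why an unassigned shard cannot be allocated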
Maybe deletion happens "async" and is not reflected in some parts of clearml? It seems that if I try to delete often enough, at some point it is successful.
Might help me figure out if there's anything out of order...
Okay, it seems like it just takes some time for the deletion to complete and be reflected in the WebUI. So when I try to delete again, a deletion process is actually already running in the background.
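One way to confirm that a delete is still churning in the background, as a sketch: the event data is most likely removed with Elasticsearch delete-by-query operations, and those show up in ES's task management API (the endpoint is standard Elasticsearch; the action filter is an assumption about how the deletes are issued). Run inside the clearml-elastic container:
    curl -s 'http://localhost:9200/_tasks?detailed=true&actions=*byquery&pretty'   # lists delete-by-query operations still running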
I created a GitHub issue because the problem with the slow deletion still exists. https://github.com/allegroai/clearml/issues/586#issue-1142916619
ReassuredTiger98 is there any chance you're running two cleanup tasks at the same time by mistake?
Here is a part of the cleanup service log. Unfortunately, I cannot even download the full log currently, because the clearml-server will just throw errors for everything.