It is server version 1.0 and everything that came with it.
Hi ReassuredTiger98,
I think the first thing to do is to disable the cleanup service until we figure this out 🙂
Here is a part of the cleanup service log. Unfortunately, I cannot even download the full log currently, because the clearml-server will just throw errors for everything.
A restart did not fix it, but somehow looking at tasks works again.
Btw, can you bash into the elastic container and get some info?
Might help me figure out if there's anything out of order...
Maybe deletion happens "async" and is not reflected in parts of clearml? It seems that if I try to delete often enough, at some point it is successful.
It also seems like the deletion operation will slow down the server substantially.
Well, that depends on the amount of data registered - it might take Elastic time to reindex...
[root@dc01deffca35 elasticsearch]# curl
{ "cluster_name" : "clearml", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 10, "active_shards" : 10, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 10, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 50.0 } [root@dc01deffca35 elasticsearch]# curl
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b hVmpOK3jSTu70P2iq73gQg 1 1 3895575 1987186 2.7gb 2.7gb
yellow open events-plot- RGsBmP0ATm-eAcjmO7g07w 1 1 173 0 444.9kb 444.9kb
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 9KyNOaWDQrSEGur5EHmYng 1 1 379634665 123815996 69.9gb 69.9gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 UdUSFjRbTUm3ipUR7BYNrw 1 1 3996966 0 276.5mb 276.5mb
yellow open events-training_debug_image- yC84lTIcSeGuWDp1tTjCRw 1 1 189 0 78.2kb 78.2kb
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b izfS1NQSSQ-6unvT5crncA 1 1 21786 8358 8.5mb 8.5mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 KqoCxx9uQpmkyxFThq3-RQ 1 1 1560657 0 83.9mb 83.9mb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b Zg3yMULaQVCn7XXuGZnJHA 1 1 250 9026 125.2mb 125.2mb
yellow open events-log- 1rAf70nETguPJUQuk6NJsg 1 1 2215 0 602.9kb 602.9kb
yellow open events-training_stats_scalar- ZORZKCR_ROuzm_-LC7-IXw 1 1 7174 0 979.7kb 979.7kb
[root@dc01deffca35 elasticsearch]# curl
{ "error" : { "root_cause" : [ { "type" : "circuit_breaking_exception", "reason" : "[parent] Data too large, data for [<http_request>] would be [7944925456/7.3gb], which is larger than the limit of [7888427417/7.3gb], real usage: [7944925456/7.3gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]", "bytes_wanted" : 7944925456, "bytes_limit" : 7888427417, "durability" : "PERMANENT" } ], "type" : "circuit_breaking_exception", "reason" : "[parent] Data too large, data for [<http_request>] would be [7944925456/7.3gb], which is larger than the limit of [7888427417/7.3gb], real usage: [7944925456/7.3gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]", "bytes_wanted" : 7944925456, "bytes_limit" : 7888427417, "durability" : "PERMANENT" }, "status" : 429 } [root@dc01deffca35 elasticsearch]# curl
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8293445776/7.7gb], which is larger than the limit of [7888427417/7.3gb], real usage: [8293445776/7.7gb], new bytes reserved: [0/0b], usages [request=32880/32.1kb, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]","bytes_wanted":8293445776,"bytes_limit":7888427417,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8293445776/7.7gb], which is larger than the limit of [7888427417/7.3gb], real usage: [8293445776/7.7gb], new bytes reserved: [0/0b], usages [request=32880/32.1kb, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]","bytes_wanted":8293445776,"bytes_limit":7888427417,"durability":"PERMANENT"},"status":429}[root@dc01deffca35 elasticsearch]#
I restarted it after I got the errors, because as everyone knows, turning it off and on usually works 😄
ReassuredTiger98 is there any chance you're running two cleanup tasks at the same time by mistake?
ReassuredTiger98 would it be possible to receive the entire output of the cleanup script? It's stored as the log for the cleanup task
ReassuredTiger98 I see now that you're probably using an older version of the cleanup service
SuccessfulKoala55 So what happens is that whenever, or right after, the cleanup_service runs, clearml throws these kinds of errors.
Use:
docker exec -it clearml-elastic /bin/bash
and once inside, copy the output of each of the following commands:
curl
curl
curl
curl
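For anyone reading this later: the URLs were stripped from the pasted commands, but judging from the outputs shown above these look like the standard Elasticsearch diagnostic endpoints. A hedged reconstruction, assuming ES listens on the default localhost:9200 inside the container:

curl -s 'http://localhost:9200/_cluster/health?pretty'   # overall status and shard counts
curl -s 'http://localhost:9200/_cat/indices'              # health, doc count and size per index
curl -s 'http://localhost:9200/_cat/shards'               # shard-level allocation state
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'   # JVM heap usage, relevant to the 429 errors above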
ReassuredTiger98 it's strange - in the log I can see messages such as:
DEBUG Deleting Task id=<some-id> data folder <some-folder>
But I can't find the source of these messages in the ClearML examples/services/cleanup/cleanup_service.py file - are you using an older version?
Also, the current script should display messages such as Deleting <num> tasks - which I also don't see in the log...
Okay, it seems like it just takes some time to delete and to be reflected in the WebUI. So when I try to delete again, a deletion process is apparently already running in the background.
BTW, by cleanup service do you mean the cleanup code running in the agent-services?
Yea, the one script that is preinstalled.
It seems that for some reason not all shards (meaning indices, which are where the data is indexed) are up, but I have no idea why
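As a side note for readers: in the _cluster/health output above, the 10 unassigned shards are the replica copies - with a single data node a replica can never be allocated on the same node as its primary, which is exactly what the "yellow" status indicates. If running without replicas on a single-node cluster is acceptable, they can be dropped so the cluster goes green (a sketch assuming the default localhost:9200 endpoint; this does not address the heap/circuit-breaker errors):

# Sketch: set the replica count to 0 for all existing indices on a single-node
# cluster, so the unassigned replica shards disappear and the status turns green.
curl -s -X PUT 'http://localhost:9200/_all/_settings' \
     -H 'Content-Type: application/json' \
     -d '{"index": {"number_of_replicas": 0}}'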
I created a GitHub issue because the problem with the slow deletion still exists. https://github.com/allegroai/clearml/issues/586#issue-1142916619
However, deleting tasks gives me errors.
It could be that the clearml-server misbehaves either while cleanup is ongoing or even after it finishes.
OK, we'll take a look and get back to you 🙂