Infrastructure in k8s
but when I check the health of the cluster, I get a green status:
curl localhost:9200/_cluster/health
{"cluster_name":"clearml","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":41,"active_shards":41,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
What's interesting is that ClearML can delete new experiments without any problems,
but it doesn't want to remove old archived experiments.
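For reference, a green cluster health does not rule out index-level write blocks, which would still stop document deletes while reads keep working; a minimal sketch of how to check for them (assuming ES answers on localhost:9200, as in the health check above):

```bash
# List any index-level blocks (e.g. read_only_allow_delete) across all indices;
# filter_path trims the response to just the blocks settings.
curl -s "localhost:9200/_all/_settings?filter_path=*.settings.index.blocks.*&pretty"
```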
sure
First command output (curl -XGET):
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 xjVdUpdDReCv5g11c4IGFw 1 0 10248782 0 536.6mb 536.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 YuxjrptlTh2MlOCU7ykMkA 1 0 13177592 0 695.6mb 695.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 CXZ8edSSR_C3f-264gPSxw 1 0 17178186 0 891.8mb 891.8mb
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Urte-26hTRmm9syCc3lIGQ 1 0 37510243 6511399 12.8gb 12.8gb
green open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 70zX8fwURuyXdjHcc6TNaQ 1 0 374684303 24869857 51.4gb 51.4gb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 oY8hM0BUTP6Zki-krHkEJg 1 0 12258567 0 634.5mb 634.5mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 9FWIKsugQf2XF2asGkZcTA 1 0 10015124 0 513.9mb 513.9mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 5GouX7CiTqy0KnqLe-jGUQ 1 0 39513094 0 2.4gb 2.4gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 Nz8T5sd0QNW9dJQM0UoOnw 1 0 40993955 0 2.5gb 2.5gb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 aw6X3LPASLahZ-EMWSkYRA 1 0 15713573 0 807.5mb 807.5mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 Empmo9cdQ9eYqPiqVakAOA 1 0 39530759 0 2.4gb 2.4gb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 PfrlVBsRSHiBaB-C13AuFw 1 0 8801479 0 459.2mb 459.2mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 G9gsKlLqTLmSfFRIUKxhpA 1 0 12396061 0 640.1mb 640.1mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 vJ-XUAEfSbaUS-DlLz23Zg 1 0 37301997 0 2.2gb 2.2gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 981MwI1nT8KxQJ_Cjkb0uA 1 0 30484228 0 1.9gb 1.9gb
green open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b 2oiWS6VHRuuT6m9OtvOYIg 1 0 135153 56191 31.7mb 31.7mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 hW4mi0bDQA2S-jM5KXGILQ 1 0 4273551 0 245.4mb 245.4mb
green open .geoip_databases iYPbj6vsS0-Tm_PGo49UHw 1 0 41 41 38.9mb 38.9mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 5MS5I7fGRLGQgM3S8EbF1A 1 0 40349234 0 2.4gb 2.4gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 1C4QazTaTWyuo8HSNSzRmw 1 0 33531158 0 2gb 2gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 YPe4zRb7Q92DeaSSvTlGdg 1 0 32807469 0 1.9gb 1.9gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 hu3N2iQgRGC9xYQi84NCsw 1 0 17636277 0 1.1gb 1.1gb
green open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b l4BpBPIeRfyUfodRxIzRtg 1 0 43640 3967 95.6mb 95.6mb
Second command output:
index shard prirep state docs store ip node
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 0 p STARTED 39530759 2.4gb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 0 p STARTED 8801479 459.2mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 0 p STARTED 12396061 640.1mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 0 p STARTED 10015124 513.9mb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 0 p STARTED 32807469 1.9gb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 0 p STARTED 33531158 2gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.01.25-000004 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2021.12.14-000001 0 p STARTED elastic-ip clearml
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 374684303 51.4gb elastic-ip clearml
.ds-ilm-history-5-2022.06.12-000010 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 0 p STARTED 40349234 2.4gb elastic-ip clearml
events-plot-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 43640 95.6mb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 0 p STARTED 30484228 1.9gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.02.22-000006 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.04.05-000009 0 p STARTED elastic-ip clearml
.ds-ilm-history-5-2022.03.14-000004 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.04.19-000010 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2021.12.28-000002 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 0 p STARTED 39513094 2.4gb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 0 p STARTED 13177592 695.6mb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 0 p STARTED 17636637 1.1gb elastic-ip clearml
events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 135153 31.7mb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.15-000014 0 p STARTED elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 0 p STARTED 10248782 536.6mb elastic-ip clearml
events-log-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 37510244 12.8gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.03.08-000007 0 p STARTED elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 0 p STARTED 4273551 245.4mb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.02.08-000005 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.05.03-000011 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 0 p STARTED 37301997 2.2gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.05.31-000013 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.03.22-000008 0 p STARTED elastic-ip clearml
.ds-ilm-history-5-2022.04.13-000006 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 0 p STARTED 40993955 2.5gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.01.11-000003 0 p STARTED elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 0 p STARTED 15713573 807.5mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 0 p STARTED 12258567 634.5mb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.05.17-000012 0 p STARTED elastic-ip clearml
.geoip_databases 0 p STARTED 41 38.9mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 0 p STARTED 17178186 891.8mb elastic-ip clearml
.ds-ilm-history-5-2022.05.13-000008 0 p STARTED elastic-ip clearml
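For reference, index and shard listings like the two above can typically be produced with the _cat API; a sketch assuming the same local ES endpoint (these may not be the exact commands that were run here):

```bash
# Per-index overview: health, doc counts, store size
curl -s "localhost:9200/_cat/indices?v"
# Per-shard placement and state
curl -s "localhost:9200/_cat/shards?v"
```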
What are the env vars passed to ES in k8s?
I’ve tried with these two:
>>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id                               | name                                                       |
+----------------------------------+------------------------------------------------------------+
| 378c8e80c3dd4ff8901f04f00824acbd | ab-ai-767-easy                                             |
| c575db3f302441c6a977f52c060c135d | ab-ai-767-hard                                             |
This is the output for the first task, ab-ai-767-easy:
# curl -XGET ""
{
  "completed" : true,
  "task" : {
    "node" : "gjlBdFdETTqe3snnYbTcGQ",
    "id" : 9856290,
    "type" : "transport",
    "action" : "indices:data/write/delete/byquery",
    "status" : {
      "total" : 0, "updated" : 0, "created" : 0, "deleted" : 0, "batches" : 0,
      "version_conflicts" : 0, "noops" : 0,
      "retries" : { "bulk" : 0, "search" : 0 },
      "throttled_millis" : 0, "requests_per_second" : -1.0, "throttled_until_millis" : 0
    },
    "description" : "delete-by-query [events-*-d1bd92a3b039400cbafc60a7a5b1e52b]",
    "start_time_in_millis" : 1655723441902,
    "running_time_in_nanos" : 19219813692,
    "cancellable" : true,
    "cancelled" : false,
    "headers" : { }
  },
  "response" : {
    "took" : 19217, "timed_out" : false,
    "total" : 0, "updated" : 0, "created" : 0, "deleted" : 0, "batches" : 0,
    "version_conflicts" : 0, "noops" : 0,
    "retries" : { "bulk" : 0, "search" : 0 },
    "throttled" : "0s", "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until" : "0s", "throttled_until_millis" : 0,
    "failures" : [ ]
  }
}
And for the second one:
root@elasticsearch-7859849f67-8755p:/usr/share/elasticsearch# curl -XPOST -H "Content-Type: application/json" "" -d'{"query": {"term": {"task": "c575db3f302441c6a977f52c060c135d"}}}'
{"task":"gjlBdFdETTqe3snnYbTcGQ:9857749"}
root@elasticsearch-7859849f67-8755p:/usr/share/elasticsearch# curl -XGET ""
{
  "completed" : true,
  "task" : {
    "node" : "gjlBdFdETTqe3snnYbTcGQ",
    "id" : 9857749,
    "type" : "transport",
    "action" : "indices:data/write/delete/byquery",
    "status" : {
      "total" : 0, "updated" : 0, "created" : 0, "deleted" : 0, "batches" : 0,
      "version_conflicts" : 0, "noops" : 0,
      "retries" : { "bulk" : 0, "search" : 0 },
      "throttled_millis" : 0, "requests_per_second" : -1.0, "throttled_until_millis" : 0
    },
    "description" : "delete-by-query [events-*-d1bd92a3b039400cbafc60a7a5b1e52b]",
    "start_time_in_millis" : 1655723651286,
    "running_time_in_nanos" : 16276854116,
    "cancellable" : true,
    "cancelled" : false,
    "headers" : { }
  },
  "response" : {
    "took" : 16276, "timed_out" : false,
    "total" : 0, "updated" : 0, "created" : 0, "deleted" : 0, "batches" : 0,
    "version_conflicts" : 0, "noops" : 0,
    "retries" : { "bulk" : 0, "search" : 0 },
    "throttled" : "0s", "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until" : "0s", "throttled_until_millis" : 0,
    "failures" : [ ]
  }
}
But I still see these tasks in the web interface and in the output from the API,
although in the output above it looks like the delete tasks finished successfully: "completed" : true.
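For reference, any delete-by-query tasks that ES is still running can also be listed directly from the task management API; a sketch assuming ES on localhost:9200:

```bash
# Show in-flight delete-by-query tasks with their descriptions and runtimes
curl -s "localhost:9200/_tasks?actions=*byquery&detailed=true&pretty"
```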
- env:
    - name: bootstrap.memory_lock
      value: "true"
    - name: cluster.name
      value: clearml
    - name: cluster.routing.allocation.node_initial_primaries_recoveries
      value: "500"
    - name: cluster.routing.allocation.disk.watermark.low
      value: 500mb
    - name: cluster.routing.allocation.disk.watermark.high
      value: 500mb
    - name: cluster.routing.allocation.disk.watermark.flood_stage
      value: 500mb
    - name: discovery.zen.minimum_master_nodes
      value: "1"
    - name: discovery.type
      value: "single-node"
    - name: http.compression_level
      value: "1"
    - name: node.ingest
      value: "true"
    - name: node.name
      value: clearml
    - name: reindex.remote.whitelist
      value: '*.*'
    - name: xpack.monitoring.enabled
      value: "false"
    - name: xpack.security.enabled
      value: "false"
    - name: ES_JAVA_OPTS
      value: "-Xms8g -Xmx8g -Dlog4j2.formatMsgNoLookups=true"
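Given that the disk watermarks above are set to absolute 500mb values, it may be worth confirming what ES has actually applied; a sketch assuming ES on localhost:9200:

```bash
# Dump the effective cluster settings (including defaults) and pick out the disk watermarks
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" | grep watermark
```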
Yeah, we're constantly trying to improve that... 🙂
Recently the free space on the PV ran out and the cluster switched to read_only_allow_delete. I tried to remove old experiments, but it didn't help and I got the same error.
Then I changed the size of the PV and added an extra 50Gb
Looks like it helped and now the service is working, but I still get this bug.
Anyway, it would be very cool if the site had some additional information about troubleshooting and backups.
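For reference, once disk space has been freed, the read_only_allow_delete block usually has to be cleared explicitly (newer ES versions may remove it automatically once usage drops below the flood-stage watermark); a minimal sketch assuming ES on localhost:9200:

```bash
# Reset the read_only_allow_delete block on all indices
curl -XPUT -H "Content-Type: application/json" "localhost:9200/_all/_settings" \
  -d '{"index.blocks.read_only_allow_delete": null}'
```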
Yet the experiments have stopped normally: the body of the experiment says "Aborted", but at the same time I still see it on the dashboard.
and I still see this error in the logs:
[2022-06-20 13:24:27,777] [9] [WARNING] [elasticsearch] POST  [status:N/A request:60.060s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 341, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)
Developers complain that experiments hang in the Pending status for a long time
more than 10 minutes
value: "-Xms8g -Xmx8g -Dlog4j2.formatMsgNoLookups=true"
I would recommend using at least: value: "-Xms16g -Xmx16g -Dlog4j2.formatMsgNoLookups=true"
ResponsiveCamel97, can you send the output of: curl -XGET
and: curl -XGET
apiserver [2022-06-19 08:32:51,912] [10] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch-service. Connection pool size: 10
This is just a warning and can be disregarded - it only means an unused connection is discarded, nothing more.
Only when you try to delete these tasks?
What memory settings do you run ES with? How much memory and CPU is currently occupied by the ES container?
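A quick way to answer that in k8s would be something like the following sketch (requires metrics-server; the namespace, label and pod name are placeholders):

```bash
# Current CPU/memory usage of the Elasticsearch pod
kubectl top pod -n <namespace> -l app=elasticsearch
# Configured requests/limits for comparison
kubectl describe pod -n <namespace> <elasticsearch-pod> | grep -A 5 "Limits"
```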
I recovered the ES data from the backup
It helped.
At the moment ES has the following resources:
Limits:
  cpu:    2
  memory: 10G
Requests:
  cpu:    2
  memory: 10G
We launched ES with these parameters at the time of the problems
And adjusting the pod allocation accordingly
It seems your server has issues with the ES service; this should be general and not related to the delete itself. Can you try doing sudo docker ps?
Also, I tried to delete tasks via the API, like this:
```
>>> from clearml_agent import APIClient
>>> client = APIClient()
>>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id                               | name                                                       |
+----------------------------------+------------------------------------------------------------+
| 41cb804da24747abb362fb5ca0414fe6 | 15.0.95                                                    |
>>> client.tasks.delete('41cb804da24747abb362fb5ca0414fe6')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/clearml_agent/backend_api/session/client/client.py", line 374, in new_func
    return Response(self.session.send(request_cls(*args, **kwargs)))
  File "/usr/local/lib/python3.9/site-packages/clearml_agent/backend_api/session/client/client.py", line 122, in send
    raise APIError(result)
clearml_agent.backend_api.session.client.client.APIError: APIError: code 400/101: Invalid task id: id=41cb804da24747abb362fb5ca0414fe6, company=d1bd92a3b039400cbafc60a7a5b1e52b
```
But it doesn't work either.
Hi ResponsiveCamel97, the shards and indices stats look fine. Can you please try the async delete of the task data? You can run the following line in the shell inside the apiserver container. Just replace <task_id> with your actual task id:
curl -XPOST -H "Content-Type: application/json" "" -d'{"query": {"term": {"task": "<task_id>"}}}'
You should get in response something like this:
{"task":"p6350SG7STmQALxH-E3CLg:1426125"}
Then you can periodically ping ES on the status of the running operation:
curl -XGET "<copy here the ES task that you received above>"
Let's see how much time the async delete task will eventually take and what amount of data will be deleted
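A filled-in version of these two calls might look like the sketch below; the host is taken from the apiserver logs (elasticsearch-service:9200) and the index pattern from the delete-by-query description shown earlier, so treat both as assumptions:

```bash
# Submit an async delete of one task's event data; returns {"task":"<node_id>:<task_number>"}
curl -XPOST -H "Content-Type: application/json" \
  "http://elasticsearch-service:9200/events-*-d1bd92a3b039400cbafc60a7a5b1e52b/_delete_by_query?wait_for_completion=false" \
  -d'{"query": {"term": {"task": "<task_id>"}}}'

# Poll the returned ES task until it reports "completed" : true
curl -XGET "http://elasticsearch-service:9200/_tasks/<node_id>:<task_number>?pretty"
```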
Delete, reset
Looks like something is wrong with the index:
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 2.4h existing_store done n/a n/a 10.18.13.96 clearml n/a n/a 0 0 100.0% 238 0 0 100.0% 55032286631 959750 959750 100.0%
What really confuses me are the high recovery time and the translog_ops / translog_ops_recovered values.
We have the same ClearML in a staging environment for tests, and if we restart Elasticsearch there, everything is fine:
index shard time type stage source_host source_node target_host target_node repository snapshot files files_recovered files_percent files_total bytes bytes_recovered bytes_percent bytes_total translog_ops translog_ops_recovered translog_ops_percent
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 5s existing_store done n/a n/a 10.18.11.137 clearml n/a n/a 0 0 100.0% 253 0 0 100.0% 53429363732 0 0 100.0%
How can I solve this problem with the index without deleting it?
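For reference, recovery statistics like the two listings above can typically be pulled per index with the _cat/recovery API; a sketch assuming ES on localhost:9200:

```bash
# Recovery stage, timing and translog replay counts for the problematic index
curl -s "localhost:9200/_cat/recovery/events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b?v"
```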
And I see many of the following messages in the API logs:
apiserver [2022-06-19 08:32:51,912] [10] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch-service. Connection pool size: 10
And developers complain to me that they can't start experiments:
APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)))
Failed deleting old session ffaa2192fb9045359e7c9827ff5e1e55
APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)))
Failed deleting old session 63bd918c23d74108ae1c74a373435f01
The tasks themselves will stay until you succeed in deleting them from the client. Here we tried to see why deleting their data from ES timed out. From what I see, no data was actually deleted this time (most likely because the previous delete attempts did delete the data, even though they caused timeouts in the apiserver). What seems problematic is the amount of time each operation took (19 and 16 seconds). It may be due to insufficient memory/CPU allocation for the ES container, or due to the 50Gb index size.
I just hid the elastic IP in the second output.
OK, let's try
but it’s a lot of resources