Also I tried to delete tasks via the API, like this:

```
>>> from clearml_agent import APIClient
>>> client = APIClient()
>>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id                               | name                                                       |
+----------------------------------+------------------------------------------------------------+
| 41cb804da24747abb362fb5ca0414fe6 | 15.0.95                                                    |
>>> client.tasks.delete('41cb804da24747abb362fb5ca0414fe6')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/clearml_agent/backend_api/session/client/client.py", line 374, in new_func
    return Response(self.session.send(request_cls(*args, **kwargs)))
  File "/usr/local/lib/python3.9/site-packages/clearml_agent/backend_api/session/client/client.py", line 122, in send
    raise APIError(result)
clearml_agent.backend_api.session.client.client.APIError: APIError: code 400/101: Invalid task id: id=41cb804da24747abb362fb5ca0414fe6, company=d1bd92a3b039400cbafc60a7a5b1e52b
```

But it doesn't work either.
It seems your server has issues with the ES service; this should be general and not related to the delete itself. Can you try running `sudo docker ps`?
Infrastructure in k8s
But when I check the health of the cluster, I get a green status:

```
curl localhost:9200/_cluster/health
{"cluster_name":"clearml","status":"green","timed_out":false,"number_of_nodes":1,"number_of_data_nodes":1,"active_primary_shards":41,"active_shards":41,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}
```
Recently the PV ran out of free space and the cluster switched to read_only_allow_delete. I tried removing old experiments, but it didn't help and I got the same error.
Then I resized the PV and added an extra 50Gb.
It looks like that helped and the service is now working, but I still get this bug.
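For anyone hitting the same watermark situation, here is a minimal sketch (assumptions: ES reachable at localhost:9200 and Python requests available; not taken from this thread) of how one could check for and clear a lingering read_only_allow_delete block after freeing disk space. Older ES versions do not lift this block automatically once disk usage drops back below the flood-stage watermark.

```python
# Hedged sketch: find and clear the index.blocks.read_only_allow_delete block.
# Assumptions: ES reachable at localhost:9200 (e.g. via port-forward or from inside the pod).
import requests

ES = "http://localhost:9200"

# List indices that still carry the read-only-allow-delete block
settings = requests.get(
    f"{ES}/_all/_settings/index.blocks.read_only_allow_delete"
).json()
blocked = [
    index
    for index, body in settings.items()
    if body.get("settings", {})
           .get("index", {})
           .get("blocks", {})
           .get("read_only_allow_delete") == "true"
]
print("blocked indices:", blocked)

# Setting the value to null removes the block (a no-op if nothing is blocked)
resp = requests.put(
    f"{ES}/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
print(resp.json())  # expect {"acknowledged": true}
```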
Only when you try to delete these tasks?
Delete, reset
Looks like something is wrong with the index. Its recovery stats:

```
index:         events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
shard:         0
time:          2.4h
type:          existing_store
stage:         done
source_host:   n/a            source_node: n/a
target_host:   10.18.13.96    target_node: clearml
repository:    n/a            snapshot: n/a
files:         0        files_recovered: 0        files_percent: 100.0%   files_total: 238
bytes:         0        bytes_recovered: 0        bytes_percent: 100.0%   bytes_total: 55032286631
translog_ops:  959750   translog_ops_recovered: 959750   translog_ops_percent: 100.0%
```
What confuses me most are the high recovery time and the translog_ops / translog_ops_recovered values.
We have the same ClearML in a staging environment for tests, and if I restart Elasticsearch for that ClearML, everything is fine:

```
index:         events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b
shard:         0
time:          5s
type:          existing_store
stage:         done
source_host:   n/a            source_node: n/a
target_host:   10.18.11.137   target_node: clearml
repository:    n/a            snapshot: n/a
files:         0   files_recovered: 0   files_percent: 100.0%   files_total: 253
bytes:         0   bytes_recovered: 0   bytes_percent: 100.0%   bytes_total: 53429363732
translog_ops:  0   translog_ops_recovered: 0   translog_ops_percent: 100.0%
```
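For reference, a small sketch (assuming ES is reachable at localhost:9200 and Python requests is available) of pulling the same recovery stats programmatically via the standard _cat/recovery API, to flag shards whose recovery had to replay a large translog:

```python
# Hedged sketch: summarize shard recovery stats, similar to the output pasted above.
# Assumption: ES reachable at localhost:9200.
import requests

ES = "http://localhost:9200"

# format=json returns the same columns as the human-readable _cat output
recoveries = requests.get(f"{ES}/_cat/recovery", params={"format": "json"}).json()

for rec in recoveries:
    # Highlight shards whose recovery replayed translog operations
    ops = int(rec.get("translog_ops") or 0)
    if ops > 0:
        print(
            f"{rec['index']} shard {rec['shard']}: stage={rec['stage']} "
            f"time={rec['time']} translog_ops={rec['translog_ops']} "
            f"recovered={rec['translog_ops_recovered']}"
        )
```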
How can I solve this problem with the index without deleting it?
And there are many of the following warnings in the apiserver logs:

```
apiserver [2022-06-19 08:32:51,912] [10] [WARNING] [urllib3.connectionpool] Connection pool is full, discarding connection: elasticsearch-service. Connection pool size: 10
```
This is just a warning and can be disregarded - it only means an unused connection is discarded, nothing more.
ResponsiveCamel97, can you send the output of: `curl -XGET ...`
and: `curl -XGET ...`
sure
First command output (`curl -XGET ...`):

```
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 xjVdUpdDReCv5g11c4IGFw 1 0 10248782 0 536.6mb 536.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 YuxjrptlTh2MlOCU7ykMkA 1 0 13177592 0 695.6mb 695.6mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 CXZ8edSSR_C3f-264gPSxw 1 0 17178186 0 891.8mb 891.8mb
green open events-log-d1bd92a3b039400cbafc60a7a5b1e52b Urte-26hTRmm9syCc3lIGQ 1 0 37510243 6511399 12.8gb 12.8gb
green open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 70zX8fwURuyXdjHcc6TNaQ 1 0 374684303 24869857 51.4gb 51.4gb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 oY8hM0BUTP6Zki-krHkEJg 1 0 12258567 0 634.5mb 634.5mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 9FWIKsugQf2XF2asGkZcTA 1 0 10015124 0 513.9mb 513.9mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 5GouX7CiTqy0KnqLe-jGUQ 1 0 39513094 0 2.4gb 2.4gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 Nz8T5sd0QNW9dJQM0UoOnw 1 0 40993955 0 2.5gb 2.5gb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 aw6X3LPASLahZ-EMWSkYRA 1 0 15713573 0 807.5mb 807.5mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 Empmo9cdQ9eYqPiqVakAOA 1 0 39530759 0 2.4gb 2.4gb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 PfrlVBsRSHiBaB-C13AuFw 1 0 8801479 0 459.2mb 459.2mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 G9gsKlLqTLmSfFRIUKxhpA 1 0 12396061 0 640.1mb 640.1mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 vJ-XUAEfSbaUS-DlLz23Zg 1 0 37301997 0 2.2gb 2.2gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 981MwI1nT8KxQJ_Cjkb0uA 1 0 30484228 0 1.9gb 1.9gb
green open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b 2oiWS6VHRuuT6m9OtvOYIg 1 0 135153 56191 31.7mb 31.7mb
green open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 hW4mi0bDQA2S-jM5KXGILQ 1 0 4273551 0 245.4mb 245.4mb
green open .geoip_databases iYPbj6vsS0-Tm_PGo49UHw 1 0 41 41 38.9mb 38.9mb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 5MS5I7fGRLGQgM3S8EbF1A 1 0 40349234 0 2.4gb 2.4gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 1C4QazTaTWyuo8HSNSzRmw 1 0 33531158 0 2gb 2gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 YPe4zRb7Q92DeaSSvTlGdg 1 0 32807469 0 1.9gb 1.9gb
green open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 hu3N2iQgRGC9xYQi84NCsw 1 0 17636277 0 1.1gb 1.1gb
green open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b l4BpBPIeRfyUfodRxIzRtg 1 0 43640 3967 95.6mb 95.6mb
```
Second command output:

```
index shard prirep state docs store ip node
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 0 p STARTED 39530759 2.4gb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 0 p STARTED 8801479 459.2mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 0 p STARTED 12396061 640.1mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 0 p STARTED 10015124 513.9mb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-05 0 p STARTED 32807469 1.9gb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 0 p STARTED 33531158 2gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.01.25-000004 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2021.12.14-000001 0 p STARTED elastic-ip clearml
events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 374684303 51.4gb elastic-ip clearml
.ds-ilm-history-5-2022.06.12-000010 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-03 0 p STARTED 40349234 2.4gb elastic-ip clearml
events-plot-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 43640 95.6mb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-12 0 p STARTED 30484228 1.9gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.02.22-000006 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.04.05-000009 0 p STARTED elastic-ip clearml
.ds-ilm-history-5-2022.03.14-000004 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.04.19-000010 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2021.12.28-000002 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 0 p STARTED 39513094 2.4gb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-11 0 p STARTED 13177592 695.6mb elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 0 p STARTED 17636637 1.1gb elastic-ip clearml
events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 135153 31.7mb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.06.15-000014 0 p STARTED elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-10 0 p STARTED 10248782 536.6mb elastic-ip clearml
events-log-d1bd92a3b039400cbafc60a7a5b1e52b 0 p STARTED 37510244 12.8gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.03.08-000007 0 p STARTED elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-06 0 p STARTED 4273551 245.4mb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.02.08-000005 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.05.03-000011 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 0 p STARTED 37301997 2.2gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.05.31-000013 0 p STARTED elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.03.22-000008 0 p STARTED elastic-ip clearml
.ds-ilm-history-5-2022.04.13-000006 0 p STARTED elastic-ip clearml
worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 0 p STARTED 40993955 2.5gb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.01.11-000003 0 p STARTED elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-02 0 p STARTED 15713573 807.5mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-04 0 p STARTED 12258567 634.5mb elastic-ip clearml
.ds-.logs-deprecation.elasticsearch-default-2022.05.17-000012 0 p STARTED elastic-ip clearml
.geoip_databases 0 p STARTED 41 38.9mb elastic-ip clearml
queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2022-01 0 p STARTED 17178186 891.8mb elastic-ip clearml
.ds-ilm-history-5-2022.05.13-000008 0 p STARTED elastic-ip clearml
```
I just hid the Elasticsearch IP (shown as elastic-ip) in the second output.
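In case it helps, a sketch (assuming ES is reachable at localhost:9200 and Python requests is available) of pulling the same index listing programmatically and sorting it by size, to quickly spot the largest event indices such as events-training_stats_scalar-*:

```python
# Hedged sketch: list indices sorted by primary store size (largest first).
# Assumption: ES reachable at localhost:9200.
import requests

ES = "http://localhost:9200"

# bytes=b makes the size columns plain integers instead of strings like "51.4gb"
indices = requests.get(
    f"{ES}/_cat/indices",
    params={
        "format": "json",
        "bytes": "b",
        "h": "index,docs.count,docs.deleted,pri.store.size",
    },
).json()

for idx in sorted(indices, key=lambda i: int(i["pri.store.size"] or 0), reverse=True)[:10]:
    size_gb = int(idx["pri.store.size"]) / 1024 ** 3
    print(f"{idx['index']}: {size_gb:.1f} GB, docs={idx['docs.count']}, deleted={idx['docs.deleted']}")
```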
Hi ResponsiveCamel97, the shards and indices stats look fine. Can you please try the async delete of the task data? You can run the following line in the shell inside the apiserver container. Just replace <task_id> with your actual task id:

```
curl -XPOST -H "Content-Type: application/json" "..." -d'{"query": {"term": {"task": "<task_id>"}}}'
```

You should get a response like this:

```
{"task":"p6350SG7STmQALxH-E3CLg:1426125"}
```

Then you can periodically ping ES for the status of the running operation:

```
curl -XGET "... <copy here the ES task that you received above>"
```
Let's see how much time the async delete task will eventually take and what amount of data will be deleted
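To make the flow above concrete, here is a sketch under stated assumptions (ES reachable at localhost:9200 from inside the container, events indices matching the events-*-d1bd92a3b039400cbafc60a7a5b1e52b pattern seen in the outputs); it is an illustration of the standard delete-by-query and tasks APIs, not an exact copy of the commands above:

```python
# Hedged sketch of the async delete-by-query + task polling flow described above.
import time
import requests

ES = "http://localhost:9200"
CLEARML_TASK_ID = "<task_id>"  # replace with the actual ClearML task id

# Submit the delete asynchronously; ES returns a task handle instead of waiting
resp = requests.post(
    f"{ES}/events-*-d1bd92a3b039400cbafc60a7a5b1e52b/_delete_by_query",
    params={"wait_for_completion": "false"},
    json={"query": {"term": {"task": CLEARML_TASK_ID}}},
)
es_task = resp.json()["task"]  # e.g. "p6350SG7STmQALxH-E3CLg:1426125"

# Poll the tasks API until the delete-by-query completes
while True:
    status = requests.get(f"{ES}/_tasks/{es_task}").json()
    if status.get("completed"):
        print(status["response"])  # took, deleted, failures, ...
        break
    time.sleep(5)
```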
I've tried with these two:

```
>>> client.tasks.get_all(system_tags=["archived"])
+----------------------------------+------------------------------------------------------------+
| id                               | name                                                       |
+----------------------------------+------------------------------------------------------------+
| 378c8e80c3dd4ff8901f04f00824acbd | ab-ai-767-easy                                             |
| c575db3f302441c6a977f52c060c135d | ab-ai-767-hard                                             |
```
This is the output for the first task, ab-ai-767-easy:

```
# curl -XGET "..."
{
  "completed" : true,
  "task" : {
    "node" : "gjlBdFdETTqe3snnYbTcGQ",
    "id" : 9856290,
    "type" : "transport",
    "action" : "indices:data/write/delete/byquery",
    "status" : {
      "total" : 0,
      "updated" : 0,
      "created" : 0,
      "deleted" : 0,
      "batches" : 0,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : { "bulk" : 0, "search" : 0 },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : "delete-by-query [events-*-d1bd92a3b039400cbafc60a7a5b1e52b]",
    "start_time_in_millis" : 1655723441902,
    "running_time_in_nanos" : 19219813692,
    "cancellable" : true,
    "cancelled" : false,
    "headers" : { }
  },
  "response" : {
    "took" : 19217,
    "timed_out" : false,
    "total" : 0,
    "updated" : 0,
    "created" : 0,
    "deleted" : 0,
    "batches" : 0,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : { "bulk" : 0, "search" : 0 },
    "throttled" : "0s",
    "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until" : "0s",
    "throttled_until_millis" : 0,
    "failures" : [ ]
  }
}
```
And for the second:

```
root@elasticsearch-7859849f67-8755p:/usr/share/elasticsearch# curl -XPOST -H "Content-Type: application/json" "..." -d'{"query": {"term": {"task": "c575db3f302441c6a977f52c060c135d"}}}'
{"task":"gjlBdFdETTqe3snnYbTcGQ:9857749"}
root@elasticsearch-7859849f67-8755p:/usr/share/elasticsearch# curl -XGET "..."
{
  "completed" : true,
  "task" : {
    "node" : "gjlBdFdETTqe3snnYbTcGQ",
    "id" : 9857749,
    "type" : "transport",
    "action" : "indices:data/write/delete/byquery",
    "status" : {
      "total" : 0,
      "updated" : 0,
      "created" : 0,
      "deleted" : 0,
      "batches" : 0,
      "version_conflicts" : 0,
      "noops" : 0,
      "retries" : { "bulk" : 0, "search" : 0 },
      "throttled_millis" : 0,
      "requests_per_second" : -1.0,
      "throttled_until_millis" : 0
    },
    "description" : "delete-by-query [events-*-d1bd92a3b039400cbafc60a7a5b1e52b]",
    "start_time_in_millis" : 1655723651286,
    "running_time_in_nanos" : 16276854116,
    "cancellable" : true,
    "cancelled" : false,
    "headers" : { }
  },
  "response" : {
    "took" : 16276,
    "timed_out" : false,
    "total" : 0,
    "updated" : 0,
    "created" : 0,
    "deleted" : 0,
    "batches" : 0,
    "version_conflicts" : 0,
    "noops" : 0,
    "retries" : { "bulk" : 0, "search" : 0 },
    "throttled" : "0s",
    "throttled_millis" : 0,
    "requests_per_second" : -1.0,
    "throttled_until" : "0s",
    "throttled_until_millis" : 0,
    "failures" : [ ]
  }
}
```
But I still see these tasks in the web interface, and I still see them in the output from the API.
Although in the output above it says that these operations finished successfully: `"completed" : true`
And developers complain to me that they can't start experiments:

```
APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)))
Failed deleting old session ffaa2192fb9045359e7c9827ff5e1e55
APIError: code 500/100: General data error (ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)))
Failed deleting old session 63bd918c23d74108ae1c74a373435f01
```
The tasks themselves will stay until you succeed in deleting them from the client. Here we tried to see why deleting their data from ES timed out. From what I can see, no data was actually deleted (most likely because the previous delete attempts did delete the data, even though they caused a timeout in the apiserver). What seems problematic is the amount of time each operation took (19 and 16 seconds). It may be due to insufficient memory/CPU allocation for the ES container, or due to the 50Gb index size.
What memory settings do you run ES with? How much memory and CPU is currently occupied by the ES container?
At the moment ES has the following resources:

```
Limits:
  cpu:     2
  memory:  10G
Requests:
  cpu:     2
  memory:  10G
```
We launched ES with these parameters at the time of the problems
Developers complain that experiments hang in the Pending status for a long time
more than 10 minutes
What are the env vars passed to ES in k8s?
```
- env:
  - name: bootstrap.memory_lock
    value: "true"
  - name: cluster.name
    value: clearml
  - name: cluster.routing.allocation.node_initial_primaries_recoveries
    value: "500"
  - name: cluster.routing.allocation.disk.watermark.low
    value: 500mb
  - name: cluster.routing.allocation.disk.watermark.high
    value: 500mb
  - name: cluster.routing.allocation.disk.watermark.flood_stage
    value: 500mb
  - name: discovery.zen.minimum_master_nodes
    value: "1"
  - name: discovery.type
    value: "single-node"
  - name: http.compression_level
    value: "1"
  - name: node.ingest
    value: "true"
  - name: node.name
    value: clearml
  - name: reindex.remote.whitelist
    value: '*.*'
  - name: xpack.monitoring.enabled
    value: "false"
  - name: xpack.security.enabled
    value: "false"
  - name: ES_JAVA_OPTS
    value: "-Xms8g -Xmx8g -Dlog4j2.formatMsgNoLookups=true"
```
value: "-Xms8g -Xmx8g -Dlog4j2.formatMsgNoLookups=true"
I would recommend using at least value: "-Xms16g -Xmx16g -Dlog4j2.formatMsgNoLookups=true"
And adjusting the pod allocation accordingly
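A small sketch (assuming ES is reachable at localhost:9200 and Python requests is available) for verifying that the new heap size actually took effect and for keeping an eye on heap pressure after the change:

```python
# Hedged sketch: report JVM heap max and current heap usage per ES node.
# Assumption: ES reachable at localhost:9200.
import requests

ES = "http://localhost:9200"

stats = requests.get(f"{ES}/_nodes/stats/jvm").json()
for node_id, node in stats["nodes"].items():
    mem = node["jvm"]["mem"]
    max_gb = mem["heap_max_in_bytes"] / 1024 ** 3
    print(
        f"{node.get('name', node_id)}: heap_max={max_gb:.1f} GB, "
        f"heap_used={mem['heap_used_percent']}%"
    )
```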
OK, let's try
but it’s a lot of resources
What's interesting is that ClearML can delete new experiments without any problems,
but it doesn't want to remove old archived experiments.
And I still see this error in the logs:

```
[2022-06-20 13:24:27,777] [9] [WARNING] [elasticsearch] POST ... [status:N/A request:60.060s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 449, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 444, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1346, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 307, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 268, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib64/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 786, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 451, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 341, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch-service', port='9200'): Read timed out. (read timeout=60)
```
Yet the experiments did stop normally. The experiment itself shows as aborted, but at the same time I still see it on the dashboard.
I recovered the ES data from the backup
It helped.
Anyway, it would be very cool to have additional information about troubleshooting and backups on the site.
Yeah, we're constantly trying to improve that... 🙂