Answered
I Suddenly Get

I suddenly get Error 100 : General data error (TransportError(503, 'search_phase_execution_exception')) when trying to access Results --> Scalars. Any idea why?

Maybe relevant parts from the apiserver.log:
[2021-05-26 17:40:35,194] [9] [ERROR] [clearml.service_repo] Returned 500 for events.scalar_metrics_iter_histogram in 18ms, msg=General data error (TransportError(503, 'search_phase_execution_exception'))
[2021-05-26 17:40:34,335] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 60012ms, msg=General data error: err=('28 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'a2e63a763c4068504fd2d26665ec644f', 'status': 503,..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [28] requests and a refresh]
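
For context, the "primary shard is not active" part points at Elasticsearch rather than at ClearML itself, so a reasonable first step is to look at the cluster state directly. A minimal sketch, assuming the default ClearML docker setup where the Elasticsearch container is named clearml-elastic and listens on port 9200:

docker exec clearml-elastic curl -s "localhost:9200/_cluster/health?pretty"
docker exec clearml-elastic curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"

The first call shows the overall cluster status, the second lists every shard together with the reason Elasticsearch gives for leaving it unassigned.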

  
  
Posted 2 years ago

Answers 30


I created a GitHub issue because the problem with the slow deletion still exists. https://github.com/allegroai/clearml/issues/586#issue-1142916619

  
  
Posted 2 years ago

OK, we'll take a look and get back to you 🙂

  
  
Posted 2 years ago

That's usually true 😄

  
  
Posted 2 years ago

Is it working now?

  
  
Posted 2 years ago

It seems that for some reason not all shards (the pieces an index is split into, where the data is actually stored) are up, but I have no idea why.
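
If it helps, Elasticsearch can be asked directly why it refuses to assign a shard. A sketch, again assuming the clearml-elastic container and the default port 9200:

docker exec clearml-elastic curl -s "localhost:9200/_cluster/allocation/explain?pretty"

Called without a request body it explains an arbitrary unassigned shard, including whether allocation is blocked by disk watermarks, a missing node, or allocation settings.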

  
  
Posted 2 years ago

Restart did not fix it, but somehow looking at tasks works again.

  
  
Posted 2 years ago

However, deleting tasks gives me errors.

  
  
Posted 2 years ago

Maybe deletion happens "async" and is not reflected in parts of clearml? It seems that if I try to delete often enough, at some point it is successful.

  
  
Posted 2 years ago

Okay, it seems like the deletion just takes some time to complete and to be reflected in the WebUI. So when I try to delete again, a deletion process is apparently already running in the background.

  
  
Posted 2 years ago

Well, that depends on the amount of data registered - it might take Elastic time to reindex...
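
One way to check whether such a delete is still running inside Elasticsearch is the tasks API. A sketch, assuming the clearml-elastic container and the default port 9200; the actions filter is only a guess at how the delete operations are named:

docker exec clearml-elastic curl -s "localhost:9200/_tasks?detailed=true&actions=*delete*&pretty"

If the cleanup is still churning through events, the matching tasks should show up here with their running time.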

  
  
Posted 2 years ago

It also seems like the deletion operation will slow down the server substantially.

  
  
Posted 2 years ago

Btw, can you bash into the elastic container and get some info?

  
  
Posted 2 years ago

[root@dc01deffca35 elasticsearch]# curl
{ "cluster_name" : "clearml", "status" : "yellow", "timed_out" : false, "number_of_nodes" : 1, "number_of_data_nodes" : 1, "active_primary_shards" : 10, "active_shards" : 10, "relocating_shards" : 0, "initializing_shards" : 0, "unassigned_shards" : 10, "delayed_unassigned_shards" : 0, "number_of_pending_tasks" : 0, "number_of_in_flight_fetch" : 0, "task_max_waiting_in_queue_millis" : 0, "active_shards_percent_as_number" : 50.0 }

[root@dc01deffca35 elasticsearch]# curl
yellow open events-log-d1bd92a3b039400cbafc60a7a5b1e52b hVmpOK3jSTu70P2iq73gQg 1 1 3895575 1987186 2.7gb 2.7gb
yellow open events-plot- RGsBmP0ATm-eAcjmO7g07w 1 1 173 0 444.9kb 444.9kb
yellow open events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b 9KyNOaWDQrSEGur5EHmYng 1 1 379634665 123815996 69.9gb 69.9gb
yellow open worker_stats_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 UdUSFjRbTUm3ipUR7BYNrw 1 1 3996966 0 276.5mb 276.5mb
yellow open events-training_debug_image- yC84lTIcSeGuWDp1tTjCRw 1 1 189 0 78.2kb 78.2kb
yellow open events-training_debug_image-d1bd92a3b039400cbafc60a7a5b1e52b izfS1NQSSQ-6unvT5crncA 1 1 21786 8358 8.5mb 8.5mb
yellow open queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2021-05 KqoCxx9uQpmkyxFThq3-RQ 1 1 1560657 0 83.9mb 83.9mb
yellow open events-plot-d1bd92a3b039400cbafc60a7a5b1e52b Zg3yMULaQVCn7XXuGZnJHA 1 1 250 9026 125.2mb 125.2mb
yellow open events-log- 1rAf70nETguPJUQuk6NJsg 1 1 2215 0 602.9kb 602.9kb
yellow open events-training_stats_scalar- ZORZKCR_ROuzm_-LC7-IXw 1 1 7174 0 979.7kb 979.7kb

[root@dc01deffca35 elasticsearch]# curl
{ "error" : { "root_cause" : [ { "type" : "circuit_breaking_exception", "reason" : "[parent] Data too large, data for [<http_request>] would be [7944925456/7.3gb], which is larger than the limit of [7888427417/7.3gb], real usage: [7944925456/7.3gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]", "bytes_wanted" : 7944925456, "bytes_limit" : 7888427417, "durability" : "PERMANENT" } ], "type" : "circuit_breaking_exception", "reason" : "[parent] Data too large, data for [<http_request>] would be [7944925456/7.3gb], which is larger than the limit of [7888427417/7.3gb], real usage: [7944925456/7.3gb], new bytes reserved: [0/0b], usages [request=0/0b, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]", "bytes_wanted" : 7944925456, "bytes_limit" : 7888427417, "durability" : "PERMANENT" }, "status" : 429 }

[root@dc01deffca35 elasticsearch]# curl
{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8293445776/7.7gb], which is larger than the limit of [7888427417/7.3gb], real usage: [8293445776/7.7gb], new bytes reserved: [0/0b], usages [request=32880/32.1kb, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]","bytes_wanted":8293445776,"bytes_limit":7888427417,"durability":"PERMANENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [8293445776/7.7gb], which is larger than the limit of [7888427417/7.3gb], real usage: [8293445776/7.7gb], new bytes reserved: [0/0b], usages [request=32880/32.1kb, fielddata=71795/70.1kb, in_flight_requests=20658132/19.7mb, accounting=158823383/151.4mb]","bytes_wanted":8293445776,"bytes_limit":7888427417,"durability":"PERMANENT"},"status":429}
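
The two circuit_breaking_exception responses (status 429, parent breaker tripping at a ~7.3gb limit, which is roughly 95% of the configured heap) mean the Elasticsearch JVM heap is effectively full, which would also explain the 503s from the apiserver. One common mitigation is to give Elasticsearch a larger heap, if the machine has RAM to spare. A sketch only, assuming the standard ClearML server layout where docker-compose.yml lives under /opt/clearml and sets the heap through ES_JAVA_OPTS on the elasticsearch service (the 10g below is just a placeholder value):

# In /opt/clearml/docker-compose.yml, under the elasticsearch service's environment section:
#   ES_JAVA_OPTS: -Xms10g -Xmx10g    # hypothetical value; keep -Xms and -Xmx equal
# then recreate the stack:
docker-compose -f /opt/clearml/docker-compose.yml down
docker-compose -f /opt/clearml/docker-compose.yml up -d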

  
  
Posted 2 years ago

SuccessfulKoala55 So what happens is that whenever (or right after) the cleanup_service runs, clearml throws these kinds of errors.

  
  
Posted 2 years ago

It could be that the clearml-server misbehaves while the cleanup is ongoing, or even after it has finished.

  
  
Posted 2 years ago

BTW, by cleanup service do you mean the cleanup code running in the agent-services?

  
  
Posted 2 years ago

Yea, the one script that is preinstalled.

  
  
Posted 2 years ago

Here it is:

  
  
Posted 2 years ago

ReassuredTiger98 it's strange - in the log I can see messages such as:
DEBUG Deleting Task id=<some-id> data folder <some-folder>
But I can't find the source of these messages in the ClearML examples/services/cleanup/cleanup_service.py file - are you using an older version?
Also, the current script should display messages such as Deleting <num> tasks - which I also don't see in the log...

  
  
Posted 2 years ago

ReassuredTiger98 is there any chance you're running two cleanup tasks at the same time by mistake?

  
  
Posted 2 years ago

I restarted it after I got the errors, because as everyone knows, turning it off and on usually works 😄

  
  
Posted 2 years ago

Use:
docker exec -it clearml-elastic /bin/bash
and once inside, copy the output of each of the following commands:
curl
curl
curl
curl
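
For reference, two further checks that tend to be useful with the circuit-breaker errors shown above (assuming Elasticsearch is listening on its default port 9200 inside the container):

curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent"
curl -s "localhost:9200/_nodes/stats/jvm?pretty"

Both show how close the JVM heap is to its limit, which is what the parent circuit breaker trips on.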

  
  
Posted 2 years ago

Might help me figure out if there's anything out of order...

  
  
Posted 2 years ago

Hi ReassuredTiger98 ,
I think the first thing to do is to disable the cleanup service, until we figure this out 🙂

  
  
Posted 2 years ago

Thanks for the logs!

  
  
Posted 2 years ago

ReassuredTiger98 would it be possible to receive the entire output of the cleanup script? It's stored as the log for the cleanup task

  
  
Posted 2 years ago

ReassuredTiger98 I see now that you're probably using an older version of the cleanup service

  
  
Posted 2 years ago

It is server version 1.0 and everything that came with it.

  
  
Posted 2 years ago

No, it is only a single one.

  
  
Posted 2 years ago

Here is a part of the cleanup service log. Unfortunately, I cannot even download the full log currently, because the clearml-server will just throw errors for everything.

  
  
Posted 2 years ago