
Hey! I stumbled upon some errors with my workers monitoring.
I checked logs in my k8s pods for apiserver and elasticsearch and it seems the problem is there. These are the logs:
Apiserver logs
[2021-04-23 06:19:50,209] [9] [ERROR] [trains.service_repo] Returned 500 for workers.get_activity_report in 4059ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [worker] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))
[2021-04-23 06:19:50,211] [9] [ERROR] [trains.service_repo] Returned 500 for workers.get_activity_report in 4059ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [worker] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))
[2021-04-23 06:19:59,024] [9] [ERROR] [trains.service_repo] Returned 500 for workers.get_stats in 13ms, msg=General data error (RequestError(400, 'search_phase_execution_exception', 'Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [worker] in order to load field data by uninverting the inverted index. Note that this can use significant memory.'))

Elasticsearch logs
"at java.lang.Thread.run(Thread.java:830) [?:?]",
"Caused by: java.lang.IllegalArgumentException: Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [worker] in order to load field data by uninverting the inverted index. Note that this can use significant memory.",

I recently changed the settings for the elasticsearch VM and set ES_JAVA_OPTS from "-Xms2g -Xmx2g" to "-Xms1g -Xmx1g", but I don't think this should be a problem. I also mounted azurefiles as the data volume for elasticsearch. I don't know when the problem appeared, and I am wondering whether you might know what the problem is. Maybe there is some mismatch between API versions? I get these errors from the API request:
{"meta":{"id":"d729564d050d407a86d8579dbafab0c1","trx":"d729564d050d407a86d8579dbafab0c1","endpoint":{"name":"workers.get_activity_report","requested_version":"2.12","actual_version":"2.4"},"result_code":500,"result_subcode":0,"result_msg":"name 'OrderedDict' is not defined","error_stack":null,"error_data":{}},"data":{}}
Thanks in advance.

  				
Posted 3 years ago by GreasyPenguin66


Answers 15

Hi SuccessfulKoala55, thanks for the response. For elastic I am using the image http://docker.elastic.co/elasticsearch/elasticsearch:7.6.2, the one that is in the manifests in the clearml repo. As for the clearml images, I am using the latest tags everywhere. Let me restore the VM settings for elastic and I'll let you know ;)

  				
Posted 3 years ago by GreasyPenguin66

"I had some problems previously when changing something in apiserver forced me to redeploy everything in order for clearml to work properly. And I am wondering whether you have maybe some guidelines for that."

Can you elaborate on that? Basically, the apiserver can be restarted while the other components (including mongodb, elastic and redis) stay up.

  				
Posted 3 years ago by SuccessfulKoala55

Hey SuccessfulKoala55, thank you for your answers, I really appreciate it. As for elasticsearch, it was indeed the index that had been created beforehand. The reason is that I was trying to set up a backup for elasticsearch and mongodb using azurefiles. The scenario is that I'm using persistent volumes on k8s that use Azure file shares as storage. I can then rebuild my cluster and use the exact same storage, so that the data is persistent and I can restore my application from its last state. I noticed a while ago that when trying to set up clearml from scratch using the backup data, the apiserver failed on the mongodb migrations. Now, after the issues with elasticsearch, I know this concerns elasticsearch as well. Therefore my question is: is there any way to reinitialize clearml while using the backup data for the services? Thank you in advance.

  				
Posted 3 years ago by GreasyPenguin66

Also, what clearml-server version are you running, and what is the Elastic image version?

  				
Posted 3 years ago by SuccessfulKoala55

Another question came to mind. When changing any clearml deployment, like apiserver, mongo or elasticsearch, do I have to redeploy everything from scratch? Previously I had some problems where changing something in apiserver forced me to redeploy everything in order for clearml to work properly, and I am wondering whether you have some guidelines for that.

  				
Posted 3 years ago by GreasyPenguin66

Unfortunately, the problem was resolved neither by changing the VM memory settings back to 2 GB nor by going back from azurefiles persistent volumes to hostPath. This seems odd, as I did not have any of these issues before. I thought it might come from the changes in the PV and elasticsearch settings, but going back to the original settings did not resolve the issue. Shouldn't I be using the latest tag for clearml?

  				
Posted 3 years ago by GreasyPenguin66

Are your backups from the same ClearML server version?

  				
Posted 3 years ago by SuccessfulKoala55

"Shouldn't I be using the latest tag for clearml?"

What do you mean by the latest tag? What deployment are you using exactly?

  				
Posted 3 years ago by SuccessfulKoala55

Yes, they are. With mongo I had a problem with azurefiles: mongo could not initialize when azurefiles was mounted under /data/db. The solution was to mount azurefiles under a different path and then specify the mongo command with the path to the data, so that it could initialize properly. However, when I deleted the Kubernetes cluster, created a new one and redeployed clearml, the issues no longer came from mongo but from the apiserver, which was failing on migrations. Then I noticed the problem with elasticsearch, which also had azurefiles mounted as a volume. So I am wondering whether my errors are specific to my setup and there should normally be no problem reinitializing the clearml environment using the data from a previous environment, or whether there is something obvious that I am missing 😉

  				
Posted 3 years ago by GreasyPenguin66

As for the clearml server version, by "latest tag" I meant v0.17.0.

  				
Posted 3 years ago by GreasyPenguin66

To check whether the server tries to perform this initialization on startup, look in the apiserver pod/container log for lines similar to these:
[2021-04-23 12:44:35,493] [31795] [INFO] [trains.initialize] Applying mappings to ES host: [ConfigTree([('host', '127.0.0.1'), ('port', 9200)])]
[2021-04-23 12:44:36,193] [31795] [INFO] [trains.initialize] [{'mapping': 'events', 'result': {'acknowledged': True}}, {'mapping': 'events_training_debug_image', 'result': {'acknowledged': True}}, {'mapping': 'events_plot', 'result': {'acknowledged': True}}, {'mapping': 'events_log', 'result': {'acknowledged': True}}]
[2021-04-23 12:44:36,193] [31795] [INFO] [trains.initialize] Applying mappings to ES host: [ConfigTree([('host', '127.0.0.1'), ('port', 9200)])]
[2021-04-23 12:44:36,805] [31795] [INFO] [trains.initialize] [{'mapping': 'queue_metrics', 'result': {'acknowledged': True}}, {'mapping': 'worker_stats', 'result': {'acknowledged': True}}]
Note: this should be around the start of the log
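
If the log is long, something like the following could pull out which mappings were acknowledged. This is only a rough sketch that assumes the log lines look exactly like the ones quoted in this post; the function name `applied_mappings` and the regular expression are made up for illustration and are not part of ClearML.

```python
import re

# Matches entries such as {'mapping': 'worker_stats', 'result': {'acknowledged': True}}
# (assumption: the format is the same as in the log lines quoted in this thread).
MAPPING_RE = re.compile(
    r"\{'mapping': '([^']+)', 'result': \{'acknowledged': (True|False)\}\}"
)


def applied_mappings(log_text: str) -> dict:
    """Return {mapping name: acknowledged flag} found on trains.initialize log lines."""
    found = {}
    for line in log_text.splitlines():
        if "[trains.initialize]" not in line:
            continue  # ignore everything that is not an initialization line
        for name, ack in MAPPING_RE.findall(line):
            found[name] = (ack == "True")
    return found


def worker_stats_initialized(log_text: str) -> bool:
    """True only if a worker_stats mapping was reported as acknowledged."""
    return applied_mappings(log_text).get("worker_stats", False)
```

Feed it the text of the apiserver pod log (for example, the saved output of kubectl logs for that pod). If `worker_stats_initialized` returns False, the server either skipped the initialization or the lines are formatted differently in your version.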

  				
Posted 3 years ago by SuccessfulKoala55

If this is indeed performed by the server, the issue is most likely a worker_stats index that was somehow created before the mapping could be applied - the quickest solution is to manually delete the index.
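
A rough sketch of how that check and delete could look against the Elasticsearch REST API, using only the Python standard library. Assumptions to adjust for your setup: Elasticsearch is reachable at http://127.0.0.1:9200 (taken from the log example in this thread), and the affected indices match the pattern `worker_stats*` (a guess based on the mapping name in the log, so list your indices first to confirm). Deleting an index drops the worker statistics stored in it.

```python
import json
import urllib.request

ES_URL = "http://127.0.0.1:9200"   # assumption: adjust to your ES service address
INDEX_PATTERN = "worker_stats*"    # assumption: guessed from the mapping name in the log


def indices_with_text_field(mappings: dict, field: str = "worker") -> list:
    """Given a GET /<pattern>/_mapping response (ES 7.x shape), return the
    names of indices where `field` is mapped as `text` instead of `keyword`."""
    bad = []
    for index, body in mappings.items():
        props = body.get("mappings", {}).get("properties", {})
        if props.get(field, {}).get("type") == "text":
            bad.append(index)
    return sorted(bad)


def es_request(method: str, path: str) -> dict:
    """Send a request without a body to Elasticsearch and decode the JSON reply."""
    req = urllib.request.Request(f"{ES_URL}{path}", method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fix_worker_stats(delete: bool = False) -> list:
    """List indices whose `worker` field is `text`; delete them only if asked to."""
    mappings = es_request("GET", f"/{INDEX_PATTERN}/_mapping")
    bad = indices_with_text_field(mappings)
    for index in bad:
        print("worker is mapped as text in:", index)
        if delete:
            es_request("DELETE", f"/{index}")  # irreversible: drops the index data
    return bad
```

Call `fix_worker_stats()` first to only list the suspicious indices, then `fix_worker_stats(delete=True)` once you are sure. After deleting, restart the apiserver so that it can apply the mapping before new worker statistics are written.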

  				
Posted 3 years ago by SuccessfulKoala55

It might be caused by a memory issue - I'd suggest restoring the VM's memory setting, just to make sure it's not the cause of the issue.

  				
Posted 3 years ago by SuccessfulKoala55

Hi again GreasyPenguin66 🙂
For some reason, it looks like the mapping for the Elastic index containing the worker (agent) statistics was not initialized correctly - this normally happens automatically when the ClearML server starts up. The server might skip this auto-initialization if it suspects that the ES data originates from an un-migrated pre-v16 Trains Server deployment (I'm not sure this is the case here).

  				
Posted 3 years ago by SuccessfulKoala55

Hi GreasyPenguin66 , this looks like some sort of a mapping issue in Elastic...

  				
Posted 3 years ago by SuccessfulKoala55
