Answered
Suddenly all experiments we try to log run into an error

Suddenly all experiments we try to log run into an error. I think it's an issue with the server on our side, because as far as I know nothing changed about Trains (we didn't update or anything), and yesterday it was working fine.

Can anyone provide some insight into what exactly is going wrong in the following message?
2020-11-10 12:56:03,492 - trains.log - WARNING - failed logging task to backend (1 lines, <500/100: events.add_batch/v1.0 (General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c0c9cbbf1a154690b71f2623b7c15ada', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1604980561463, 'type': 'log', 'task': '9544164723554d3085346f7cd33580d1', 'level': 'info', 'worker': 'ubuntu-user', 'msg': 'TRAINS Task: created new task id=9544164723554d3085346f7cd33580d1\nTRAINS results page: \n======> WARNING! UNCOMMITTED CHANGES IN REPOSITORY <======', '@timestamp': '2020-11-10T03:56:03.481Z', 'metric': '', 'variant': ''}}}]), extra_info=index [events-log-d1bd92a3b039400cbafc60a7a5b1e52b] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];)>)

I have a feeling this part of the message:

'status': 403

provides some useful info.

Any tips on how to debug / where to look to solve this problem?

  
  
Posted 4 years ago

Answers 7


It seems to be related to trains-apiserver, based on the following log line from the Docker Compose output:

trains-apiserver | [2020-11-10 04:40:14,133] [8] [ERROR] [trains.service_repo] Returned 500 for queues.get_next_task in 20ms, msg=General data error: err=('1 document(s) failed to index.', [{'index': {'_index': 'queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-11', '_type': '_doc', '_id': 'rkh0sHUBwyiZSyeZUAov', 'status': 403, 'error': {'type': 'cluster_block_exception', 'reason': 'index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-11] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];'}, 'data': {'timestamp': 1604983214115, 'queue': '789a8744857746de84db036d65de8c65', 'average_waiting_time': 0, 'queue_length': 0}}}]), extra_info=index [queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-11] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
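A minimal Python sketch to confirm this is the Elasticsearch read-only block (the host/port are an assumption - they match the default trains-server docker-compose mapping - and the index name is taken from the error above):

import requests

# Index name copied from the error message; localhost:9200 is assumed.
index = "queue_metrics_d1bd92a3b039400cbafc60a7a5b1e52b_2020-11"
settings = requests.get(f"http://localhost:9200/{index}/_settings").json()
blocks = settings[index]["settings"]["index"].get("blocks", {})
print(blocks)  # {'read_only_allow_delete': 'true'} would explain the 403s above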

  
  
Posted 4 years ago

Even when I do a "clean install" (I renamed the /opt/trains folder) and follow the instructions to set up TRAINS, the error appears.

  
  
Posted 4 years ago

DefeatedCrab47 this issue has come up here several times - it is caused by low disk space on your server machine, which makes Elasticsearch switch to read-only mode

  
  
Posted 4 years ago

Just clear up more space on your server disk - by default, Elasticsearch will switch to this mode when less than 5 percent of the disk is free
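Note that, depending on the Elasticsearch version, the block may not be lifted automatically once space has been freed. A minimal Python sketch that clears it on all indices (again assuming Elasticsearch is reachable on localhost:9200):

import requests

# Setting index.blocks.read_only_allow_delete to null (None in Python) removes the block.
resp = requests.put(
    "http://localhost:9200/_all/_settings",
    json={"index.blocks.read_only_allow_delete": None},
)
print(resp.status_code, resp.json())  # expect 200 and {'acknowledged': True}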

  
  
Posted 4 years ago

SuccessfulKoala55 Thank you. I was completely focused on trains-apiserver, but by coincidence I found this message:

trains-elastic | {"type": "server", "timestamp": "2020-11-10T06:11:08,956Z", "level": "WARN", "component": "o.e.c.r.a.DiskThresholdMonitor", "cluster.name": "trains", "node.name": "trains", "message": "flood stage disk watermark [95%] exceeded on [QyZ2i1mxTG6yR7uhVWjV9Q][trains][/usr/share/elasticsearch/data/nodes/0] free: 43.3gb[4.7%], all indices on this node will be marked read-only", "cluster.uuid": "sDf_05oOQmm5euASjIp3Fw", "node.id": "QyZ2i1mxTG6yR7uhVWjV9Q" }

So I was about to post that it's likely due to our disk getting full.

Thank you for your insights!

  
  
Posted 4 years ago

Sure. Just for reference, here's a related GitHub issue: https://github.com/allegroai/trains-server/issues/58

  
  
Posted 4 years ago

Also, I'll try to make sure that, starting from the next version, the server will incorporate a better error heuristic (for example, adding text saying "Check your server disk space", or something to that effect 🙂)

  
  
Posted 4 years ago
930 Views
7 Answers