Hi I Have A Problem (Presumably) With K8Sglue-Agents Working Roughly 50% Of The Time. Basically What Happens Is When I Launch A Pipeline From Command Line, It Ends Up In "Pending" State And Dissapears From All Of The Queues. Initially It First Appears On

Answered

I have a problem (presumably) with k8sglue-agents working roughly 50% of the time. Basically what happens is when I launch a pipeline from command line, it ends up in "Pending" state and dissapears from all of the queues. Initially it first appears on our k8s-scheduler queue, from where it is then allocated to the correct queue once an agent is up (when stuff works). By digging a little into the logs of our k8sglue-agent, I was able to find the following snippet from the crash logs, which seems to occur whenever the process fails:

` FATAL ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.6/ssl.py", line 874, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 786, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 710, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib/python3.6/http/client.py", line 1377, in getresponse
response.begin()
File "/usr/lib/python3.6/http/client.py", line 320, in begin
version, status, reason = self._read_status()
File "/usr/lib/python3.6/http/client.py", line 281, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/usr/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File "/usr/lib/python3.6/ssl.py", line 1012, in recv_into
return self.read(nbytes, buffer)
File "/usr/lib/python3.6/ssl.py", line 874, in read
return self._sslobj.read(len, buffer)
File "/usr/lib/python3.6/ssl.py", line 631, in read
v = self._sslobj.read(len, buffer)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/worker.py", line 1449, in daemon
gpu_queues=dynamic_gpus,
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/glue/k8s.py", line 774, in run_tasks_loop
level="INFO",
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 99, in send_log_events
return self.send_events(list_events=log_events, session=session)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 58, in send_events
sent_events += send_packet(lines)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/events.py", line 32, in send_packet
'add_batch', data=jsonlines, headers={'Content-type': 'application/json-lines'}, session=session
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/commands/base.py", line 127, in post
return session.post(service=self.service, action=endpoint, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/session.py", line 300, in post
json=json or kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/session.py", line 308, in _manual_request
json=json or kwargs)
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/backend_api/session/session.py", line 363, in send_request
json=json,
File "/usr/local/lib/python3.6/dist-packages/clearml_agent/backend_api/session/session.py", line 291, in _send_request
method, url, headers=headers, auth=auth, data=data, json=json, timeout=timeout)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) `

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

Votes Newest

Answers 10

SlimyDove85 this seems to be some network error with the ClearML Server

  				
Posted 
	2 years ago

					More  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

This is clearly a network issue; first I’d check there are no restarts of apiserver during that timespan. It’s not easy to debug this since it looks to be random but it can be interesting to check k8s networking configuration overall just to be sure.

  				
Posted 
	2 years ago

					More  		
  Report
		
					JuicyFox94
				
					0
					 × 1

Queue structure:

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

API server does not restart during the process. I'll try to see if I catch up something in its logs or where should I monitor the networking in? I.e., what is the flow 😅

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

We have CloudWatch also configured, so I could probably do some searches there if I knew what to look for

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

this is a connection fail from agent to apiserver. the flow should be aget-pod -> apiserver svc -> apiserver pod. maybe also apiserver can have something in ogs that can be checked

  				
Posted 
	2 years ago

					More  		
  Report
		
					JuicyFox94
				
					0
					 × 1

JuicyFox94 any idea?

  				
Posted 
	2 years ago

					More  		
  Report
		
					SuccessfulKoala55
				
					0
					 × 1

Ok, was able to get a crash and log some output from the apiserver:

[2022-08-11 09:21:13,727] [11] [INFO] [clearml.service_repo] Returned 200 for tasks.stopped in 17ms [2022-08-11 09:21:13,829] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 11ms [2022-08-11 09:21:13,871] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 8ms [2022-08-11 09:21:13,986] [11] [WARNING] [clearml.service_repo] Returned 400 for queues.get_by_id in 4ms, msg=Invalid queue id: id=feature_pipelines, company=d1bd92a3b039400cbafc60a7a5b1e52b [2022-08-11 09:21:14,217] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 10ms [2022-08-11 09:21:14,491] [11] [INFO] [clearml.service_repo] Returned 200 for tasks.enqueue in 21ms [2022-08-11 09:21:15,125] [11] [INFO] [clearml.service_repo] Returned 200 for debug.ping in 0ms [2022-08-11 09:21:15,128] [11] [INFO] [clearml.service_repo] Returned 200 for debug.ping in 0ms [2022-08-11 09:21:15,677] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_all in 2ms [2022-08-11 09:21:15,754] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 9ms [2022-08-11 09:21:17,728] [11] [INFO] [clearml.service_repo] Returned 200 for tasks.get_all_ex in 22ms [2022-08-11 09:21:18,845] [11] [WARNING] [clearml.service_repo] Returned 400 for in 0ms, msg=Invalid request path / [2022-08-11 09:21:18,847] [11] [WARNING] [clearml.service_repo] Returned 400 for in 0ms, msg=Invalid request path / [2022-08-11 09:21:18,854] [11] [WARNING] [clearml.service_repo] Returned 400 for in 0ms, msg=Invalid request path / [2022-08-11 09:21:19,152] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 9ms [2022-08-11 09:21:19,158] [11] [INFO] [clearml.service_repo] Returned 200 for queues.get_next_task in 10ms [2022-08-11 09:21:20,212] [11] [INFO] [clearml.service_repo] Returned 200 for workers.status_report in 4ms [2022-08-11 09:21:20,277] [11] [WARNING] [clearml.service_repo] Returned 400 for queues.create in 6ms, msg=Value combination already exists (unique field already contains this value): name=df6a44f0f80648a3a2edc0a970944ba7, company=d1bd92a3b039400cbafc60a7a5b1e52b

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

The queue 'feature_pipelines" should exist and the latter queue is something that the agents sometimes want to create for some reason (though it should not be required?)

Latter warning is ok I guess.

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

On AWS EKS with:
Image: allegroai/clearml-agent-k8s-base
clearml-agent version: 1.2.4rc3
python: 3.6.9

  				
Posted 
	2 years ago

					More  		
  Report
		
					SlimyDove85
				
					0
					 × 1

Write your answer

1K Views

10 Answers

2 years ago