Unanswered
Hi all, we have clearml-server running on a kube pod, and then a GPU server running the clearml-agent which we use to queue jobs.
For some reason, our kube pod restarted (we're looking into why), but in the process of this happening all jobs on the worke
Hi all, we're still suffering from this issue where jobs are aborted, seemingly at random. The only clue is this line in the ClearML logs:
2024-12-13 06:16:30  Process terminated by user
The only pattern we can see is that it typically happens around 6-7am.
Any suggestions on how to debug this would be greatly appreciated!
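In case it helps anyone reproduce the analysis, here is a minimal sketch of how the time-of-day pattern could be checked: it counts "Process terminated by user" messages per hour across exported log lines. The line format is assumed from the single log line quoted above, the function name `abort_hours` is hypothetical, and the second sample line is made up for illustration, so treat this as a starting point rather than a tested tool.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed format: "YYYY-MM-DD HH:MM:SS  Process terminated by user",
# inferred from the one log line quoted in this post.
ABORT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+Process terminated by user"
)


def abort_hours(lines):
    """Return a Counter mapping hour of day (0-23) to number of abort messages."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            counts[ts.hour] += 1
    return counts


# First line is the real log line from this post; the second is a
# hypothetical non-matching line included only to show the filtering.
sample = [
    "2024-12-13 06:16:30  Process terminated by user",
    "2024-12-13 06:16:31  some unrelated message",
]

for hour, n in sorted(abort_hours(sample).items()):
    print(f"{hour:02d}:00  {n} abort(s)")
```

If the counts cluster in one or two hours across many days, that would point toward something scheduled (a cron job, a maintenance window, or a periodic restart) rather than random failures, and the same hours could then be compared against pod restart times on the Kubernetes side.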
163 Views, 0 Answers
10 months ago