python invoked oom-killer
Out of memory — CloudySwallow27, in the scaler app task, can you check whether you have scalars reporting?
Got it. The instance logs showed:
/var/log/syslog.1:May 5 03:25:27 ip-172-31-37-234 kernel: [53387.840425] python invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
/var/log/syslog.1:May 5 03:25:27 ip-172-31-37-234 kernel: [53387.840442] oom_kill_process+0xe6/0x120
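For reference, a rough Python sketch of pulling those OOM entries out of the rotated syslogs (this assumes the logs live under /var/log/syslog*; adjust the path and pattern for your distro):

from pathlib import Path

def find_oom_lines(log_dir="/var/log"):
    hits = []
    for log_file in sorted(Path(log_dir).glob("syslog*")):
        if log_file.suffix == ".gz":
            continue  # compressed rotations; decompress separately if needed
        try:
            with open(log_file, errors="replace") as fh:
                for line in fh:
                    if "oom-killer" in line or "Out of memory" in line:
                        hits.append(f"{log_file}:{line.rstrip()}")
        except PermissionError:
            continue  # syslog is usually root-readable only; rerun with sudo
    return hits

if __name__ == "__main__":
    print("\n".join(find_oom_lines()))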
I assume this is something I have to fix on my end (or increase instance memory). Does ClearML also happen to have solutions for this?
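One thing that can help narrow it down is reporting the job's own memory use as scalars, so you can see in the web UI how close it gets to the instance limit before the kernel kills it. A minimal sketch, assuming psutil is installed on the instance and the job is a regular ClearML task (the project/task names here are just placeholders):

import time
import psutil
from clearml import Task

task = Task.init(project_name="examples", task_name="memory watchdog sketch")
logger = task.get_logger()
proc = psutil.Process()  # current process

for i in range(60):  # sample roughly every 5 seconds for ~5 minutes
    rss_mb = proc.memory_info().rss / 1024 ** 2
    sys_pct = psutil.virtual_memory().percent
    logger.report_scalar("memory", "process RSS (MB)", value=rss_mb, iteration=i)
    logger.report_scalar("memory", "system used (%)", value=sys_pct, iteration=i)
    time.sleep(5)

If memory climbs steadily right up to the kill, it points at the job itself (leak, oversized batch/dataset), in which case increasing instance memory or picking a larger instance type is the other lever.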
CloudySwallow27, can you get logs from the machine itself? Are the jobs doing something specific when they fail?
We are using self-hosted ClearML with the following versions:
Worker clearml-agent version: 1.1.2
Autoscaler instance clearml-agent version: 1.2.3
ClearML WebApp: 1.2.0-153, Server: 1.2.0-153, API: 2.16
clearml Python pip package: 1.3.2