Hi there, I'm having a slight issue with my Kubernetes pods silently failing after downloading a ClearML registered dataset (which is around 60 GB) as part of a model training script. The pods consistently fail after running the

Answered

Hi there, I'm having a slight issue with my Kubernetes pods silently failing after downloading a ClearML registered dataset (around 60 GB) as part of a model training script. The pods consistently fail with exit code 137 right after running `target_folder = dataset.get_mutable_local_copy()`. That implies OOM, but I'm getting no error messages at all, and the node seems to have plenty of storage and resources to handle the job. Anyone have any experience with this? The agent spins up the pod on an AWS g4dn.4xlarge node and executes a long-running training job, but it fails straight after the dataset download. I've tried limiting `max_workers` on the download and limiting the memory on the Docker container, neither of which worked. I'm a bit stumped!
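For context on the exit code: when a process is killed by a signal, shells and container runtimes conventionally report 128 plus the signal number, so 137 corresponds to signal 9 (SIGKILL). The OOM killer uses SIGKILL, but other things can send it too, so 137 on its own suggests OOM rather than proving it. A minimal sketch of the decoding, where `describe_exit_code` is a hypothetical helper name and a Linux host is assumed:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code; values above 128 usually mean 'killed by signal (code - 128)'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

The pod's last termination reason is another place to check whether the kill was actually attributed to memory.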

  				
Posted 2 years ago by NaughtyFish36


Answers 4

Update on this: it looks like an error in our code that isn't being raised properly. I'll dig into it further, but for now this can be left. Thanks for replying!
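For anyone landing here with a similar silent failure, one generic way to make sure an error in the training script is always surfaced is to wrap the entry point so any exception is logged with its traceback and turned into a non-zero exit status. This is only a sketch of that idea, not the fix that was applied in this thread; `run_and_report` and the logger name are hypothetical:

```python
import logging
import sys

logging.basicConfig(level=logging.INFO, stream=sys.stderr)
log = logging.getLogger("train")


def run_and_report(main_fn) -> int:
    """Run main_fn; log any exception with its full traceback and return an exit status."""
    try:
        main_fn()
    except Exception:
        # log.exception records the message plus the active traceback.
        log.exception("training script failed")
        return 1
    return 0
```

It would be used as `sys.exit(run_and_report(main))` at the bottom of the script, where `main` is the script's own entry function. Note that this cannot catch a SIGKILL, so a genuine OOM kill would still leave no Python traceback.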

  				
Posted 2 years ago by NaughtyFish36

Hmm yeah, I have monitored some of the resource metrics and they didn't seem to be an issue. I'll attempt to install Prometheus / Grafana. This is a PoC, however, so I was hoping not to have to install too many tools.

The code running is basically this:
```python
import time

from clearml import Dataset, Task

# parse_args, commands and make_file_list come from our own project code (not shown)

if __name__ == "__main__":

    # initiate clear ml task
    task = Task.init(
        project_name="hannd-0.1",
        task_name="train-endtoend-0.2",
        auto_connect_streams={'stdout': True, 'stderr': True, 'logging': True}
    )
    task.set_base_docker(docker_image="")
    task.set_script(working_dir="mains/training/", entry_point="train_endtoend.py")

    start_time = time.time()
    args = parse_args(commands)  # Reset batch size if network is stateful

    # Get a dataset
    dataset = Dataset.get(dataset_id=args.clearml_dataset_id)
    target_folder = dataset.get_mutable_local_copy(target_folder=args.clearml_dataset_loc, max_workers=1, overwrite=True)

    if args.X is not None:
        input_files = make_file_list(target_folder, [".bin", ".mp4"])
```

It seems to fail on or after the `shutil.copy()` between the cache and the target folder. I've watched that folder while shelled into the pod, and the files seem to copy over fine. But something goes wrong either during that call or on its completion, which causes my pod to exit with code 137. Any thoughts at all?

  				
Posted 2 years ago by NaughtyFish36

Today I'm OOO, but I can give an initial suggestion: when dealing with resource usage issues, logs are important but metrics can help a lot more. If you don't have it, install a Grafana stack so we can see the resource metric history before the OOM. This helps us understand whether we are really using a lot of RAM or the problem is somewhere else.
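Where installing a full metrics stack is too heavy for a PoC, a rough first check is to have the script print its own peak memory around the suspect step. The sketch below uses only Python's standard `resource` module (Unix only); `peak_rss_mib` and `log_memory` are hypothetical helper names. It reports this process's own peak only, so it is no substitute for pod- or node-level metrics:

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB.

    ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label: str) -> float:
    """Print the current peak RSS with a label and return it."""
    mib = peak_rss_mib()
    # flush=True so the line reaches the pod log even if the process is killed soon after.
    print(f"[mem] {label}: peak RSS {mib:.1f} MiB", flush=True)
    return mib
```

Calling `log_memory("before dataset copy")` and `log_memory("after dataset copy")` on either side of the download would show whether the Python process itself is growing towards the container's memory limit.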

  				
Posted 2 years ago by JuicyFox94

Hi NaughtyFish36, I must admit this is new to me. Perhaps JuicyFox94 has an idea; however, I think he's not available today. Can you perhaps try to attach some logs and more details?

  				
Posted 2 years ago by SuccessfulKoala55

