Hi, I'm using ClearML's hosted free SaaS offering. I'm running model training in PyTorch on a server and pushing metrics to CML. I've noticed that anytime my training job fails due to GPU OOM issues, CML marks the job as

AgitatedDove14, I finally had a chance to look into this properly, and I think I know what's going on.
When running any task with hydra, hydra wraps the called method in its own wrapper function ( https://github.com/facebookresearch/hydra/blob/a559aa4bf6807d5e3a82e065987825fa322351e2/hydra/_internal/utils.py#L211 ). When the task throws any exception, it triggers the except block of this wrapper, which handles the exception.
CML marks a task as failed only if the exception the task raised was not handled and the task exited abruptly, because it relies on sys.excepthook=self.exc_handler .
However, in this scenario the exception is handled by hydra (and it always will be if hydra is used). The exc_handler method that CML uses to determine whether there was an exception is attached to sys.excepthook , which only fires for unhandled exceptions, so it is never called and CML sees the run as having had no exception.
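To make the mechanism above concrete, here is a minimal, self-contained sketch. It does not use hydra or ClearML at all: `wrapper`, `task` and `hook` are hypothetical stand-ins for hydra's wrapper, the training function and CML's exc_handler. It assumes the wrapper catches the exception and then exits with a non-zero status, and it runs the same script twice in a child process to compare the two cases.

```python
import subprocess
import sys
import textwrap

# Child script: installs an excepthook, then raises either directly
# or inside a wrapper that catches the exception and exits.
SCRIPT = textwrap.dedent('''
    import sys

    def hook(exc_type, exc, tb):
        # Stand-in for a tracker's exception handler.
        print("excepthook called:", exc_type.__name__)

    sys.excepthook = hook

    def wrapper(fn):
        # Stand-in for a framework wrapper that handles the error itself.
        try:
            fn()
        except Exception as e:
            print("wrapper handled:", type(e).__name__)
            sys.exit(1)

    def task():
        raise RuntimeError("CUDA out of memory")

    if sys.argv[1] == "wrapped":
        wrapper(task)
    else:
        task()
''')


def run(mode):
    """Run the child script in the given mode; return (exit code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", SCRIPT, mode],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


if __name__ == "__main__":
    print(run("bare"))     # exception reaches the interpreter
    print(run("wrapped"))  # exception is caught by the wrapper
```

In both modes the process exits with a non-zero status, but the hook only runs in the "bare" mode. Anything that relies purely on sys.excepthook cannot tell the "wrapped" failure apart from a clean run.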

I think CML needs a better way of determining whether there was an exception, rather than hoping that the code doesn't catch it, because in most good production systems exits will generally be handled gracefully. I'm modifying my own code base atm to do exactly that, and in such cases CML will never be able to detect the exception.
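As a rough illustration of the workaround on the user side, code that handles its own exceptions can still report the failure explicitly before the exception is swallowed further up. The decorator below is only a sketch: `report_failures` and `on_failure` are hypothetical names, and the callback is where a status update would go. I believe ClearML exposes something like `Task.current_task().mark_failed(...)` for this, but treat that as an assumption and check the SDK docs.

```python
import functools


def report_failures(on_failure):
    """Call on_failure(exc) when the wrapped function raises, then re-raise.

    on_failure is a hypothetical callback; with ClearML it might mark the
    current task as failed (assumption, not verified here).
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                # Report first, so the failure is recorded even if an outer
                # wrapper (e.g. hydra's) catches the exception afterwards.
                on_failure(exc)
                raise
        return wrapped
    return decorator


if __name__ == "__main__":
    failures = []

    @report_failures(failures.append)
    def train():
        raise RuntimeError("CUDA out of memory")

    try:
        train()
    except RuntimeError:
        pass
    print(failures)
```

The decorator goes directly on the training function, inside whatever the framework wraps, so it sees the exception before the framework does.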

Maybe instead of relying on the system's excepthook, y'all can hook in a method at exit which will look for tracebacks and exception messages to determine if the code has terminated due to some error. Just throwing out an idea off the top of my head.
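A very rough sketch of what an exit-time check could look like. Note this swaps in a simpler signal than scanning for tracebacks: it records the status passed to sys.exit and inspects it in an atexit handler. `ExitWatcher` and `on_failure` are hypothetical names, this is not how ClearML is implemented, and it assumes the wrapping framework exits via sys.exit with a non-zero status after handling the error.

```python
import atexit
import sys


class ExitWatcher:
    """Record the status passed to sys.exit and report failures at exit."""

    def __init__(self, on_failure):
        self.on_failure = on_failure      # hypothetical callback
        self.exit_code = None
        self._original_exit = sys.exit

    def install(self):
        sys.exit = self._recording_exit
        atexit.register(self._at_exit)

    def uninstall(self):
        sys.exit = self._original_exit
        atexit.unregister(self._at_exit)

    def _recording_exit(self, code=None):
        # Remember the requested status, then exit as usual.
        self.exit_code = code
        self._original_exit(code)

    def failed(self):
        # None and 0 both mean a successful exit.
        return self.exit_code not in (None, 0)

    def _at_exit(self):
        if self.failed():
            self.on_failure(self.exit_code)


if __name__ == "__main__":
    watcher = ExitWatcher(lambda code: print("task failed, exit status:", code))
    watcher.install()
    sys.exit(1)  # simulates a wrapper exiting after handling an exception
```

This only sees exits that go through sys.exit; os._exit, fatal signals, or a wrapper that swallows the exception and returns normally would still go unnoticed.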

But generally IMO, CML should have a better approach for detecting errors and updating task statuses correctly.

Hope this helps.

  				
Posted 3 years ago

JumpyClams73