Hello, I would like to use spot instances together with the AWS autoscaler to train models with PyTorch/Ignite, and I am wondering how to support interruptions during training (in case the instance is terminated by AWS). Is there anything already built in?
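For context, this is roughly the checkpoint/resume pattern I have in mind with ignite's Checkpoint handler (the toy model, the /tmp/checkpoints path and the every-1000-iterations interval are just placeholders):

```python
import torch
import torch.nn as nn
from ignite.engine import Engine, Events
from ignite.handlers import Checkpoint, DiskSaver

# Toy model/optimizer only to keep the sketch self-contained.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

def train_step(engine, batch):
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(train_step)

# Periodically save model/optimizer/trainer state so a replacement spot
# instance can pick up from the last checkpoint instead of restarting.
to_save = {"model": model, "optimizer": optimizer, "trainer": trainer}
checkpoint_handler = Checkpoint(
    to_save,
    DiskSaver("/tmp/checkpoints", create_dir=True, require_empty=False),
    n_saved=2,
)
trainer.add_event_handler(Events.ITERATION_COMPLETED(every=1000), checkpoint_handler)

# On the replacement instance, reload the latest checkpoint before trainer.run():
# ckpt = torch.load("/tmp/checkpoints/checkpoint_<step>.pt")
# Checkpoint.load_objects(to_load=to_save, checkpoint=ckpt)

data = [torch.randn(4, 10) for _ in range(3000)]
trainer.run(data, max_epochs=1)
```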
Btw, I monkey-patched ignite's global_step_from_engine function to print the iteration and passed the modified function to ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)). It prints the correct iteration number when ClearMLLogger.OutputHandler.__call__ is called; a sketch of the patch follows below.
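A minimal sketch of what the patch looks like (patched_global_step_from_engine is my own name; it just mirrors ignite's global_step_from_engine and adds a print):

```python
from ignite.engine import Engine, Events

def patched_global_step_from_engine(engine: Engine):
    # Mirrors ignite's global_step_from_engine, with an extra debug print.
    def wrapper(_: Engine, event_name: Events) -> int:
        step = engine.state.get_event_attrib_value(event_name)
        print(f"global_step_transform -> {step}")
        return step
    return wrapper
```

and I attach it roughly like this (the tag and output_transform are just examples):

```python
from ignite.contrib.handlers.clearml_logger import ClearMLLogger

clearml_logger = ClearMLLogger(project_name="examples", task_name="ignite")
clearml_logger.attach_output_handler(
    trainer,
    event_name=Events.ITERATION_COMPLETED,
    tag="training",
    output_transform=lambda loss: {"loss": loss},
    global_step_transform=patched_global_step_from_engine(trainer),
)
```

For reference, this is the __call__ implementation of the OutputHandler that I stepped through: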
```python
def __call__(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
    if not isinstance(logger, ClearMLLogger):
        raise RuntimeError("Handler OutputHandler works only with ClearMLLogger")

    metrics = self._setup_output_metrics(engine)

    global_step = self.global_step_transform(engine, event_name)  # type: ignore[misc]

    if not isinstance(global_step, int):
        raise TypeError(
            f"global_step must be int, got {type(global_step)}."
            " Please check the output of global_step_transform."
        )

    for key, value in metrics.items():
        if isinstance(value, numbers.Number) or isinstance(value, torch.Tensor) and value.ndimension() == 0:
            logger.clearml_logger.report_scalar(title=self.tag, series=key, iteration=global_step, value=value)
        elif isinstance(value, torch.Tensor) and value.ndimension() == 1:
            for i, v in enumerate(value):
                logger.clearml_logger.report_scalar(
                    title=f"{self.tag}/{key}", series=str(i), iteration=global_step, value=v.item()
                )
        else:
            warnings.warn(f"ClearMLLogger output_handler can not log metrics value type {type(value)}")
```

I don't understand how it can log a wrong iteration if the `global_step` variable has the right value in this function.