Unanswered
Hi, If I Am Starting My Training With The Following Command:
Also, this is maybe a separate issue but could be linked, if I add Task.current_task().get_logger().flush(wait=True)
like this:def log_loss(engine): idist.barrier() device = idist.device() print("IDIST", device) from clearml import Task Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}") Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration) Task.current_task().get_logger().flush(wait=True). # <-- WILL HANG HERE Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")
Then the program freezes and I have to abort manually. With wait=False
it doesn’t freeze, but still doesn’t report the scalars
131 Views
0
Answers
2 years ago
one year ago