For the moment this is what I would be inclined to believe
Amazing! 🎉
Let me know how we can help 🙂
I fixed it, will push a fix in pytorch-ignite 🙂
Also, this is maybe a separate issue but could be linked: if I add Task.current_task().get_logger().flush(wait=True) like this:
def log_loss(engine):
    idist.barrier()
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")
Then the program freezes and I have to abort manually. With wait=False
it doesn’t freeze, but still doesn’t report the scalars
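One way to narrow down a hang like this without having to kill the run by hand is to wrap the blocking call in a watchdog. The sketch below is not a clearml feature: `call_with_timeout` is a hypothetical helper that runs any callable in a daemon thread and reports whether it finished in time. It assumes the callable is safe to leave running in the background if it times out.

```python
import threading


def call_with_timeout(fn, timeout_sec):
    """Run fn() in a daemon thread; return True if it finished within timeout_sec.

    Hypothetical debugging helper, not part of clearml or ignite.
    If fn() is still blocked after the timeout, the thread is left running
    in the background (daemon threads do not block interpreter exit).
    If fn() raises, it is also reported as not finished.
    """
    done = threading.Event()

    def _target():
        fn()
        done.set()

    threading.Thread(target=_target, daemon=True).start()
    return done.wait(timeout_sec)
```

In the callback above, the flush could be called as `call_with_timeout(lambda: Task.current_task().get_logger().flush(wait=True), 30)`, printing the device whenever it returns False, to see which processes actually get stuck.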
now realise that the ignite events callbacks seem to not be fired
So this is an ignite issue ?
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Okay a bit of theoretical "how it actually works" (and I might be mistaken here...)
Console logging is being reported because the underlying DDP infra (gloo) is piping stdout to the main process, where clearml will catch it (I think). The scalars not working in the subprocesses and the flush wait getting stuck are, I think, related: the wait actually waits for the flush process, and it seems it cannot actually "talk" to it, hence the hang and no logs. There was a fix in the latest RC that solved a similar issue (basically a forking race with internal python states). Can you try with clearml==1.1.5rc2?
AgitatedDove14 Same problem with clearml==1.1.5rc2 😞, I also tried with backend==gloo, still same problem
Thanks JitteryCoyote63 , once we have a reproducible example the fix should be very quick to push (with these things reproducing it is the challenge)
AgitatedDove14 yes! I now realise that the ignite events callbacks seem to not be fired (I tried to print a debug message on a custom Events.ITERATION_COMPLETED) and I cannot see it logged
I opened an issue (https://github.com/pytorch/ignite/issues/2343) in ignite’s repo and a PR (https://github.com/pytorch/ignite/pull/2344), could you please have a look? There might be a bug in clearml Task.init in distributed envs
If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this will log as expected one value per process, so reporting works
JitteryCoyote63 and do prints get logged as well (from all processes) ?
JitteryCoyote63 How can I reproduce it quickly?
So probably only the main process (rank=0) should attach the ClearMLLogger?
ok, so even if that guy is attached, it doesn’t report the scalars
Hi JitteryCoyote63
Somehow I thought it was solved 😞
1) Yes, please add a GitHub issue so we can keep track
2) Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Is this the main issue?
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
The main issue is the task_logger.report_scalar() not reporting the scalars
AgitatedDove14 I think it’s on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you 🙂
And is Task.init called on all processes ?
Yes, no reason to attach the second one (imho)
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars using multi processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")
Then all the reported texts are logged but not the scalars 🤔
I am actually calling the following later in the start_training function:
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)
So my backend should be nccl and not gloo, right? Not sure how important it is, I read in the docs (https://pytorch.org/docs/stable/distributed.html#which-backend-to-use) that nccl should be used for distributed GPU training and gloo for distributed CPU training
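The rule of thumb from that docs page can be written down as a tiny helper. `pick_backend` is a hypothetical name; in a real script the flag would typically come from `torch.cuda.is_available()`, which is left out here so the sketch runs without torch.

```python
def pick_backend(has_cuda: bool) -> str:
    """Return the torch.distributed backend name suggested for the hardware.

    Follows the rule quoted above: nccl for distributed GPU training,
    gloo for distributed CPU training.
    """
    return "nccl" if has_cuda else "gloo"
```

The result could then be passed on as `idist.Parallel(backend=pick_backend(torch.cuda.is_available()))`.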
I will try with clearml==1.1.5rc2
JitteryCoyote63 maybe this is an old example of the pytorch DDP code? It is basically copy-pasted from the pytorch website:
https://pytorch.org/tutorials/intermediate/dist_tuto.html
Hi AgitatedDove14 , How should we proceed to fix this bug? Should I open an issue in github? Should I try to make a minimal reproducible example? It’s blocking me atm
And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
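If only the main process should attach the logger, the check could look roughly like this. Assumptions: the global rank is read from the `RANK` environment variable that the torch launchers set (inside ignite, `idist.get_rank()` would be the natural source), and `attach_on_main_only` is a hypothetical helper, not an ignite or clearml API.

```python
import os


def get_global_rank() -> int:
    """Global rank from the RANK env var; 0 when not launched in distributed mode."""
    return int(os.environ.get("RANK", "0"))


def attach_on_main_only(attach_fn) -> bool:
    """Call attach_fn() on rank 0 only; return True if it was called."""
    if get_global_rank() == 0:
        attach_fn()
        return True
    return False
```

Here `attach_fn` would be whatever attaches the ClearMLLogger handlers to the trainer. Every other rank skips it and keeps training as usual.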
AgitatedDove14 If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this will log as expected one value per process, so reporting works
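To make the per-process check above repeatable, the same call can be wrapped so that each process writes to its own series. This is only a sketch: `report_per_rank` and `RecordingLogger` are made-up names, the rank is assumed to come from the `LOCAL_RANK` environment variable that the torch launchers export, and the only clearml call relied on is `report_scalar(title, series, value, iteration)` as used in this thread.

```python
import os


def report_per_rank(logger, title, value, iteration):
    """Report a scalar under a series named after this process's local rank."""
    # LOCAL_RANK is exported by torchrun; default to "0" for single-process runs.
    local_rank = os.environ.get("LOCAL_RANK", "0")
    logger.report_scalar(title, f"rank_{local_rank}", value, iteration)


class RecordingLogger:
    """Stand-in for a clearml Logger so the sketch runs without a server."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))
```

With a real task, `Task.current_task().get_logger()` would be passed in place of the stand-in.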
btw, in the pytorch_distributed_example I see that you call average_gradients, but the pytorch overview (https://pytorch.org/tutorials/beginner/dist_overview.html) says: "DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training."
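For reference, what a manual gradient-averaging step computes can be simulated without torch: sum each gradient entry across ranks, then divide by the world size. `all_reduce_mean` is a hypothetical stand-in operating on plain lists. DDP performs the equivalent averaging itself during the backward pass, which is why the extra step should not be needed when the model is wrapped in DDP.

```python
def all_reduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks.

    per_rank_grads: list with one gradient list per rank, all the same length.
    Returns the averaged gradient that every rank would end up with.
    """
    world_size = len(per_rank_grads)
    summed = [sum(values) for values in zip(*per_rank_grads)]
    return [s / world_size for s in summed]
```

For two ranks holding `[1.0, 2.0]` and `[3.0, 4.0]`, both would end up with `[2.0, 3.0]`.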