And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that
So probably only the main process (rank=0) should attach the ClearMLLogger?
Yes, no reason to attach the second one (imho)
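A minimal sketch of the "attach only on rank 0" idea discussed above. It assumes the launcher (for example torchrun) exports the RANK environment variable; with ignite, `idist.get_rank()` would serve the same purpose. `maybe_attach_logger` and `attach_fn` are hypothetical names used for illustration, not part of clearml or ignite.

```python
import os


def get_rank() -> int:
    """Return this process's global rank.

    Assumption: the launcher exports RANK. A run without it is treated
    as a single-process run, i.e. rank 0.
    """
    return int(os.environ.get("RANK", "0"))


def is_main_process() -> bool:
    """True only for the rank-0 process."""
    return get_rank() == 0


def maybe_attach_logger(attach_fn) -> bool:
    """Call attach_fn() on rank 0 only and report whether it ran.

    attach_fn is a hypothetical zero-argument callable that would build
    the ClearMLLogger and attach its handlers to the trainer.
    """
    if is_main_process():
        attach_fn()
        return True
    return False
```

Every other rank skips the attachment, so only one process registers logging handlers.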
ok, so even if that guy is attached, it doesn’t report the scalars
And is Task.init called on all processes ?
AgitatedDove14 If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this will log as expected one value per process, so reporting works
btw I see that the pytorch_distributed_example calls average_gradients, but the PyTorch overview at https://pytorch.org/tutorials/beginner/dist_overview.html says: "DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training."
JitteryCoyote63 maybe this is an old example of the PyTorch DDP code? It is basically copy-pasted from the PyTorch website:
https://pytorch.org/tutorials/intermediate/dist_tuto.html
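For context, here is a torch-free sketch of what a manual gradient-averaging step computes: the gradients of all replicas are summed element-wise (an all-reduce) and divided by the world size, so every replica ends up with the same values. DistributedDataParallel performs an equivalent synchronization itself during the backward pass, which is why a manual step is not needed when the model is wrapped in DDP. The list-based "replicas" below are purely illustrative, and this function is a simulation, not the code from the example.

```python
def average_gradients(per_rank_grads):
    """Simulate an all-reduce average over per-process gradient lists.

    per_rank_grads holds one list of floats per (simulated) process.
    Returns the list of gradients each process would hold afterwards.
    """
    world_size = len(per_rank_grads)
    num_params = len(per_rank_grads[0])
    # Element-wise sum across processes (the "all-reduce" part).
    summed = [sum(grads[i] for grads in per_rank_grads) for i in range(num_params)]
    # Divide by the number of processes to get the average.
    averaged = [value / world_size for value in summed]
    # Every process receives its own copy of the same averaged gradient.
    return [list(averaged) for _ in range(world_size)]
```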
If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this will log as expected one value per process, so reporting works
JitteryCoyote63 and do prints get logged as well (from all processes) ?
AgitatedDove14 yes! I now realise that the ignite events callbacks seem to not be fired (I tried to print a debug message on a custom Events.ITERATION_COMPLETED) and I cannot see it logged
now realise that the ignite events callbacks seem to not be fired
So this is an ignite issue ?
For the moment this is what I would be inclined to believe
Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong with the reporting of scalars when using multiple processes, because if my ignite callback is the following:
def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")
Then all the reported texts are logged but not the scalars 🤔
Also, this may be a separate issue, but it could be linked: if I add Task.current_task().get_logger().flush(wait=True) like this:
def log_loss(engine):
    idist.barrier()
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")
Then the program freezes and I have to abort manually. With wait=False it doesn’t freeze, but it still doesn’t report the scalars
Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Okay a bit of theoretical "how it actually works" (and I might be mistaken here...)
Console logging is being reported because the underlying DDP infra (gloo) is piping stdout to the main process, where clearml will catch it (I think). The scalars not working on the subprocesses and the flush wait getting stuck are, I think, related: the wait actually waits for the flush process, and it seems it cannot actually "talk" to it, hence the hang and no logs. There was a fix in the latest RC that solved a similar issue (basically a forking race with internal Python state). Can you try with clearml==1.1.5rc2?
I am actually calling the following later in the start_training function:
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)
So my backend should be nccl and not gloo, right? Not sure how important it is; I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training
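The backend rule of thumb quoted from the PyTorch docs can be written as a tiny helper. This is only a sketch: `choose_backend` is a hypothetical name, and the caller is assumed to pass in whether CUDA is available (for example the result of `torch.cuda.is_available()`).

```python
def choose_backend(cuda_available: bool) -> str:
    """Pick a torch.distributed backend name.

    Follows the guidance in the PyTorch docs: nccl for distributed GPU
    training, gloo for distributed CPU training.
    """
    return "nccl" if cuda_available else "gloo"


# Hypothetical usage with ignite (not executed here):
#   backend = choose_backend(torch.cuda.is_available())
#   with idist.Parallel(backend=backend) as parallel:
#       parallel.run(training_func)
```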
I will try with clearml==1.1.5rc2
AgitatedDove14 Same problem with clearml==1.1.5rc2 😞 I also tried with backend="gloo", still the same problem
Hi AgitatedDove14, how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It’s blocking me atm
Hi JitteryCoyote63
Somehow I thought it was solved 😞
1) Yes, please add a GitHub issue so we can keep track
2) Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
Is this the main issue?
The main issue is task_logger.report_scalar() not reporting the scalars
JitteryCoyote63 How can I reproduce it quickly?
AgitatedDove14 I think it’s on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you 🙂
Thanks JitteryCoyote63, once we have a reproducible example the fix should be very quick to push (with these things, reproducing it is the challenge)
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
I fixed it, will push a fix in pytorch-ignite 🙂
I opened an issue (https://github.com/pytorch/ignite/issues/2343) in ignite’s repo and a PR (https://github.com/pytorch/ignite/pull/2344), could you please have a look? There might be a bug in clearml Task.init in distributed envs