Hi, If I Am Starting My Training With The Following Command:

Answered

Hi, if I am starting my training with the following command:
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --config configs/train.yaml
and train.py creates a Task, will I be able to start this task remotely (clone and enqueue from the interface)? I.e., will ClearML be able to run the exact same command in an agent?

  				
Posted 2 years ago by JitteryCoyote63


Answers 30

And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that

  				
Posted 2 years ago by JitteryCoyote63

So probably only the main process (rank=0) should attach the ClearMLLogger?

  				
Posted 2 years ago by JitteryCoyote63

Yes, no reason to attach the second one (imho)

  				
Posted 2 years ago by AgitatedDove14
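
A minimal sketch of the "only rank 0 attaches the logger" idea from the posts above. It assumes the launcher was started with --use_env, so each process can read its global rank from the RANK environment variable. The helper names (get_rank, is_main_process, maybe_attach_logger) and the attach_fn callback are hypothetical and only stand in for whatever code attaches ClearMLLogger in train.py.

```python
import os


def get_rank(env=None):
    """Global rank read from the RANK env var; defaults to 0 when it is unset."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def is_main_process(env=None):
    """True only for the process with global rank 0."""
    return get_rank(env) == 0


def maybe_attach_logger(attach_fn, env=None):
    """Call attach_fn (hypothetical: wires up the experiment logger) on rank 0 only."""
    if is_main_process(env):
        attach_fn()
        return True
    return False
```

In a training script, attach_fn would wrap the code that creates the logger and attaches its handlers to the ignite engine, so the other ranks skip it entirely.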

ok, so even if that guy is attached, it doesn’t report the scalars

  				
Posted 2 years ago by JitteryCoyote63

I need to investigate further

  				
Posted 2 years ago by JitteryCoyote63

And is Task.init called on all processes ?

  				
Posted 2 years ago by AgitatedDove14

yes

  				
Posted 2 years ago by JitteryCoyote63

AgitatedDove14 If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0) , this will log as expected one value per process, so reporting works

  				
Posted 2 years ago by JitteryCoyote63
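
A sketch of the per-process probe described above, assuming the launcher was started with --use_env so that each process sees a LOCAL_RANK environment variable. report_rank_probe and RecordingLogger are hypothetical helpers written for this illustration. In a real run the logger argument would be task.get_logger(), whose report_scalar takes title, series, value and iteration.

```python
import os


def report_rank_probe(logger, local_rank=None):
    """Report one scalar whose series name is the local rank, so each process is visible."""
    if local_rank is None:
        # Assumption: LOCAL_RANK is exported by the launcher (--use_env).
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    series = str(local_rank)
    logger.report_scalar(title="test", series=series, value=1.0, iteration=0)
    return series


class RecordingLogger:
    """Hypothetical stand-in that records calls instead of sending them to a server."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, title, series, value, iteration):
        self.calls.append((title, series, value, iteration))
```

If every process shows up as its own series under the "test" plot, direct reporting works and the problem lies somewhere between the training loop and the logger.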

btw, in the pytorch_distributed_example I see that you use average_gradients, but the PyTorch overview at https://pytorch.org/tutorials/beginner/dist_overview.html says:
DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.

  				
Posted 2 years ago by JitteryCoyote63

JitteryCoyote63 maybe this is an old example of the PyTorch DDP code? It is basically copy-pasted from the PyTorch website:
https://pytorch.org/tutorials/intermediate/dist_tuto.html

  				
Posted 2 years ago by AgitatedDove14
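
For readers following the average_gradients question: in the linked dist_tuto tutorial, that helper all-reduces each gradient with a sum and then divides by the world size, which is the synchronization DDP performs on its own. Below is a toy, single-process simulation of that arithmetic. Plain Python lists stand in for per-rank gradient tensors, and simulate_average_gradients is a hypothetical name, not a PyTorch or ClearML function.

```python
def simulate_average_gradients(per_rank_grads):
    """Return what every rank holds after an all-reduce SUM followed by a divide by world size."""
    world_size = len(per_rank_grads)
    # Element-wise sum across ranks (what all_reduce with SUM would produce).
    summed = [sum(values) for values in zip(*per_rank_grads)]
    # Divide by the number of ranks to get the mean gradient.
    averaged = [total / world_size for total in summed]
    # After the collective, every rank ends up with the same averaged values.
    return [list(averaged) for _ in range(world_size)]


grads = [[1.0, 2.0], [3.0, 4.0]]  # two ranks, two parameters each
result = simulate_average_gradients(grads)
```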

If I call explicitly

task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0)

, this will log as expected one value per process, so reporting works

JitteryCoyote63 and do prints get logged as well (from all processes) ?

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 yes! I now realise that the ignite events callbacks seem to not be fired (I tried to print a debug message on a custom Events.ITERATION_COMPLETED) and I cannot see it logged

  				
Posted 2 years ago by JitteryCoyote63

now realise that the ignite events callbacks seem to not be fired

So this is an ignite issue ?

  				
Posted 2 years ago by AgitatedDove14

For the moment this is what I would be inclined to believe

  				
Posted 2 years ago by JitteryCoyote63

Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars using multi processes, because if my ignite callback is the following:

import ignite.distributed as idist  # ignite's distributed helpers

def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")

Then all the reported texts are logged but not the scalars 🤔

  				
Posted 2 years ago by JitteryCoyote63

Also, this is maybe a separate issue but could be linked, if I add Task.current_task().get_logger().flush(wait=True) like this:
import ignite.distributed as idist  # ignite's distributed helpers

def log_loss(engine):
    idist.barrier()
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")

Then the program freezes and I have to abort manually. With wait=False it doesn’t freeze, but it still doesn’t report the scalars

  				
Posted 2 years ago by JitteryCoyote63

Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE

Okay, a bit of theoretical "how it actually works" (and I might be mistaken here...)
Console logging is reported because the underlying DDP infra (gloo) pipes stdout to the main process, where clearml catches it (I think). The scalars not working on the subprocesses and the flush(wait=True) hang are, I think, related: the wait actually waits for the flush process, and it seems it cannot actually "talk" to it, hence the hang and the missing logs.
There was a fix in the latest RC that solved a similar issue (basically a forking race with internal Python states). Can you try with clearml==1.1.5rc2 ?

  				
Posted 2 years ago by AgitatedDove14
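
To make the hypothesis above concrete, here is a toy reporter whose flush waits on a background worker. This is NOT ClearML's implementation; ToyReporter and all of its members are made up for illustration. The start_worker=False case mimics a process that holds the queue but has no live worker to drain it, which is the kind of state a fork can leave behind, since threads are not copied into a forked child. The timeout argument is also an addition of this sketch so the demo returns instead of hanging.

```python
import queue
import threading


class ToyReporter:
    """Toy background reporter (hypothetical, for illustration only)."""

    def __init__(self, start_worker=True):
        self._queue = queue.Queue()
        self._pending = 0
        self._cond = threading.Condition()
        self.sent = []
        if start_worker:
            threading.Thread(target=self._drain, daemon=True).start()

    def report_scalar(self, title, series, value, iteration):
        # Reporting only enqueues; the worker is what actually "sends".
        with self._cond:
            self._pending += 1
        self._queue.put((title, series, value, iteration))

    def _drain(self):
        while True:
            item = self._queue.get()
            self.sent.append(item)
            with self._cond:
                self._pending -= 1
                self._cond.notify_all()

    def flush(self, wait=True, timeout=None):
        """Return True once nothing is pending; with wait=True, block until then or until timeout."""
        if not wait:
            return self._pending == 0
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout=timeout)
```

With a live worker, flush(wait=True) returns quickly and the scalar shows up in sent. Without one, the scalar stays queued, flush(wait=False) returns immediately with nothing sent, and flush(wait=True) with no timeout would block forever, which matches both symptoms reported in the thread.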

I am actually calling later in the start_training function the following:
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)
So my backend should be nccl and not gloo, right? Not sure how important it is; I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training
I will try with clearml==1.1.5rc2

  				
Posted 2 years ago by JitteryCoyote63
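
The rule of thumb quoted from the PyTorch docs can be written as a tiny helper. pick_backend is a hypothetical name; in a real script the flag would typically come from torch.cuda.is_available(), and the result would be passed as backend= to idist.Parallel.

```python
def pick_backend(use_gpu):
    """Rule of thumb from the PyTorch docs: nccl for GPU training, gloo for CPU training."""
    return "nccl" if use_gpu else "gloo"


gpu_backend = pick_backend(True)
cpu_backend = pick_backend(False)
```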

AgitatedDove14 Same problem with clearml==1.1.5rc2 😞 , I also tried with backend==gloo , still same problem

  				
Posted 2 years ago by JitteryCoyote63

Hi AgitatedDove14 , How should we proceed to fix this bug? Should I open an issue in github? Should I try to make a minimal reproducible example? It’s blocking me atm

  				
Posted 2 years ago by JitteryCoyote63

Hi JitteryCoyote63
Somehow I thought it was solved 😞
1 ) Yes please add GitHub issue so we can keep track
2 )

Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE

Is this the main issue ?

  				
Posted 2 years ago by AgitatedDove14

The main issue is the task_logger.report_scalar() not reporting the scalars

  				
Posted 2 years ago by JitteryCoyote63

JitteryCoyote63 How can I reproduce it quickly?

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 I think it’s on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you 🙂

  				
Posted 2 years ago by JitteryCoyote63

Thanks JitteryCoyote63 , once we have a reproducible example the fix should be very quick to push (with these things reproducing it is the challenge)

  				
Posted 2 years ago by AgitatedDove14

Yes 😞 😄

  				
Posted 2 years ago by JitteryCoyote63

AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc

  				
Posted 2 years ago by JitteryCoyote63

I fixed it, will push a fix in pytorch-ignite 🙂

  				
Posted 2 years ago by JitteryCoyote63

Amazing! 🎉
Let me know how we can help 🙂

  				
Posted 2 years ago by AgitatedDove14

I opened an issue in ignite’s repo (https://github.com/pytorch/ignite/issues/2343) and a PR (https://github.com/pytorch/ignite/pull/2344), could you please have a look? There might be a bug in clearml Task.init in distributed envs

  				
Posted 2 years ago by JitteryCoyote63
