Answered
Hi, If I Am Starting My Training With The Following Command:

Hi, if I am starting my training with the following command:
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env train.py --config configs/train.yaml

and train.py creates a Task, will I be able to start this task remotely (clone and enqueue from the interface)? I.e., will ClearML be able to start the exact same command in an agent?
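For reference, a minimal sketch (with placeholder project/task names, not the actual script from the question) of the kind of Task creation assumed inside train.py:

# train.py -- hypothetical minimal skeleton
from clearml import Task

# Task.init registers the script, its arguments and the execution environment
# with the ClearML server, so the resulting task can later be cloned and
# enqueued for an agent to re-run.
task = Task.init(project_name="examples", task_name="ddp-training")

# ... build the model, data loaders and training loop here ...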

  
  
Posted 2 years ago

Answers 30


Hi AgitatedDove14, how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It’s blocking me at the moment

  
  
Posted 2 years ago

I fixed it, will push a fix in pytorch-ignite 🙂

  
  
Posted 2 years ago

Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE

Okay, a bit of theoretical "how it actually works" (and I might be mistaken here...)
Console logging is being reported because the underlying DDP infra (gloo) is piping stdout to the main process, where clearml will catch it (I think). The scalars not working on the subprocesses and the flush wait getting stuck are, I think, related: the wait actually waits for the flush process, and it seems it cannot actually "talk" to it, hence the hanging and the missing logs.
There was a fix in the latest RC that solved a similar issue (basically a forking race with internal Python state). Can you try with clearml==1.1.5rc2?
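(After upgrading, a trivial sketch of how to double-check which clearml version is actually being picked up:)

import clearml
print(clearml.__version__)  # expect "1.1.5rc2" after installing the release candidate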

  
  
Posted 2 years ago

yes

  
  
Posted 2 years ago

And I am wondering if only the main process (rank=0) should attach the ClearMLLogger or if all the processes within the node should do that

  
  
Posted 2 years ago

JitteryCoyote63 How can I reproduce it quickly?

  
  
Posted 2 years ago

I need to investigate further

  
  
Posted 2 years ago

Also, this is maybe a separate issue but could be linked: if I add Task.current_task().get_logger().flush(wait=True) like this:

def log_loss(engine):
    idist.barrier()
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")

then the program freezes and I have to abort manually. With wait=False it doesn’t freeze, but it still doesn’t report the scalars.

  
  
Posted 2 years ago

If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0), this will log as expected one value per process, so reporting works

JitteryCoyote63 and do prints get logged as well (from all processes) ?

  
  
Posted 2 years ago

Hi JitteryCoyote63
Somehow I thought it was solved 😞
1) Yes, please add a GitHub issue so we can keep track
2)

Task.current_task().get_logger().flush(wait=True)  # <-- WILL HANG HERE

Is this the main issue?

  
  
Posted 2 years ago

So probably only the main process (rank=0) should attach the ClearMLLogger?

  
  
Posted 2 years ago

JitteryCoyote63 maybe this is an old example of the pytorch DDP code? It is basically copy-pasted from the pytorch website:
https://pytorch.org/tutorials/intermediate/dist_tuto.html

  
  
Posted 2 years ago

The main issue is the task_logger.report_scalar() not reporting the scalars

  
  
Posted 2 years ago

AgitatedDove14 If I call explicitly task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0) , this will log as expected one value per process, so reporting works

  
  
Posted 2 years ago

Yes 😞 😄

  
  
Posted 2 years ago

Thanks JitteryCoyote63 , once we have a reproducible example the fix should be very quick to push (with these things reproducing it is the challenge)

  
  
Posted 2 years ago

Hi AgitatedDove14 , I investigated further and got rid of a separate bug. I was able to get ignite’s events fired, but still no scalars logged 😞
There is definitely something wrong going on with the reporting of scalars from multiple processes, because if my ignite callback is the following:

def log_loss(engine):
    idist.barrier()  # Sync all processes
    device = idist.device()
    print("IDIST", device)
    from clearml import Task
    Task.current_task().get_logger().report_text(f"{device}, FIRED, {engine.state.iteration}, {engine.state.metrics}")
    Task.current_task().get_logger().report_scalar("train", "loss", engine.state.metrics["loss"], engine.state.iteration)
    Task.current_task().get_logger().report_text(f"{device}, DONE REPORTING")

then all the reported texts are logged but not the scalars 🤔

  
  
Posted 2 years ago

ok, so even if that guy is attached, it doesn’t report the scalars

  
  
Posted 2 years ago

btw, in the pytorch_distributed_example I see that you average_gradients, but the pytorch docs (https://pytorch.org/tutorials/beginner/dist_overview.html) say:
DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
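For context, a minimal sketch of relying on DistributedDataParallel instead of a manual average_gradients step (hypothetical toy model, CPU/gloo so it stays runnable; assumes it is launched with torch.distributed.launch / torchrun so the rank environment variables are set):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="gloo")  # "nccl" would be used for GPU training

model = torch.nn.Linear(10, 1)
ddp_model = DDP(model)  # on CPU/gloo no device_ids are needed
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(4, 10)
loss = ddp_model(inputs).sum()
loss.backward()   # DDP all-reduces (averages) the gradients across replicas here
optimizer.step()  # so no explicit average_gradients() call is required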

  
  
Posted 2 years ago

I am actually calling the following later in the start_training function:

with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)

So my backend should be nccl and not gloo, right? Not sure how important it is; I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training.
I will try with clearml==1.1.5rc2

  
  
Posted 2 years ago

For the moment this is what I would be inclined to believe

  
  
Posted 2 years ago

And is Task.init called on all processes ?

  
  
Posted 2 years ago

Amazing! 🎉
Let me know how we can help 🙂

  
  
Posted 2 years ago

Yes, no reason to attach the second one (imho)
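A minimal sketch of what that could look like with ignite's distributed helpers (hypothetical names; the ClearMLLogger import path may differ across ignite versions):

import ignite.distributed as idist
from ignite.engine import Events
from ignite.contrib.handlers.clearml_logger import ClearMLLogger

def setup_clearml_logging(trainer):
    # Attach the ClearMLLogger only on the main process (rank 0);
    # all other ranks simply skip it.
    if idist.get_rank() == 0:
        clearml_logger = ClearMLLogger(project_name="examples", task_name="ddp-training")
        clearml_logger.attach_output_handler(
            trainer,
            event_name=Events.ITERATION_COMPLETED,
            tag="train",
            output_transform=lambda output: {"loss": output},
        )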

  
  
Posted 2 years ago

I opened an issue (https://github.com/pytorch/ignite/issues/2343) in ignite’s repo and a PR (https://github.com/pytorch/ignite/pull/2344), could you please have a look? There might be a bug in clearml Task.init in distributed envs

  
  
Posted 2 years ago

now realise that the ignite events callbacks seem to not be fired

So this is an ignite issue ?

  
  
Posted 2 years ago

AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc

  
  
Posted 2 years ago

AgitatedDove14 I think it’s on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you 🙂

  
  
Posted 2 years ago

AgitatedDove14 yes! I now realise that the ignite events callbacks seem to not be fired (I tried to print a debug message on a custom Events.ITERATION_COMPLETED) and I cannot see it logged

  
  
Posted 2 years ago

AgitatedDove14 Same problem with clearml==1.1.5rc2 😞 , I also tried with backend==gloo , still same problem

  
  
Posted 2 years ago