What's the clearml version? Is this with the latest from GitHub?
Can you reproduce this behavior outside of lightning? or in a toy example (because I could not)
That works fine:1631895370729 vm-aimbrain-01 info ClearML Task: created new task id=cfed3ea8512d4d9f858d085bd79e62e8 2021-09-17 16:16:10,744 - clearml.Task - INFO - No repository found, storing script code instead ClearML results page:
1631895370892 vm-aimbrain-01 info start 1631895370896 vm-aimbrain-01 error 0%| | 0/100 [00:00<?, ?it/s] 1631895471026 vm-aimbrain-01 error 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 100/100 [01:40<00:00, 1.00s/it] 1631895471027 vm-aimbrain-01 info done
I don't control tqdm, (otherwise I would have already gone for Stef's suggestion) - pytorch-lightning does in this particular script 😞 .
I found out that the lightning trainer has a progress_bar_refresh_rate
argument (default set to 1) which produces the spamming logs. If I set that to 10, I get 1/10th of the spam (but a janky progress bar in the console). I could set it to 0 to disable it, but that's not really a fix. What I'd really want is the same behaviour in the console (one smooth progress bar) and one line per epoch in the logs; high hopes, right? 😊
` # Development mode worker
worker {
# Status report period in seconds
report_period_sec: 2
# ping to the server - check connectivity
ping_period_sec: 30
# Log all stdout & stderr
log_stdout: true
# Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend
# Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds
console_cr_flush_period: 600
# compatibility feature, report memory usage for the entire machine
# default (false), report only on the running process and its sub-processes
report_global_mem_used: false
} `(excerpt)
In any case, do you have any suggestion of how I could at least hack tqdm to make it behave? Thanks
I think I know what the issue is, it seems tqdm is using Unicode for the CR this is the 1b 5b 41
sequence I see on the binary log.
Let me see if I can hack something for you to test 🙂
I might have found it, tqdm is sending{ 1b 5b 41 } unicode arrow up?
https://github.com/horovod/horovod/issues/2367
In case anyone is interested, the minimum effort workaround I found is to edit pytorch_lightning/callbacks/progress.py
and change all occurrences of dynamic_ncols=True
to dynamic_ncols=False
in the calls to tqdm
. One could of course implement a custom callback inheriting from their ProgressBar
class.
Okay, some progress, so what is the difference ?
Any chance the issue can be reproduced with a small toy code ?
Can you run the tqdm loop inside the code that exhibits the CR issue ? (maybe some initialization thing that is causing it to ignore the value?!)
Hi WittyOwl57 , that is awesome fix! what does "dynamic_ncols" change?
it seems like a tqdm parameter, not sure what it does ...
WittyOwl57 this is what I'm getting on my console (Notice New lines! not a single one overwritten as I would expect)
` Training: 10%|█ | 1/10 [00:00<?, ?it/s]
Training: 20%|██ | 2/10 [00:00<00:00, 9.93it/s]
Training: 30%|███ | 3/10 [00:00<00:00, 9.89it/s]
Training: 40%|████ | 4/10 [00:00<00:00, 9.87it/s]
Training: 50%|█████ | 5/10 [00:00<00:00, 9.87it/s]
Training: 60%|██████ | 6/10 [00:00<00:00, 9.88it/s]
Training: 70%|███████ | 7/10 [00:00<00:00, 9.88it/s]
Training: 80%|████████ | 8/10 [00:00<00:00, 9.88it/s]
Training: 90%|█████████ | 9/10 [00:00<00:00, 9.87it/s]
Training: 100%|██████████| 10/10 [00:00<00:00, 9.87it/s]
Training: 11it [00:01, 9.85it/s]
done
Process finished with exit code 0 `
This is adapted from one of the methods in their ProgressBar
classfrom tqdm import tqdm bar = tqdm( desc="Training", initial=1, position=1, disable=False, leave=False, dynamic_ncols=True, file=sys.stderr, smoothing=0) with bar: for i in range(10): time.sleep(0.1) bar.update() print('done')
In the console this works as expected, but in a jupyter notebook this produces a scrolling log (because of the position=1 argument, which happens whenever the bar is not the main bar).
If I add the task before the call, the issue is fully reproduced. Finally!2021-10-07 23:50:06 ClearML Task: created new task id=38cc10401fcc43cfa432b7ceed7df0cc 2021-10-07 22:50:06,194 - clearml.Task - INFO - No repository found, storing script code instead ClearML results page:
2021-10-07 23:50:06 Training: 10%|█████████████████ | 1/10 [00:00<?, ?it/s] A 2021-10-07 23:50:06 ClearML new version available: upgrade to v1.1.2 is recommended! 2021-10-07 23:50:06 Training: 20%|████████████████████████████████▌ | 2/10 [00:00<00:00, 9.95it/s]A Training: 30%|████████████████████████████████████████████████▉ | 3/10 [00:00<00:00, 9.92it/s]A Training: 40%|█████████████████████████████████████████████████████████████████▏ | 4/10 [00:00<00:00, 9.90it/s]A Training: 50%|█████████████████████████████████████████████████████████████████████████████████▌ | 5/10 [00:00<00:00, 9.90it/s]A Training: 60%|█████████████████████████████████████████████████████████████████████████████████████████████████▊ | 6/10 [00:00<00:00, 9.91it/s]A Training: 70%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 7/10 [00:00<00:00, 9.91it/s]A Training: 80%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍ | 8/10 [00:00<00:00, 9.91it/s]A Training: 90%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 9/10 [00:00<00:00, 9.91it/s]A Training: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 9.91it/s]A 2021-10-07 23:50:07 done 2021-10-07 23:50:07 2021-10-07 22:50:07,430 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis 2021-10-07 23:50:07 Training: 11it [00:01, 9.91it/s] 2021-10-07 23:50:12 2021-10-07 22:50:12,286 - clearml.Task - INFO - Finished repository detection and package analysis
This is the full script:
` import sys
from clearml import Task
import time
from tqdm import tqdm
task = Task.init(project_name='debug', task_name='test tqdm cr cl reproduced finally!')
bar = tqdm(
total=10,
desc="Training",
initial=1,
position=1,
disable=False,
leave=True,
dynamic_ncols=True,
file=sys.stderr,
smoothing=0)
with bar:
for i in range(10):
time.sleep(0.1)
bar.update()
print('done') `
The UI shows the log as is (and as pasted above). In the console I'm getting correct output (a single tqdm progress line):[2021-09-17 13:29:51,860][pytorch_lightning.utilities.distributed][INFO] - GPU available: True, used: True [2021-09-17 13:29:51,862][pytorch_lightning.utilities.distributed][INFO] - TPU available: False, using: 0 TPU cores [2021-09-17 13:29:51,862][pytorch_lightning.utilities.distributed][INFO] - IPU available: False, using: 0 IPUs [2021-09-17 13:29:51,866][pytorch_lightning.accelerators.gpu][INFO] - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [3] Epoch 34: 100%|███████████████████████████████████████████████████████████| 361/361 [02:05<00:00, 2.89it/s, loss=0.151, v_num=9-29]
task = Task.init(project_name='debug', task_name='test tqdm cr cl') print('start') for i in tqdm.tqdm(range(100), dynamic_ncols=True,): sleep(1) print('done')
This code snippet works as expected (console will show the progress at the flush interval without values in between). What's the difference ?!
What I'd really want is the same behaviour in the console (one smooth progress bar) and one line per epoch in the logs; high hopes, right?
I think they send some "odd" character instead of CR, otherwise I cannot explain the difference.
Can you point to a toy example demonstrating the same issue ?
Also I just tried the pytorch-lightning
RichProgressBar
(not yet released) instead of the default (which is unfortunately based on tqdm) and it works great.
Yey!
Interesting, I don't get newlines in any of my consoles:ClearML Task: overwriting (reusing) task id=38cc10401fcc43cfa432b7ceed7df0cc 2021-10-08 14:57:53,704 - clearml.Task - INFO - No repository found, storing script code instead ClearML results page:
ClearML new version available: upgrade to v1.1.2 is recommended!███████████████████████████████████████████████████████████▏ | 4/10 [00:00<00:00, 9.80it/s] doneining: 11it [00:01, 9.87it/s] Training: 11it [00:01, 9.87it/s] 2021-10-08 14:57:54,949 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis 2021-10-08 14:57:59,936 - clearml.Task - INFO - Finished repository detection and package analysis
or (after install in the new clearml to get rid of the log line):ClearML Task: overwriting (reusing) task id=38cc10401fcc43cfa432b7ceed7df0cc 2021-10-08 14:59:03,762 - clearml.Task - INFO - No repository found, storing script code instead ClearML results page:
doneining: 11it [00:01, 9.89it/s] Training: 11it [00:01, 9.89it/s] 2021-10-08 14:59:05,003 - clearml.Task - INFO - Waiting for repository detection and full package requirement analysis 2021-10-08 14:59:09,722 - clearml.Task - INFO - Finished repository detection and package analysis
In any case, do you have any suggestion of how I could at least hack tqdm to make it behave? Thanks 🙏
I'm been overly optimistic it seems (based on the hints from this issue: https://stackoverflow.com/questions/59455268/how-to-disable-progress-bar-in-pytorch-lightning ). It really looked like it worked. But I was mislead by the fact that the issue has a peculiar pattern to it: it doesn't appear at the beginning of each epoch. There seem to be a couple of extra cr / flushes at the beginning and then a (puzzling) extra "Validation" log line seems to trigger the continuous scrolling. Anyway, my bad,, that doesn't fix things 😢 .
` from time import sleep
from clearml import Task
import tqdm
task = Task.init(project_name='debug', task_name='test tqdm cr cl')
print('start')
for i in tqdm.tqdm(range(100)):
sleep(1)
print('done') `The above example code will output a line every 10 seconds (with the default console_cr_flush_period=10) , can you verify it works for you?
I mean just add the toy tqdm loop somewhere just before starting the lightning train function. I just want to verify that it works, or maybe there is something in the specific setup happening in real-time that changes it
Hi Martin, it is a tqdm parameter (the default ProgressBar
in pytorch lightning is unfortunately relying on tqdm). This is from the tqdm docs:dynamic_ncols : bool, optional If set, constantly alters
ncolsand
nrowsto the environment (allowing for window resizes) [default: False]. nrows : int, optional The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20. ncols : int, optional The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).
Also I just tried the pytorch-lightning RichProgressBar
(not yet released) instead of the default (which is unfortunately based on tqdm) and it works great.
Could it be you have two entries of "console_cr_flush_period" ?
WittyOwl57 I can verify the issue reproduces! 🎉 !
And I know what happens, TQDM is sending an "up arrow" key, if you are running inside bash, that looks like CR (i.e. move the cursor to the begining of the line), but when running inside other terminals (like PyCharm or ClearML log) this "arrow-key" is just unicode character to print, it does nothing, and we end up with multiple lines.
Let me see if we can fix it 🙂
radu on vm-aimbrain-01 in experiments/runners/all via :snake: v3.8.5 via C volt ❯ grep flush ~/clearml.conf # Carriage return (\r) support. If zero (0) \r treated as \n and flushed to backend # Carriage return flush support in seconds, flush consecutive line feeds (\r) every X (default: 10) seconds console_cr_flush_period: 600