Yeah, I am fine not having the console logging. My issue is that the scalars and debug images occasionally don't record to ClearML.
Thanks @<1719524641879363584:profile|ThankfulClams64>, having code that can reproduce it is exactly what we need.
One thing I might have missed, and it is very important: what is your tensorboard package version?
@<1719524641879363584:profile|ThankfulClams64> , are logs showing up without issue on the 'problematic' machine?
Yes it is logging to the console. The script does hang whenever it completes all the epochs when it is having the issue.
The machine currently having the issue is on tensorboard==2.16.2
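Since the same code behaves differently across machines, it may help to compare installed package versions on each one. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, and the package list is just the three packages mentioned in this thread:

```python
from importlib import metadata


def installed_versions(package_names):
    """Return {name: version string, or None if the package is not installed}."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Run this on each machine and compare the output line by line.
for name, version in installed_versions(["clearml", "tensorboard", "tensorflow"]).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Comparing the printed lines between the working machines and the problematic one should show whether a version mismatch is even a candidate cause.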
Hi @<1719524641879363584:profile|ThankfulClams64>, the logging is done by a separate process, and I'm pretty sure it's not terminating all of a sudden. Did you manage to get a full log of such an experiment to share?
task.connect(model_config)
task.connect(DataAugConfig)
If these are separate dictionaries, you should probably use two sections:
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")
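The same idea generalizes to any number of dictionaries. A small sketch, assuming `task` is a ClearML `Task` (for example from `Task.init`); `connect_sections` is a hypothetical helper, not part of the ClearML SDK:

```python
def connect_sections(task, sections):
    """Connect each configuration dict under its own named section.

    `task` is assumed to be a clearml Task; `sections` maps a section
    name to a configuration dict.
    """
    connected = {}
    for section_name, config in sections.items():
        # Keep whatever Task.connect returns so the caller can use it
        # in place of the original dict.
        connected[section_name] = task.connect(config, name=section_name)
    return connected


# Usage (needs a configured ClearML environment, so left as comments):
# from clearml import Task
# task = Task.init(project_name="examples", task_name="sections")  # placeholder names
# connect_sections(task, {"model config": model_config, "data aug": DataAugConfig})
```

Giving each dictionary its own section name keeps their keys from being merged into one section in the UI.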
It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
Wait, so you are seeing some scalars?
while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
what are you seeing in your TB?
@<1719524641879363584:profile|ThankfulClams64> , if you set auto_connect_streams to false nothing will be reported from your frameworks. With what frameworks are you working, tensorboard?
Can you try with auto_connect_streams=True? Also, what version of the clearml SDK are you using?
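For reference, a sketch of how the two relevant `Task.init` arguments could be set side by side. As far as I understand the documented signature, `auto_connect_streams` controls console capture (stdout, stderr, logging) and `auto_connect_frameworks` controls framework hooks such as TensorBoard; treat that split as an assumption to verify against your SDK version. `build_task_init_kwargs` is a hypothetical helper and the project and task names are placeholders:

```python
def build_task_init_kwargs(project_name, task_name,
                           capture_console=True, capture_tensorboard=True):
    """Build keyword arguments for clearml's Task.init (sketch only)."""
    return {
        "project_name": project_name,
        "task_name": task_name,
        # Console capture: stdout / stderr / logging.
        "auto_connect_streams": capture_console,
        # Framework hooks: only the TensorBoard key is set explicitly here.
        "auto_connect_frameworks": {"tensorboard": capture_tensorboard},
    }


kwargs = build_task_init_kwargs("examples", "debug missing scalars",
                                capture_console=True)
print(kwargs)

# Usage (needs a configured ClearML environment, so left as a comment):
# from clearml import Task
# task = Task.init(**kwargs)
```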
I'll update my clearml version. Unfortunately I do not have a small code snippet and it is not always repeatable. Is there some additional logging that can be turned on?
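One generic option, independent of ClearML, is Python's built-in `faulthandler`, which can periodically dump the stack of every thread in the training process so a hang can be located. A minimal sketch; the helper names, the timeout, and the log path are all hypothetical choices:

```python
import faulthandler


def arm_hang_watchdog(timeout_s, log_path):
    """Dump all thread stacks to log_path every timeout_s seconds until disarmed."""
    log_file = open(log_path, "w")
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file  # the caller must keep this reference so the file stays open


def disarm_hang_watchdog(log_file):
    """Stop the periodic dumps and close the log file."""
    faulthandler.cancel_dump_traceback_later()
    log_file.close()


# Usage sketch:
# watchdog = arm_hang_watchdog(600, "hang_traces.log")
# ... run training ...
# disarm_hang_watchdog(watchdog)
```

This only covers threads in the main Python process, not a separate reporting subprocess. If the hang happens at interpreter exit, leaving the watchdog armed instead of disarming it after training may capture more.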
So I am only seeing values for the first epoch. It seems like it does not track all of them so maybe something is happening when it tries to log scalars.
I have seen it only log iterations but setting task.set_initial_iteration(0) seemed to fix that so it now seems to be logging the correct epoch
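To rule out any ambiguity between epochs and iterations, scalars can also be reported explicitly with the epoch as the iteration value. A sketch assuming `logger` is a ClearML `Logger` (for example from `task.get_logger()`); `report_epoch_scalars` and the `"train"` series name are hypothetical:

```python
def report_epoch_scalars(logger, epoch, metrics, series="train"):
    """Report each metric once, using the epoch number as the x-axis value."""
    for title, value in metrics.items():
        logger.report_scalar(title=title, series=series,
                             value=float(value), iteration=epoch)


# Usage (needs a configured ClearML environment, so left as comments):
# logger = task.get_logger()
# report_epoch_scalars(logger, epoch=3, metrics={"loss": 0.42, "accuracy": 0.91})
```

If explicitly reported values reach the server while the automatically captured TensorBoard scalars do not, that would narrow the problem to the auto-logging path.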
Tensorboard is correct and works. I have never seen an issue in the tensorboard logs
Hi @<1719524641879363584:profile|ThankfulClams64> , stopping all processes should do that, there is no programmatic way of doing that specifically. Did you try calling task.close() for all tasks you're using?
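A sketch of closing a task deterministically, even when training raises. It assumes `task` is a ClearML `Task`; `run_and_close` is a hypothetical wrapper, and it may not avoid the hang itself if the flush is what blocks:

```python
def run_and_close(task, train_fn):
    """Run train_fn, then flush pending reports and close the task, even on error."""
    try:
        return train_fn()
    finally:
        # Ask the task to send whatever is still queued before closing it.
        task.flush(wait_for_uploads=True)
        task.close()


# Usage (needs a configured ClearML environment, so left as comments):
# from clearml import Task
# task = Task.init(project_name="examples", task_name="close demo")  # placeholder names
# run_and_close(task, lambda: model.fit(...))
```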
We are running the same code on multiple machines and it just randomly happens. Currently we are having the issue on 1 out of 4
Yes I see it in the terminal on the machine
I will try with clearml==1.16.3rc2 and see if it still has the issue
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True, but I believe I will still have the issue.
Not sure if this is helpful, but this is what I get when I Ctrl-C out of the hung script:
^C^CException ignored in atexit callback: <bound method Reporter._handle_program_exit of <clearml.backend_interface.metrics.reporter.Reporter object at 0x70fd8b7ff1c0>>
Event reporting sub-process lost, switching to thread based reporting
Traceback (most recent call last):
  File "/home/richard/.virtualenvs/temp_clearml/lib/python3.10/site-packages/clearml/backend_interface/metrics/reporter.py", line 317, in _handle_program_exit
    self.wait_for_events()
  File "/home/richard/.virtualenvs/temp_clearml/lib/python3.10/site-packages/clearml/backend_interface/metrics/reporter.py", line 337, in wait_for_events
    return report_service.wait_for_events(timeout=timeout)
  File "/home/richard/.virtualenvs/temp_clearml/lib/python3.10/site-packages/clearml/backend_interface/metrics/reporter.py", line 129, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/home/richard/.virtualenvs/temp_clearml/lib/python3.10/site-packages/clearml/utilities/process/mp.py", line 445, in wait
    return self._event.wait(timeout=timeout)
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/usr/lib/python3.10/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
KeyboardInterrupt:
I do have uncommitted code changes. I can try to check at some point whether the problem goes away without them. It seems like it could be reproduced just by making a git repo with that script and adding a very large file. If I can reproduce it, is it best to open an issue on GitHub?
I just created a new virtual environment and the problem persists. There are only two dependencies: clearml and tensorflow. @<1523701070390366208:profile|CostlyOstrich36> what logs are you referring to?
@<1719524641879363584:profile|ThankfulClams64> , can you provide a small code snippet that reproduces this behaviour? Can you also test with the latest version of clearml ?
It was working for me. Anyway, I modified the callback. Attached is the script that has the issue for me. Whenever I add random_image_logger to the callbacks, it only logs some of the scalars for 1 epoch. It then gets stuck and never recovers. When I remove random_image_logger, the scalars are logged correctly. Again, this happens on only 1 computer; on our other computers logging works perfectly fine.
When the script is hung at the end, the experiment shows as failed in ClearML.
I am still having this issue. An update is that "abort" does not work. Even though the state is correctly tracked in ClearML, when I try to abort the experiment through the UI it says it aborted, but the experiment remains running on the computer.
Thank you @<1719524641879363584:profile|ThankfulClams64> for opening the GitHub issue, hopefully we will be able to reproduce it and fix it quickly.
Is this just the console output while training?
I'm not sure how to even troubleshoot this.
There is clearly some connection to the ClearML server, as the task remains "running" for the entire training session, but there are no metrics or debug samples. And I see nothing in the logs to indicate there is an issue.
Is there some way to kill all connections from a machine to the ClearML server? This does seem to be related to restarting a task, or running a new task, quickly after a task fails or is aborted.
Yes, it shows in the UI and has the first epoch for some of the metrics, but that's it. It has run about 50 epochs and says it is still running, but there are no updates to the scalars or debug samples.