Yes, TensorBoard. It is still logging the TensorBoard scalars and images; it just doesn't log the console output
My bad, if you set auto_connect_streams to false, you basically disable the console logging... Please see the documentation:
auto_connect_streams (Union[bool, Mapping[str, bool]]) – Control the automatic logging of stdout and stderr.
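For example, a minimal sketch (project/task names here are placeholders):

```python
from clearml import Task

# Re-enable automatic capture of stdout/stderr so console output is logged
task = Task.init(
    project_name="my_project",   # placeholder
    task_name="my_experiment",   # placeholder
    auto_connect_streams=True,
)
```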
Can you try with auto_connect_streams=True? Also, what version of the ClearML SDK are you using?
That makes sense... If you turn auto_connect_streams to false, this means that auto reporting will be disabled, as per the documentation. If you turn it to True, then logging should resume.
I'll update my clearml version. Unfortunately I do not have a small code snippet and it is not always reproducible. Is there some additional logging that can be turned on?
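For example, would something like this help? (I'm guessing at the exact knobs here):

```python
import logging
import os

# Hypothetical: turn up the SDK's own log verbosity before Task.init,
# assuming the SDK reads this environment variable
os.environ["CLEARML_LOG_LEVEL"] = "DEBUG"
logging.basicConfig(level=logging.DEBUG)  # surface clearml's internal loggers
```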
Yeah, I am fine not having the console logging. My issue is that the scalars and debug images occasionally don't get recorded to ClearML
There is clearly some connection to the ClearML server, as the task remains "running" for the entire training session, but there are no metrics or debug samples. And I see nothing in the logs to indicate there is an issue
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True, but I believe I will still have the issue
Do you also see the same in the terminal itself on the machine?
Okay, I will do another run to capture the console output. We currently set auto_connect_streams to False to reduce the number of API calls, so there isn't really anything in the ClearML task page's console section
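Roughly, our init looks like this (real names elided):

```python
from clearml import Task

# What we currently do: disable stdout/stderr capture entirely
task = Task.init(
    project_name="...",          # actual values elided
    task_name="...",
    auto_connect_streams=False,  # cuts the console-log API calls
)
```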
Hi @<1719524641879363584:profile|ThankfulClams64> , does the experiment itself show on the ClearML UI?
The same training sometimes works, but I'm not sure how to troubleshoot when it stops logging the metrics
@<1719524641879363584:profile|ThankfulClams64> , if you set auto_connect_streams to false nothing will be reported from your frameworks. Which frameworks are you working with? TensorBoard?
What happens if you're running the reporting example from the ClearML github repository?
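The core of that example is just explicit reporting, roughly along these lines (a sketch, not the verbatim repo script):

```python
from clearml import Task

# Explicit scalar reporting, independent of any framework auto-logging
task = Task.init(project_name="examples", task_name="scalar reporting test")
logger = task.get_logger()

for iteration in range(100):
    logger.report_scalar(
        title="loss", series="train", value=1.0 / (iteration + 1), iteration=iteration
    )
```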
Thank you @<1719524641879363584:profile|ThankfulClams64> for opening the GitHub issue, hopefully we will be able to reproduce it and fix it quickly
When the script is hung at the end, the experiment shows as failed in ClearML
Yes I see it in the terminal on the machine
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything; it just continues to run
If you remove any reference of ClearML from the code on that machine, does it still hang?
It was working for me. Anyway, I modified the callback. Attached is the script that has the issue for me: whenever I add random_image_logger
to the callbacks, it only logs some of the scalars for 1 epoch, then gets stuck and never recovers. When I remove random_image_logger,
the scalars are correctly logged. Again, this is only on 1 computer; on our other computers, logging works perfectly fine
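Simplified, the callback does something like this (a paraphrase with the real image/model details stripped out):

```python
import torch
from pytorch_lightning.callbacks import Callback

class RandomImageLogger(Callback):
    """Sketch of the callback: log a random image to TensorBoard each epoch."""

    def on_epoch_end(self, trainer, pl_module):
        # trainer.logger.experiment is the underlying SummaryWriter
        img = torch.rand(3, 64, 64)  # stand-in for the real image tensor
        trainer.logger.experiment.add_image(
            "random_image", img, global_step=trainer.current_epoch
        )
```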
Another thing I notice is that aborting the experiment does not work when this is happening. It just continues to run
Does any exit code appear? What is the status message and status reason in the 'INFO' section?
STATUS MESSAGE: N/A
STATUS REASON: Signal None
It is still getting stuck. I think the issue might have something to do with iterations versus epochs. I notice that one of the scalars that gets logged early uses the epoch as its step, while the remaining scalars seem to use iterations, because the iteration value is 1355 instead of 26
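In other words, I suspect the scalars disagree on what global_step means, something like this (illustrative numbers):

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()
iters_per_epoch = 52  # illustrative: 26 epochs * 52 iters ends near step 1352

for epoch in range(26):
    # one scalar is stepped per epoch...
    writer.add_scalar("lr", 0.001, global_step=epoch)
    for it in range(iters_per_epoch):
        step = epoch * iters_per_epoch + it
        # ...while the rest are stepped per iteration (ends near 1355, not 26)
        writer.add_scalar("loss/train", 1.0 / (step + 1), global_step=step)
```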
I am still having this issue. An update is that the "abort" does not work: even though the state is correctly tracked in ClearML, when I try to abort the experiment through the UI it says it aborted it, but the experiment remains running on the computer.
Just to make sure, did the logging to the ClearML server work previously and stop working at some point?
The console logging still works. The abort showed up in the log but did not work, and the process continued until I killed it.
Hi @<1719524641879363584:profile|ThankfulClams64>
I am using ClearML Pro and pretty regularly I will restart an experiment and nothing will get logged to ClearML.
I use ClearML with PyTorch 1.7.1, PyTorch Lightning 1.2.2, and TensorBoard auto-logging.
All ClearML components have the latest stable updates (clearml 1.7.4, clearml-agent 1.7.2).
Is this still happening with the latest clearml (clearml==1.16.3rc2)?
What is the TB version?
I remember a fix regarding Lightning support
Also, just making sure: are you using the default Lightning TB logger?
How are you initializing Task.init? (i.e. could you copy the code here?)
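i.e. something along these lines (a generic sketch; the exact arguments are what we're after):

```python
from clearml import Task

# A typical init for TensorBoard auto-logging
task = Task.init(
    project_name="...",  # your values here
    task_name="...",
    auto_connect_frameworks={"tensorboard": True},  # the default; shown for clarity
)
```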