The machine currently having the issue is on tensorboard==2.16.2
Thanks @<1719524641879363584:profile|ThankfulClams64>, having code that can reproduce it is exactly what we need.
One thing I might have missed, and it is very important: what is your tensorboard package version?
Sometimes I get no scalars, but the console logging always seems to be working
Okay, I will do another run to capture the console output. We currently set auto_connect_streams to False to reduce the number of API calls, so there isn't really anything in the console section of the ClearML task page.
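For reference, a minimal sketch of that setup (the same Task.init call shown further down, only with auto_connect_streams=False; model_config comes from our own config file):

from clearml import Task

# Current setup: console/stdout stream auto-reporting is disabled to reduce API calls
task = Task.init(project_name=model_config['ClearML']['project_name'],
                 task_name=model_config['ClearML']['task_name'],
                 continue_last_task=False,
                 auto_connect_streams=False)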
The same training works sometimes, but I'm not sure how to troubleshoot when it stops logging the metrics
Any chance you have some uncommitted code changes such that, when they are not included, this works fine?
I am on 1.16.2
task = Task.init(project_name=model_config['ClearML']['project_name'],
                 task_name=model_config['ClearML']['task_name'],
                 continue_last_task=False,
                 auto_connect_streams=True)
I'm not sure if it still reports logs, but it will continue running on the machine
So even if you abort it at the start of the experiment, it will keep running and reporting logs?
Hi @<1719524641879363584:profile|ThankfulClams64> , does the experiment itself show on the ClearML UI?
That makes sense... If you set auto_connect_streams to False, auto reporting will be disabled, as per the documentation. If you set it to True, then logging should resume.
Yes, it is logging to the console. When it is having the issue, the script also hangs after it completes all the epochs.
I do have uncommitted code changes. I can try to check at some point whether it would have the problem without them. It seems like it can be repeated just by making a git repo with that script and adding a very large file. If I can repeat it, is it best to open an issue on GitHub?
It is still getting stuck. I think the issue might have something to do with iterations versus epochs. I notice that one of the scalars that gets logged early reports per epoch, while the remaining scalars seem to report per iteration, because the iteration value is 1355 instead of 26
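(For reference, one way to check the epoch/iteration mismatch would be to report the per-epoch scalars manually with an explicit iteration value through the ClearML Logger. This is just a sketch, not part of the original script; val_loss and epoch here are hypothetical variables from the training loop:)

from clearml import Logger

# Sketch: report a per-epoch metric with the epoch number as the iteration,
# so it does not get mixed up with the per-batch iteration counter (e.g. 1355)
Logger.current_logger().report_scalar(
    title='val_loss', series='val', value=val_loss, iteration=epoch)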
It is not always reproducible. It seems like something we do not understand happens, and then the machine consistently has this issue. We believe it has something to do with stopping and starting experiments
I'm not sure how to even troubleshoot this.
@<1719524641879363584:profile|ThankfulClams64>, if you set auto_connect_streams to False, nothing will be reported from your frameworks. Which frameworks are you working with, TensorBoard?
If you remove any reference of ClearML from the code on that machine, does it still hang?
The console logging still works. The abort showed up in the log but did not actually stop anything, and the process continued until I killed it.
I am using 1.15.0. Yes, I can try with auto_connect_streams set to True; I believe I will still have the issue.
So I was able to repeat the same behavior on a machine running this example None by adding the following callback:
import io

import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow.keras.callbacks import TensorBoard


class TensorBoardImage(TensorBoard):
    """TensorBoard callback that also writes a validation image summary every epoch."""

    @staticmethod
    def make_image(tensor):
        # Stack the 2D grayscale array into 3 channels and PNG-encode it,
        # using the TF1-style tf.Summary protobuf API
        tensor = np.stack((tensor, tensor, tensor), axis=2)
        height, width, channels = tensor.shape
        image = Image.fromarray(tensor)
        output = io.BytesIO()
        image.save(output, format='PNG')
        image_string = output.getvalue()
        output.close()
        return tf.Summary.Image(height=height,
                                width=width,
                                colorspace=channels,
                                encoded_image_string=image_string)

    def on_epoch_end(self, epoch, logs=None):
        if logs is None:
            logs = {}
        super(TensorBoardImage, self).on_epoch_end(epoch, logs)
        images = self.validation_data[0]  # 0 - data; 1 - labels
        img = (255 * images[0].reshape(28, 28)).astype('uint8')
        image = self.make_image(img)
        summary = tf.Summary(value=[tf.Summary.Value(tag='image', image=image)])
        self.writer.add_summary(summary, epoch)
So it seems like there is some bug in how ClearML is logging TensorBoard images that causes everything to fail
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything; it just continues to run
Hi @<1719524641879363584:profile|ThankfulClams64>, stopping all processes should do that; there is no programmatic way of doing that specifically. Did you try calling task.close() for all tasks you're using?
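For example, a minimal sketch (where exactly you call it depends on your script):

from clearml import Task

# Explicitly close the current task so background reporting threads flush
# and shut down before the next run starts
task = Task.current_task()
if task is not None:
    task.close()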
Is there some way to kill all of a machine's connections to the ClearML server? This does seem to be related to restarting a task / running a new task quickly after a task fails or is aborted
Running clearml_example.py in None reproduces the issue
It seems similar to this None. Is it possible that saving too many model weights causes the metric logging thread to die?
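(If it helps to test that hypothesis, one option would be to keep TensorBoard reporting on but turn off automatic model-checkpoint logging via auto_connect_frameworks. This is just a sketch; the exact framework keys are an assumption and the project/task names are placeholders:)

from clearml import Task

# Sketch: keep tensorboard scalars/images, but stop ClearML from auto-registering
# every saved TensorFlow/Keras model checkpoint
task = Task.init(project_name='...', task_name='...',
                 auto_connect_frameworks={'tensorboard': True, 'tensorflow': False})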
@<1719524641879363584:profile|ThankfulClams64> you could try using the compare function in the UI to compare the experiments from the machine where the scalars are not reported properly against the experiments from a machine that runs them properly. I suggest then replicating the environment exactly on the problematic machine. None
Yes, it shows in the UI and has the first epoch for some of the metrics, but that's it. It has run about 50 epochs and says it is still running, but there are no updates to the scalars or debug samples