Thanks @<1719524641879363584:profile|ThankfulClams64> , having code that reproduces it is exactly what we need.
One thing I might have missed, and it is very important: what is your tensorboard package version?
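A quick way to collect the versions asked about here is a small script like the one below. It is only a sketch: the helper name `installed_version` is made up for illustration, and the package list is an assumption about what is relevant to this thread.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages assumed to matter for this thread; adjust the list as needed.
for name in ("clearml", "tensorflow", "tensorboard", "keras"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running it inside the same virtual environment as the training script avoids reporting versions from a different interpreter.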
Correct, so I get something like this
ClearML Task: created new task id=6ec57dcb007545aebc4ec51eb5b34c67
======> WARNING! Git diff too large to store (2536kb), skipping uncommitted changes <======
ClearML results page:
but that is all
Not sure why that is related to saving images
I found that setting store_uncommitted_code_diff: false instead of true seems to fix the issue
Hi @<1719524641879363584:profile|ThankfulClams64> , stopping all processes should do that; there is no programmatic way of doing that specifically. Did you try calling task.close() for all tasks you're using?
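As a rough sketch of what "calling task.close() for all tasks" could look like: the helper below is hypothetical (the name `close_all` is not part of ClearML) and only assumes that each object exposes a `close()` method, as ClearML `Task` objects do.

```python
def close_all(tasks):
    """Call close() on every task and collect failures instead of stopping early."""
    errors = []
    for task in tasks:
        try:
            task.close()
        except Exception as exc:  # keep going so the remaining tasks still get closed
            errors.append((task, exc))
    return errors


# Hypothetical usage, assuming `task` came from clearml.Task.init(...):
# failed = close_all([task])
```

Returning the failures rather than raising makes it easier to see whether a particular task refuses to close.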
So even if you abort it at the start of the experiment, it will keep running and reporting logs?
I just created a new virtual environment and the problem persists. There are only two dependencies: clearml and tensorflow. @<1523701070390366208:profile|CostlyOstrich36> what logs are you referring to?
So I was able to repeat the same behavior on a machine running this example None
by adding the following callback
class TensorBoardImage(TensorBoard):
    @staticmethod
    def make_image(tensor):
        from PIL import Image
        import io
        tensor = np.stack((tensor, tensor, tensor), axis=2)
        height, width, channels = tensor.shape
        image = Image.fromarray(tensor)
        output = io.BytesIO()
        image.save(output, format='PNG')
        image_string = output.getvalue()
        output.close()
        return tf.Summary.Image(height=height,
                                width=width,
                                colorspace=channels,
                                encoded_image_string=image_string)

    def on_epoch_end(self, epoch, logs=None):
        if logs is None:
            logs = {}
        super(TensorBoardImage, self).on_epoch_end(epoch, logs)
        images = self.validation_data[0]  # 0 - data; 1 - labels
        img = (255 * images[0].reshape(28, 28)).astype('uint8')
        image = self.make_image(img)
        summary = tf.Summary(value=[tf.Summary.Value(tag='image', image=image)])
        self.writer.add_summary(summary, epoch)
So it seems like there is some bug in how ClearML is logging tensorboard images that causes everything to fail
We are running the same code on multiple machines and it just randomly happens. Currently we are having the issue on 1 out of 4
Yes, it shows on the UI and has the first epoch for some of the metrics, but that's it. It has run about 50 epochs and says it is still running, but there are no updates to the scalars or debug samples.
Is there some way to kill all connections from a machine to the ClearML server? This does seem to be related to restarting a task or running a new task quickly after a task fails or is aborted.
Console output and also what you get on the ClearML task page under the console section
I'm not sure how to even troubleshoot this.
Yes, it is logging to the console. When the issue occurs, the script also hangs once it completes all the epochs.
task.connect(model_config)
task.connect(DataAugConfig)
If these are separate dictionaries, you should probably use two sections:
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")
It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
Wait, so you are seeing some scalars?
while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
what are you seeing in your TB?
Hi, we are currently having the issue. There is nothing in the console regarding ClearML besides:
ClearML Task: created new task id=0174d5b9d7164f47bd10484fd268e3ff
======> WARNING! Git diff too large to store (3611kb), skipping uncommitted changes <======
ClearML results page:
The console logs continue to come in, but no scalars or debug images show up.
It seems similar to this None. Is it possible that saving too many model weights causes the metric logging thread to die?
I do have uncommitted code changes. I can try to check at some point whether the problem goes away without them. It seems like it could be reproduced just by making a git repo with that script and adding a very large file. If I can reproduce it, is it best to open an issue on GitHub?
I'm not sure if it still reports logs. But it will continue running on the machine
@<1719524641879363584:profile|ThankfulClams64> , are logs showing up without issue on the 'problematic' machine?
Thank you @<1719524641879363584:profile|ThankfulClams64> for opening the GitHub issue, hopefully we will be able to reproduce it and fix it quickly
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything; it just continues to run
Yes I see it in the terminal on the machine
Can you try with auto_connect_streams=True? Also, what version of the clearml sdk are you using?
Running clearml_example.py in None reproduces the issue
Hi @<1719524641879363584:profile|ThankfulClams64> ! What tensorflow/keras version are you using? I noticed that in the TensorBoardImage callback you are using tf.Summary, which no longer exists since tensorflow 2.2.3, which I believe is too old to work with tensorboard==2.16.2.
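For comparison, with the TensorFlow 2.x summary API an image is usually written through `tf.summary.create_file_writer` and `tf.summary.image` rather than `tf.Summary`. The sketch below is not the callback from this thread: the function name `write_image_summary` and its parameters are made up for illustration, and it assumes a single 2D grayscale uint8 image such as the 28x28 one used above.

```python
try:
    import tensorflow as tf  # TensorFlow 2.x is assumed
except ImportError:  # keeps the sketch importable on machines without TensorFlow
    tf = None


def write_image_summary(log_dir, image_2d, step, tag="image"):
    """Write one grayscale image to a TensorBoard event file using the TF2 summary API."""
    if tf is None:
        raise RuntimeError("TensorFlow 2.x is required for this sketch")
    # tf.summary.image expects a batch shaped [k, height, width, channels],
    # so add a leading batch axis and a trailing channel axis.
    batch = tf.convert_to_tensor(image_2d)[tf.newaxis, ..., tf.newaxis]
    writer = tf.summary.create_file_writer(log_dir)
    with writer.as_default():
        tf.summary.image(tag, batch, step=step)
    writer.flush()


# Hypothetical usage inside a Keras callback's on_epoch_end:
# write_image_summary("logs/images", img, step=epoch)
```

In a real callback you would likely create the writer once and reuse it across epochs instead of opening one per call.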
Also, how are you stopping and starting the experiments? When starting an experiment, are you resuming training? In that case, you might want to consider setting the initial iteration to the last iteration your program reported
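A minimal sketch of that last suggestion, assuming the last reported iteration is known (for example, stored alongside the checkpoint). The helper name `resume_reporting_from` is hypothetical; it relies only on the task exposing `set_initial_iteration`, which ClearML `Task` objects provide.

```python
def resume_reporting_from(task, last_reported_iteration):
    """Offset future reports so a resumed run continues after the last reported iteration."""
    if last_reported_iteration < 0:
        raise ValueError("last_reported_iteration must be non-negative")
    task.set_initial_iteration(last_reported_iteration)
    return last_reported_iteration


# Hypothetical usage, assuming `task` came from clearml.Task.init(...)
# and `checkpoint_step` was read from your own checkpoint:
# resume_reporting_from(task, checkpoint_step)
```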
I will try with clearml==1.16.3rc2 and see if it still has the issue