ThankfulClams64 you could try using the compare function in the UI to compare the experiments from the machine where the scalars are not reported properly with the experiments from a machine that runs them properly. I suggest then replicating that environment exactly on the problematic machine. None
Yes, it is logging to the console. When it is having the issue, the script does hang once it completes all the epochs.
Then we also connect two dictionaries for configs
task.connect(model_config)
task.connect(DataAugConfig)
ThankfulClams64 , are logs showing up without issue on the 'problematic' machine?
I'm not sure if it still reports logs, but it will continue running on the machine
If you remove any reference of ClearML from the code on that machine, does it still hang?
Yes, it shows in the UI and has the first epoch for some of the metrics, but that's it. It has run about 50 epochs and says it is still running, but there are no updates to the scalars or debug samples
Yes, tensorboard. It is still logging the tensorboard scalars and images; it just doesn't log the console output
So I was able to repeat the same behavior on a machine running this example None
by adding the following callback
import io
import numpy as np
import tensorflow as tf
from tensorflow.keras.callbacks import TensorBoard  # assumed import for the base callback
from PIL import Image

class TensorBoardImage(TensorBoard):
    @staticmethod
    def make_image(tensor):
        # Tile the single-channel array into RGB and PNG-encode it
        tensor = np.stack((tensor, tensor, tensor), axis=2)
        height, width, channels = tensor.shape
        image = Image.fromarray(tensor)
        output = io.BytesIO()
        image.save(output, format='PNG')
        image_string = output.getvalue()
        output.close()
        # tf.Summary is the TF 1.x summary API
        return tf.Summary.Image(height=height,
                                width=width,
                                colorspace=channels,
                                encoded_image_string=image_string)

    def on_epoch_end(self, epoch, logs=None):
        if logs is None:
            logs = {}
        super(TensorBoardImage, self).on_epoch_end(epoch, logs)
        images = self.validation_data[0]  # 0 - data; 1 - labels
        img = (255 * images[0].reshape(28, 28)).astype('uint8')
        image = self.make_image(img)
        summary = tf.Summary(value=[tf.Summary.Value(tag='image', image=image)])
        self.writer.add_summary(summary, epoch)
So it seems like there is some bug in how ClearML is logging tensorboard images that causes everything to fail
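(For reference, a minimal sketch of how a callback like the one above would be attached; the model/data names are hypothetical, and it assumes an older TF 1.x-style Keras where callbacks still expose self.validation_data:)
tb_image_cb = TensorBoardImage(log_dir='./logs')  # hypothetical log dir
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[tb_image_cb])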
Just to make sure, did the logging to the clearml server work previously and stopped working at some point?
Yeah, I am fine not having the console logging. My issue is that the scalars and debug images occasionally don't get recorded to ClearML
That makes sense... If you set auto_connect_streams to false, this means that auto reporting will be disabled, as per the documentation... If you set it to True, then logging should resume.
Hi ThankfulClams64! What tensorflow/keras version are you using? I noticed that in the TensorBoardImage callback you are using tf.Summary, which no longer exists since tensorflow 2.2.3, which I believe is too old to work with tensorboard==2.16.2.
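(For context, in TF 2.x the equivalent image logging would typically go through the tf.summary API instead; a rough sketch, not the original callback:)
import tensorflow as tf

writer = tf.summary.create_file_writer('./logs')  # hypothetical log dir

def log_image(img_uint8, step):
    # tf.summary.image expects a 4-D batch: [batch, height, width, channels]
    img = tf.expand_dims(tf.expand_dims(img_uint8, axis=-1), axis=0)
    with writer.as_default():
        tf.summary.image('image', img, step=step)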
Also, how are you stopping and starting the experiments? When starting an experiment, are you resuming training? In that case, you might want to consider setting the initial iteration to the last iteration your program reported
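(A sketch of what that could look like; last_reported_iteration is a stand-in for whatever iteration your previous run reached:)
from clearml import Task

task = Task.init(project_name='example', task_name='resumed run')  # hypothetical names
# Offset all subsequent reports so they continue from the previous run
task.set_initial_iteration(last_reported_iteration)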
Hi ThankfulClams64, the logging is done by a separate process; I'm pretty sure it's not terminating all of a sudden. Did you manage to get a full log of such an experiment to share?
Another thing I notice is that aborting the experiment does not work when this is happening. It just continues to run
Yes I see it in the terminal on the machine
ThankfulClams64, if you set auto_connect_streams to false, nothing will be reported from your frameworks. Which frameworks are you working with, tensorboard?
Hi ThankfulClams64, stopping all processes should do that; there is no programmatic way of doing that specifically. Did you try calling task.close() for all tasks you're using?
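(Roughly, with hypothetical task handles:)
from clearml import Task

main_task = Task.init(project_name='example', task_name='main run')  # hypothetical names
# ... training / reporting ...
main_task.close()  # call close() on every Task object your script created before it exits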
The console logging still works. The abort showed up in the log but did not work, and the process continued until I killed it.
I will try with clearml==1.16.3rc2 and see if it still has the issue
My bad, if you set auto_connect_streams to false, you basically disable the console logging... Please see the documentation:
auto_connect_streams (Union[bool, Mapping[str, bool]]) – Control the automatic logging of stdout and stderr.
Console output and also what you get on the ClearML task page under the console section
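(A minimal sketch of the setting; project/task names are placeholders:)
from clearml import Task

# Bool form: False disables stdout/stderr capture entirely.
# A per-stream mapping also works, e.g. {'stdout': True, 'stderr': True, 'logging': False}.
task = Task.init(project_name='example', task_name='stream logging demo',
                 auto_connect_streams=False)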
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything; it just continues to run
Correct, so I get something like this
ClearML Task: created new task id=6ec57dcb007545aebc4ec51eb5b34c67
======> WARNING! Git diff too large to store (2536kb), skipping uncommitted changes <======
ClearML results page:
but that is all
The same training works sometimes. But I'm not sure how to troubleshoot when it stops logging the metrics
What happens if you're running the reporting example from the ClearML github repository?
Running clearml_example.py in None reproduces the issue
task.connect(model_config)
task.connect(DataAugConfig)
If these are separate dictionaries, you should probably use two sections:
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")
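(Put together with stand-in values, this would look roughly like:)
from clearml import Task

task = Task.init(project_name='example', task_name='config sections')  # hypothetical names

model_config = {'backbone': 'resnet50', 'lr': 1e-3}   # stand-in values
DataAugConfig = {'flip': True, 'rotation': 15}        # stand-in values

task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")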
It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch, while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
Wait, so you are seeing some scalars?
while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
what are you seeing in your TB?
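(For reference, this kind of x-axis mismatch usually comes down to the step value passed when each scalar is written; a sketch with the TF 2.x summary API, reusing the numbers above as stand-ins:)
import tensorflow as tf

writer = tf.summary.create_file_writer('./logs')  # hypothetical log dir
with writer.as_default():
    # The step becomes the x-axis (iteration) shown in TensorBoard / ClearML
    tf.summary.scalar('loss', 0.42, step=1355)        # written per training iteration
    tf.summary.scalar('epoch_metric', 0.90, step=26)  # written once per epoch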