It was working for me. Anyway, I modified the callback. Attached is the script that has the issue for me: whenever I add random_image_logger
to the callbacks, it only logs some of the scalars for 1 epoch, then gets stuck and never recovers. When I remove random_image_logger
the scalars are logged correctly. Again, this happens on only 1 computer; on our other computers logging works perfectly fine
I just created a new virtual environment and the problem persists. There are only two dependencies: clearml and tensorflow. @<1523701070390366208:profile|CostlyOstrich36> what logs are you referring to?
When I try to abort an experiment, I get this in the log:
clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
but it does not stop anything, it just continues to run
So even if you abort it at the start of the experiment, it will keep running and reporting logs?
Any chance you have some uncommitted code changes, and when they are not included, this works fine?
I'm not sure if it still reports logs, but it will continue running on the machine
When the script hangs at the end, the experiment shows as failed in ClearML
Not sure why that is related to saving images
Hi @<1719524641879363584:profile|ThankfulClams64>! What tensorflow/keras version are you using? I noticed that in the TensorBoardImage callback you are using tf.Summary, which no longer exists since tensorflow 2.2.3 and which I believe is too old to work with tensorboard==2.16.2.
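For reference, here is a minimal sketch of a TF2-compatible version of such an image callback, using tf.summary.image instead of the removed tf.Summary (I'm keeping your TensorBoardImage name; the log dir and image shape are placeholders, not your actual script):

import numpy as np
import tensorflow as tf

class TensorBoardImage(tf.keras.callbacks.Callback):
    """Logs a random image to TensorBoard at the end of every epoch."""

    def __init__(self, log_dir):
        super().__init__()
        # tf.summary.create_file_writer replaces the old TF1-style FileWriter
        self.writer = tf.summary.create_file_writer(log_dir)

    def on_epoch_end(self, epoch, logs=None):
        # Shape (batch, height, width, channels), float values in [0, 1]
        image = np.random.rand(1, 64, 64, 3).astype("float32")
        with self.writer.as_default():
            tf.summary.image("random_image", image, step=epoch)
        self.writer.flush()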
Also, how are you stopping and starting the experiments? When starting an experiment, are you resuming training? In that case, you might want to consider setting the initial iteration to the last iteration your program reported
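Roughly like this, as a sketch (the task lookup and the iteration value are placeholders, adjust to whatever your run actually reported):

from clearml import Task

# Assuming the task was already created earlier with Task.init(...)
task = Task.current_task()

# Offset the iteration counter so resumed scalars continue where the previous run stopped
last_reported_iteration = 1355  # placeholder: the last iteration your program reported
task.set_initial_iteration(last_reported_iteration)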
It is not always reproducible. It seems like something we don't understand happens, and then the machine consistently has this issue. We believe it has something to do with stopping and starting experiments
If you remove any reference of ClearML from the code on that machine, does it still hang?
Console output and also what you get on the ClearML task page under the console section
sometimes I get no scalars, but the console logging always seems to be working
The machine currently having the issue is on tensorboard==2.16.2
STATUS MESSAGE: N/A
STATUS REASON: Signal None
Hi @<1719524641879363584:profile|ThankfulClams64>
I am using ClearML Pro and pretty regularly I will restart an experiment and nothing will get logged to ClearML.
I use ClearML with pytorch 1.7.1, pytorch-lightning 1.2.2 and TensorBoard auto-logging.
Everything ClearML-related is on the latest stable updates (clearml 1.7.4, clearml-agent 1.7.2)
Is this still happening with the latest clearml (clearml==1.16.3rc2)?
What is the TB version?
I remember a fix regarding lightning support
Also, just making sure, are you using the default lightning TB logger?
How are you initializing the Task (i.e. could you copy the Task.init code here)?
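For comparison, a typical initialization looks something like this (project/task names are placeholders; auto_connect_frameworks defaults to True and controls the TensorBoard auto-logging):

from clearml import Task

task = Task.init(
    project_name="my_project",     # placeholder
    task_name="my_experiment",     # placeholder
    auto_connect_frameworks=True,  # default; picks up TensorBoard scalars/images automatically
)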
task.connect(model_config)
task.connect(DataAugConfig)
If these are separate dictionaries, you should probably use two sections:
task.connect(model_config, name="model config")
task.connect(DataAugConfig, name="data aug")
It is still getting stuck.
I notice that one of the scalars that gets logged early is logging the epoch, while the remaining scalars seem to be on iterations, because the iteration value is 1355 instead of 26
Wait, so you are seeing some scalars?
while the remaining scalars seem to be iterations because the iteration value is 1355 instead of 26
what are you seeing in your TB?
So I am only seeing values for the first epoch. It seems like it does not track all of them, so maybe something is happening when it tries to log scalars.
I have seen it only log iterations, but setting task.set_initial_iteration(0)
seemed to fix that, so it now seems to be logging the correct epoch
Tensorboard is correct and works. I have never seen an issue in the tensorboard logs
This was on the same machine I am having issues with. It logs scalars correctly using the example code, but when I add in that callback, which just logs a random image to TensorBoard, I don't get any scalars logged
Then we also connect two dictionaries for configs
task.connect(model_config)
task.connect(DataAugConfig)
Hi @<1719524641879363584:profile|ThankfulClams64> , does the experiment itself show on the ClearML UI?
Running clearml_example.py in None reproduces the issue
What happens if you're running the reporting example from the ClearML github repository?
I will try with clearml==1.16.3rc2 and see if it still has the issue
Yes I see it in the terminal on the machine
@<1719524641879363584:profile|ThankfulClams64> , are logs showing up without issue on the 'problematic' machine?
Just to make sure, did the logging to the clearml server work previously and stopped working at some point?