This is what I get when running on Clearml. Notice the nan in the loss
Epoch 1/150
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1739804333.538008 890492 service.cc:145] XLA service 0x7f19b80029d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1739804333.538068 890492 service.cc:153] StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2025-02-17 14:58:54.021777: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY
to enable.
2025-02-17 14:58:54.823385: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
I0000 00:00:1739804355.312793 890492 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2025-02-17 15:02:43,896 - clearml - INFO - NaN value encountered. Reporting it as '0.0'. Use clearml.Logger.set_reporting_nan_value to assign another value
Saved artifact at '/tmp/tmpmfu4qumg/model'. The following endpoints are available:
- Endpoint 'serve'
args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 1, 256, 256, 1), dtype=tf.float32, name='keras_tensor_51')
Output Type:
TensorSpec(shape=(None, 256, 256, 3), dtype=tf.float32, name=None)
Captures:
139737631767920: TensorSpec(shape=(1, 1, 1, 1, 1), dtype=tf.float32, name=None)
139737631767040: TensorSpec(shape=(1, 1, 1, 1, 1), dtype=tf.float32, name=None)
139737631768272: TensorSpec(shape=(1, 1, 256, 256, 1), dtype=tf.float32, name=None)
139753012177808: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012178864: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012176224: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012180624: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012178512: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012179744: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753054298080: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012174288: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012176400: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012181328: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012180976: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012180800: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012182208: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012182560: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012181680: TensorSpec(shape=(), dtype=tf.resource, name=None)
139753012181504: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633002192: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633002368: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633003424: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633003600: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633004832: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633005184: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633003952: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633003776: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633006064: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633006240: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633006768: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633006592: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633008000: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633008352: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633007472: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633007296: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633009056: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633009232: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633009760: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633009584: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633009408: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633011168: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633008704: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633010464: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633012400: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633012576: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633012928: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737633013104: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631753312: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631753664: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631752432: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631752256: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631754720: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631754896: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631755424: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631755248: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631755072: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631756832: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631754544: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631756128: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631760704: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631760880: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631761408: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631761232: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631761056: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631762816: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631760528: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631762112: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631763872: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631764048: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631764576: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631764400: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631764224: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631765984: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631763696: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631765280: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631766688: TensorSpec(shape=(), dtype=tf.resource, name=None)
139737631766864: TensorSpec(shape=(), dtype=tf.resource, name=None)
2025-02-17 15:02:46,852 - clearml.Task - INFO - Completed model upload to None
1939/1939 - 242s - 125ms/step - dice_Bg: nan - dice_Lumen: nan - dice_all: nan - loss: nan - val_dice_Bg: nan - val_dice_Lumen: nan - val_dice_all: nan - val_loss: nan
Epoch 2/150