Hi, I see some strange behavior where training fails (loss = NaN) when running on ClearML, but succeeds when running without ClearML. This is entirely reproducible. Has anyone seen this?


This is what I get when running the exact same training session without ClearML:
Epoch 1/150
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1739806371.262488 897794 service.cc:145] XLA service 0x7fc058066d20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1739806371.262578 897794 service.cc:153] StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2025-02-17 15:32:51.772357: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2025-02-17 15:32:52.532296: I external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:465] Loaded cuDNN version 8906
I0000 00:00:1739806392.856744 897794 device_compiler.h:188] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
1939/1939 - 233s - 120ms/step - dice_Bg: 0.9607 - dice_Lumen: 0.8216 - dice_all: 0.8216 - loss: 0.1238 - val_dice_Bg: 0.9862 - val_dice_Lumen: 0.8429 - val_dice_all: 0.8429 - val_loss: 0.0908
Epoch 2/150
1939/1939 - 203s - 105ms/step - dice_Bg: 0.9752 - dice_Lumen: 0.8748 - dice_all: 0.8748 - loss: 0.0816 - val_dice_Bg: 0.9889 - val_dice_Lumen: 0.8836 - val_dice_all: 0.8836 - val_loss: 0.0679
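To narrow down where the two runs diverge, one option is to record the per-epoch loss values from both sessions and locate the first epoch at which the loss becomes NaN. A minimal sketch of such a check (the function name and data are illustrative, not part of any ClearML or Keras API):

```python
import math

def first_nan_epoch(losses):
    """Return the 1-based epoch at which the loss first becomes NaN,
    or None if the loss stays finite for the whole run."""
    for epoch, value in enumerate(losses, start=1):
        if math.isnan(value):
            return epoch
    return None

# Example: losses taken from two hypothetical runs.
run_without_clearml = [0.1238, 0.0816, 0.0679]
run_with_clearml = [0.1251, float("nan"), float("nan")]

print(first_nan_epoch(run_without_clearml))  # None: loss stayed finite
print(first_nan_epoch(run_with_clearml))     # 2: NaN appeared at epoch 2
```

If the NaN appears from the very first epoch only under ClearML, a common suspect is ClearML's framework auto-logging or argument auto-connection altering some training input (e.g. a captured hyperparameter); comparing the effective configuration of both runs at startup is usually the next step.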

  
  
Posted 6 months ago
99 Views
0 Answers