I'm experiencing an issue with my YOLO training script when using ClearML. Although the training process itself completes successfully (as indicated by the "training is finished" message), the script appears to hang indefinitely after this point. The process doesn't terminate on its own, forcing me to use CTRL+C to stop it manually.
Code Snippet
import os
os.environ['YOLO_VERBOSE'] = 'false'
from ultralytics import YOLO
import multiprocessing as mp
from clearml import Task
task = Task.init(
    project_name='TEST',
    task_name="YOLO_TRAIN",
    output_uri=True,
)
# Load a model
model = YOLO("models/yolo11n-seg.pt")  # load a pretrained model (recommended for training)
# Train the model
print('initializing training..')
model.train(
    data="data/YOLO_DATASETS/data.yml",
    batch=-1,
    lr0=1e-3,
    optimizer="AdamW",
    epochs=1,
    imgsz=1024,
    pretrained=True,
    verbose=False,
    workers=mp.cpu_count(),
    patience=200,
    plots=True,
)
print('training is finished')
task.flush()
task.close()
Console Output
(yolo-training) joao@LCPServer:~/Experimentos/_CLEARML/project$ python script.py
ClearML Task: created new task id=7f690a8a560f4655a79b2d015a33c5dd
======> WARNING! Git diff too large to store (1323kb), skipping uncommitted changes <======
ClearML results page:
2025-02-27 17:30:06,673 - clearml.model - INFO - Selected model id: 93bb56d6459a461c928ec14e493d4ded
initializing training..
/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/albumentations/__init__.py:13: UserWarning:
A new version of Albumentations is available: 2.0.4 (you have 1.4.17). Upgrade using: pip install -U albumentations. To disable automatic update checks, set the environment variable NO_ALBUMENTATIONS_UPDATE to 1.
training is finished
0% | 0.00/5.8 MB [00:00<?, ?MB/s]: /home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/tqdm/std.py:636: TqdmWarning:
clamping frac to range [0, 1]
██████████████████████████████████ 100% | 5.80/5.8 MB [00:00<00:00, 13.80MB/s]:
At this point, the script hangs indefinitely, and I have to manually terminate it with CTRL+C, which produces the following stack trace:
^CTraceback (most recent call last):
  File "/home/joao/Experimentos/_CLEARML/project/script.py", line 37, in <module>
    task.close()
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/task.py", line 2504, in close
    self.__shutdown()
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/task.py", line 4656, in __shutdown
    self.flush(wait_for_uploads=True)
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/task.py", line 2453, in flush
    self.__reporter.wait_for_events()
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 337, in wait_for_events
    return report_service.wait_for_events(timeout=timeout)
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/backend_interface/metrics/reporter.py", line 129, in wait_for_events
    if self._empty_state_event.wait(timeout=1.0):
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/utilities/process/mp.py", line 449, in wait
    return self._event.wait(timeout=timeout)
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/multiprocessing/synchronize.py", line 349, in wait
    self._cond.wait(timeout)
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/multiprocessing/synchronize.py", line 261, in wait
    return self._wait_semaphore.acquire(True, timeout)
  File "/home/joao/miniconda3/envs/yolo-training/lib/python3.9/site-packages/clearml/utilities/process/exit_hooks.py", line 157, in signal_handler
    return org_handler if not callable(org_handler) else org_handler(sig, frame)
KeyboardInterrupt
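The trace above only shows the main thread, which is blocked in `task.close()` waiting for ClearML's reporter to drain its events. To see what every other thread is doing without pressing CTRL+C, a watchdog built on the standard-library `faulthandler` module might help. This is only a sketch: the helper names `arm_hang_watchdog` and `disarm_hang_watchdog` and the 120-second timeout are my own assumptions, not part of ClearML or Ultralytics.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=120.0, file=sys.stderr):
    """Dump the stack of every thread to `file` if still running after `timeout_s`.

    Hypothetical helper: it only wraps faulthandler.dump_traceback_later.
    `file` must be backed by a real file descriptor (stderr or an opened file).
    exit=False means the process is left running after the dump.
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump once the guarded call has returned normally."""
    faulthandler.cancel_dump_traceback_later()
```

In the script above, the idea would be to call `arm_hang_watchdog()` right before `task.close()` and `disarm_hang_watchdog()` right after it, so a dump appears only when the close call actually stalls.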
Environment
- Python 3.9
- YOLO training environment (conda)
- ClearML latest version
- Ultralytics YOLO
Questions
- Why does the script hang after training completion, even though "training is finished" is printed?
- Are there any recommended configurations or changes to make the script terminate properly after training?
- Could this be related to background processes or threads started by either YOLO or ClearML that aren't being properly closed?
Any guidance or suggestions for fixing this issue would be greatly appreciated.
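Regarding the last question, one way to check for leftover background work is to list the threads and child processes that are still alive just before `task.close()` is called. The following is a minimal sketch using only the standard library. The function name `describe_live_workers` is hypothetical, and it will not necessarily show which component is responsible, only what is still running.

```python
import multiprocessing as mp
import threading


def describe_live_workers():
    """Return one description per live non-main thread and live child process."""
    lines = []
    for thread in threading.enumerate():
        if thread is threading.main_thread():
            continue
        kind = "daemon" if thread.daemon else "non-daemon"
        lines.append(f"thread {thread.name!r} ({kind})")
    # active_children() only reports children started through multiprocessing.
    for child in mp.active_children():
        lines.append(f"child process {child.name!r} pid={child.pid}")
    return lines


if __name__ == "__main__":
    for line in describe_live_workers():
        print(line)
```

Non-daemon threads and live child processes are the ones that can keep an interpreter from exiting, so those entries would be the first to look at. Note that processes launched by other means (for example via `subprocess`) would not appear in this list.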