Hello, I am training some models with YOLOv8 and want to upload the metrics to the ClearML web UI. However, sometimes it works and sometimes it does not.

Answered

Hello, I am training some models with YOLOv8 and want to upload the metrics to the ClearML web UI. However, it only works some of the time. ClearML captures everything from the console, but it does not create plots on the Scalars tab, such as precision, recall, and mAP; it is stuck at iteration 0. Also, when training has completed, the web UI still shows the task as running. What can I do to fix this? I read online that there can be issues if TensorBoard is not installed, but it still does not work with TensorBoard installed.

  				
Posted 2 years ago by FuzzyCentipede59


Answers 21

An update: using your code (the snippet above), I got no scalars after simply installing the ultralytics and clearml packages with pip, because tensorboard is then indeed not installed. When I do install tensorboard, the metrics come in as normal, so I can't reproduce the issue when tensorboard is correctly installed. That said, maybe we should look at not having this dependency 🤔

Would you mind posting a pip freeze of your environment that you're using to run yolo?
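
Alongside `pip freeze`, it can help to confirm which interpreter is actually running the training and whether it can see the relevant packages. The following is a minimal sketch: the package list is an assumption based on this thread, and `installed_versions` is a hypothetical helper name, not part of ClearML or Ultralytics.

```python
import sys
from importlib import metadata

# Packages the YOLOv8 + ClearML scalar logging path relies on, per this thread (assumed list).
REQUIRED = ("clearml", "ultralytics", "tensorboard")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # The interpreter path shows which virtualenv is actually in use.
    print("interpreter:", sys.executable)
    print("python:", sys.version.split()[0])
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running it with the exact command used to start training (for example, the same `python` inside the activated virtualenv) should show whether tensorboard is visible to that interpreter.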

  				
Posted 2 years ago by ExasperatedCrab78

Thank you for your assistance! These are the packages installed in the environment:

absl-py==1.4.0
aiofiles==22.1.0
aiosqlite==0.18.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
attrs==22.2.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
clearml==1.10.3
cmake==3.26.1
comm==0.1.3
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.16.3
filelock==3.11.0
fonttools==4.39.3
fqdn==1.5.1
furl==2.1.3
google-auth==2.17.2
google-auth-oauthlib==1.0.0
grpcio==1.53.0
htmlmin==0.1.12
idna==3.4
ImageHash==4.3.1
importlib-metadata==6.2.0
importlib-resources==5.12.0
install==1.3.5
ipykernel==6.22.0
ipython==8.12.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.8.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
jupyterlab_server==2.22.0
kiwisolver==1.4.4
lit==16.0.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.6.3
matplotlib-inline==0.1.6
mistune==2.0.5
mpmath==1.3.0
multimethod==1.9.1
nbclassic==0.5.5
nbclient==0.7.3
nbconvert==7.3.0
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==3.1
notebook==6.5.4
notebook_shim==0.2.2
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
opencv-python==4.7.0.72
orderedmultidict==1.0.1
packaging==23.0
pandas==1.5.3
pandas-profiling==3.6.6
pandocfilters==1.5.0
parso==0.8.3
pathlib2==2.3.7.post1
patsy==0.5.3
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
protobuf==4.22.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.10.7
Pygments==2.14.0
PyJWT==2.4.0
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.2
qtconsole==5.4.2
QtPy==2.3.1
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
scipy==1.9.3
seaborn==0.12.2
Send2Trash==1.8.0
sentry-sdk==1.19.1
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
stack-data==0.6.2
statsmodels==0.13.5
sympy==1.11.1
tangled-up-in-unicode==0.2.0
tensorboard==2.12.1
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
terminado==0.17.1
thop==0.1.1.post2209072238
tinycss2==1.2.1
tomli==2.0.1
torch==2.0.0
torchvision==0.15.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
triton==2.0.0
typeguard==2.13.3
typing_extensions==4.5.0
tzdata==2023.3
ultralytics==8.0.71
uri-template==1.2.0
urllib3==1.26.15
visions==0.7.5
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.2.3
widgetsnbextension==4.0.7
y-py==0.5.9
ydata-profiling==4.1.2
ypy-websocket==0.8.2
zipp==3.15.0

  				
Posted 2 years ago by FuzzyCentipede59

One more thing: are you running the snippet inside a Jupyter notebook? (I'm wondering because you have Jupyter in your environment.)

  				
Posted 2 years ago by ExasperatedCrab78

FuzzyCentipede59 Would you mind sharing how you're running the training? i.e. a minimal code example so we can reproduce the issue?

  				
Posted 2 years ago by ExasperatedCrab78

Based on the screenshot of your package versions, it seems that tensorboard is not installed there. We depend on it, because every scalar logged to tensorboard is captured by ClearML too. My guess is that you installed tensorboard in, e.g., the wrong virtualenv.

However, you do say you tested it with Tensorboard and even then it didn't work. In that case, are the scalars correctly logged to tensorboard? You should be able to easily check this by doing a run, and then launching tensorboard to see if the scalars are coming in. If they are, but ClearML is not receiving them, it's probably a bug.

Would you mind checking that? In the meantime, I'll launch a version of my own with your script and see if I get the same issues
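
One way to check whether scalars reached TensorBoard at all is to read the event files directly instead of launching the TensorBoard UI. The following is a minimal sketch, assuming the run wrote its event files into a single output folder; `find_event_files`, `scalar_summary` and `main` are hypothetical helper names, while `EventAccumulator` comes from the tensorboard package.

```python
from pathlib import Path


def find_event_files(run_dir):
    """Return TensorBoard event files found anywhere under run_dir, sorted by path."""
    return sorted(Path(run_dir).rglob("events.out.tfevents.*"))


def scalar_summary(run_dir):
    """Return {tag: number of logged points} for the scalars stored in run_dir."""
    # Imported lazily so find_event_files() still works without tensorboard installed.
    from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

    accumulator = EventAccumulator(str(run_dir))
    accumulator.Reload()  # parse the event files on disk
    return {tag: len(accumulator.Scalars(tag)) for tag in accumulator.Tags()["scalars"]}


def main(run_dir):
    """Print the event files and the scalar tags found for one training run."""
    print("event files:", [p.name for p in find_event_files(run_dir)])
    for tag, count in scalar_summary(run_dir).items():
        print(f"{tag}: {count} points")
```

Calling `main("runs/detect/train")` would inspect what is assumed here to be the Ultralytics output folder; adjust the path to your own run. If the tags are listed with more than one point each but ClearML still shows nothing past iteration 0, that points at the ClearML side rather than at TensorBoard.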

  				
Posted 2 years ago by ExasperatedCrab78

However, I get these warnings:

TensorFlow installation not found - running with reduced feature set.
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.29' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/haakobh/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
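
These warnings say that the `tensorboard_data_server` binary asks for glibc symbol versions that `/lib64/libc.so.6` on the cluster does not provide. Whether this is related to the missing ClearML scalars is not established in this thread, but the system's glibc version is easy to inspect. The following is a minimal sketch; `as_tuple` and `unsatisfied` are hypothetical helper names, and the version list is copied from the warnings.

```python
import platform

# GLIBC versions named in the warnings above.
REQUESTED = ("2.29", "2.32", "2.33", "2.34")


def as_tuple(version):
    """Turn a dotted version string such as '2.29' into a comparable tuple (2, 29)."""
    return tuple(int(part) for part in version.split("."))


def unsatisfied(system_version, requested=REQUESTED):
    """Return the requested GLIBC versions that are newer than the system's."""
    have = as_tuple(system_version)
    return [v for v in requested if as_tuple(v) > have]


if __name__ == "__main__":
    # libc_ver() returns a (library, version) pair; both are empty strings if unknown.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(f"system glibc {version}; unsatisfied: {unsatisfied(version) or 'none'}")
    else:
        print("could not determine the glibc version on this platform")
```

If the list of unsatisfied versions is not empty, the prebuilt data server cannot start on that machine, which matches the warnings.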

  				
Posted 2 years ago by FuzzyCentipede59

I will do that!

  				
Posted 2 years ago by FuzzyCentipede59

Also, the menu says it is at iteration 2, even though the console log on the web page says it is at epoch 8.

  				
Posted 2 years ago by FuzzyCentipede59

ExasperatedCrab78 Hey, I deleted the virtual environment and created a new one with Python 3.9 and the necessary dependencies, and now it seems to work! 😄 Thanks for your help! Maybe some packages were conflicting in the Python 3.8 environment.

  				
Posted 2 years ago by FuzzyCentipede59

It actually also works on my local machine, where I am using Python 3.9. The issue happens when training on my university's GPU cluster, where I am using Python 3.8.2. I will try to create a new virtual environment with Python 3.10.4 and see if it works then 🙂

  				
Posted 2 years ago by FuzzyCentipede59

What version of ClearML are you using?

  				
Posted 2 years ago by CostlyOstrich36

I am not running in a notebook.

  				
Posted 2 years ago by FuzzyCentipede59

CostlyOstrich36 Do you know what the problem might be?

  				
Posted 2 years ago by FuzzyCentipede59

ExasperatedCrab78 Sure! Here is my train file:

from ultralytics import YOLO

# Load a model
model = YOLO(model="yolov8m.pt")  # load a pretrained model (recommended for training)

# Train the model
model.train(
    data="data.yaml",
    epochs=200,
    imgsz=640,
    label_smoothing=0.1,
    shear=0.01,
    perspective=0.0001,
    mosaic=0.5,
    mixup=0.1,
)

and here is the ClearML callback file from the YOLOv8 source code:

# Ultralytics YOLO 🚀, GPL-3.0 license
import re

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

from ultralytics.yolo.utils import LOGGER, TESTS_RUNNING
from ultralytics.yolo.utils.torch_utils import get_flops, get_num_params

try:
    import clearml
    from clearml import Task
    from clearml.binding.frameworks.pytorch_bind import PatchPyTorchModelIO
    from clearml.binding.matplotlib_bind import PatchedMatplotlib

    assert hasattr(clearml, '__version__')  # verify package is not directory
    assert not TESTS_RUNNING  # do not log pytest
except (ImportError, AssertionError):
    clearml = None


def _log_debug_samples(files, title='Debug Samples'):
    """
        Log files (images) as debug samples in the ClearML task.

        arguments:
        files (List(PosixPath)) a list of file paths in PosixPath format
        title (str) A title that groups together images with the same values
        """
    task = Task.current_task()
    if task:
        for f in files:
            if f.exists():
                it = re.search(r'_batch(\d+)', f.name)
                iteration = int(it.groups()[0]) if it else 0
                task.get_logger().report_image(title=title,
                                               series=f.name.replace(it.group(), ''),
                                               local_path=str(f),
                                               iteration=iteration)


def _log_plot(title, plot_path):
    """
        Log image as plot in the plot section of ClearML

        arguments:
        title (str) Title of the plot
        plot_path (PosixPath or str) Path to the saved image file
        """
    img = mpimg.imread(plot_path)
    fig = plt.figure()
    ax = fig.add_axes([0, 0, 1, 1], frameon=False, aspect='auto', xticks=[], yticks=[])  # no ticks
    ax.imshow(img)

    Task.current_task().get_logger().report_matplotlib_figure(title, '', figure=fig, report_interactive=False)


def on_pretrain_routine_start(trainer):
    try:
        task = Task.current_task()
        if task:
            # Make sure the automatic pytorch and matplotlib bindings are disabled!
            # We are logging these plots and model files manually in the integration
            PatchPyTorchModelIO.update_current_task(None)
            PatchedMatplotlib.update_current_task(None)
        else:
            task = Task.init(project_name=trainer.args.project or 'YOLOv8',
                             task_name=trainer.args.name,
                             tags=['YOLOv8'],
                             output_uri=True,
                             reuse_last_task_id=False,
                             auto_connect_frameworks={
                                 'pytorch': False,
                                 'matplotlib': False})
            LOGGER.warning('ClearML Initialized a new task. If you want to run remotely, '
                           'please add clearml-init and connect your arguments before initializing YOLO.')
        task.connect(vars(trainer.args), name='General')
    except Exception as e:
        LOGGER.warning(f'WARNING ⚠️ ClearML installed but not initialized correctly, not logging this run. {e}')


def on_train_epoch_end(trainer):
    if trainer.epoch == 1 and Task.current_task():
        _log_debug_samples(sorted(trainer.save_dir.glob('train_batch*.jpg')), 'Mosaic')


def on_fit_epoch_end(trainer):
    task = Task.current_task()
    if task:
        # You should have access to the validation bboxes under jdict
        task.get_logger().report_scalar(title='Epoch Time',
                                        series='Epoch Time',
                                        value=trainer.epoch_time,
                                        iteration=trainer.epoch)
        if trainer.epoch == 0:
            model_info = {
                'model/parameters': get_num_params(trainer.model),
                'model/GFLOPs': round(get_flops(trainer.model), 3),
                'model/speed(ms)': round(trainer.validator.speed['inference'], 3)}
            for k, v in model_info.items():
                task.get_logger().report_single_value(k, v)


def on_val_end(validator):
    if Task.current_task():
        # Log val_labels and val_pred
        _log_debug_samples(sorted(validator.save_dir.glob('val*.jpg')), 'Validation')


def on_train_end(trainer):
    task = Task.current_task()
    if task:
        # Log final results, CM matrix + PR plots
        files = ['results.png', 'confusion_matrix.png', *(f'{x}_curve.png' for x in ('F1', 'PR', 'P', 'R'))]
        files = [(trainer.save_dir / f) for f in files if (trainer.save_dir / f).exists()]  # filter
        for f in files:
            _log_plot(title=f.stem, plot_path=f)
        # Report final metrics
        for k, v in trainer.validator.metrics.results_dict.items():
            task.get_logger().report_single_value(k, v)
        # Log the final model
        task.update_output_model(model_path=str(trainer.best), model_name=trainer.args.name, auto_delete_file=False)


callbacks = {
    'on_pretrain_routine_start': on_pretrain_routine_start,
    'on_train_epoch_end': on_train_epoch_end,
    'on_fit_epoch_end': on_fit_epoch_end,
    'on_val_end': on_val_end,
    'on_train_end': on_train_end} if clearml else {}
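
Note that the callbacks above report only "Epoch Time" as a per-epoch scalar, plus a few single values; precision, recall, and mAP are not reported directly here and reach ClearML through the TensorBoard binding mentioned elsewhere in this thread. As a workaround to experiment with, the metrics could also be reported explicitly. The following is a minimal sketch, not part of the Ultralytics source: `report_metrics` and `clearml_scalar_callback` are hypothetical names, and it assumes `trainer.metrics` is a dict of the validation metrics for the finished epoch.

```python
def report_metrics(logger, metrics, epoch):
    """Send each numeric metric to a ClearML-style logger as one scalar point.

    A key such as 'metrics/precision(B)' becomes title 'metrics', series 'precision(B)'.
    Returns the number of values that were reported.
    """
    reported = 0
    for key, value in metrics.items():
        try:
            value = float(value)
        except (TypeError, ValueError):
            continue  # skip anything that is not a plain number
        title, _, series = key.partition("/")
        logger.report_scalar(title=title, series=series or title, value=value, iteration=epoch)
        reported += 1
    return reported


def clearml_scalar_callback(trainer):
    """Hypothetical extra 'on_fit_epoch_end' callback; not part of the Ultralytics source."""
    from clearml import Task  # imported lazily; requires clearml to be installed

    task = Task.current_task()
    if task:
        # Assumption: trainer.metrics holds the validation metrics of the finished epoch.
        report_metrics(task.get_logger(), trainer.metrics, trainer.epoch)
```

Assuming your Ultralytics version exposes `add_callback` on the model object, the callback could be registered with `model.add_callback("on_fit_epoch_end", clearml_scalar_callback)` before calling `model.train(...)`. If TensorBoard capture is also working, the same metrics may then appear twice under different titles.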

  				
Posted 2 years ago by FuzzyCentipede59

It says 1.10.3

  				
Posted 2 years ago by FuzzyCentipede59

Interesting! I'm glad to know it's working now, only I now really want to know what caused it 😄 Let me know if you ever do find out!

  				
Posted 2 years ago by ExasperatedCrab78

Hi FuzzyCentipede59! Can you share some snippets of your code, and tell us what you expect to see versus what you actually see happening?

  				
Posted 2 years ago by SmugDolphin23

With tensorboard I get these plots:

  				
Posted 2 years ago by FuzzyCentipede59

I'm still struggling to reproduce the issue. Neither running locally on my own PC nor running on Google Colab triggers it.

The fact that you do get tensorboard logs, but none of them are captured by ClearML, means there might be something wrong with our tensorboard bindings, but it's hard to pinpoint exactly what if I can't get it to fail like yours 😅 Let me try to install exactly your environment using your packages above. Which Python version are you using?

  				
Posted 2 years ago by ExasperatedCrab78

I use the implementation of yolov8:

None

  				
Posted 2 years ago by FuzzyCentipede59

The first image shows how it should look. In the second image, however, the model is actually training on the 7th epoch, but the scalars are not updated; they are stuck on iteration 0.

  				
Posted 2 years ago by FuzzyCentipede59
