Answered
Hello, I Am Training Some Models With Yolov8 And Want To Upload The Metrics To The Clearml Webpage In. However, Sometimes It Works And Sometimes It Does Not Work. Clearml Is Able To Read Everything From The Console And Stuff Like That, But Is Not Able To

Hello, I am training some models with YOLOv8 and want to upload the metrics to the ClearML web UI. However, sometimes it works and sometimes it does not. ClearML is able to read everything from the console, but it is not able to create plots on the Scalars tab, such as precision, recall, mAP and so on. It is stuck at iteration 0. Also, when the training is completed, the web UI says it is still running. What can I do to fix this? I read online that there could be some issues if tensorboard is not installed; however, it still does not work with tensorboard installed.

  
  
Posted one year ago

Answers 21


@<1523701070390366208:profile|CostlyOstrich36> Do you know what potentially is the problem?

  
  
Posted one year ago

@<1558986839216361472:profile|FuzzyCentipede59> Would you mind sharing how you're running the training? i.e. a minimal code example so we can reproduce the issue?

  
  
Posted one year ago

An update: using your code (the snippet above) I was getting no scalars when simply installing ultralytics and clearml packages using pip. Because indeed tensorboard is not installed. When I do install tensorboard, I get metrics in like normal, so I can't seem to reproduce the issue when tensorboard is correctly installed. That said, maybe we should look at not having this dependency 🤔

Would you mind posting a pip freeze of your environment that you're using to run yolo?
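If the full pip freeze is too long to read through, a shorter report covering only the packages discussed in this thread may be enough. The sketch below is a minimal example using the standard library's `importlib.metadata`; the helper name `package_versions` and the shortlist of packages are assumptions chosen for illustration, not part of ClearML or Ultralytics.

```python
from importlib import metadata  # standard library since Python 3.8


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed shortlist: the packages this thread revolves around.
    report = package_versions(["clearml", "ultralytics", "tensorboard", "torch"])
    for name, version in report.items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Run it with the same interpreter that launches the training, otherwise it reports on a different environment.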

  
  
Posted one year ago

I will do that!

  
  
Posted one year ago

And at the menu it says it is at iteration 2 even though the console log in the webpage says it is at epoch 8
image
image

  
  
Posted one year ago

I use the implementation of yolov8:

None

  
  
Posted one year ago

I'm still struggling to reproduce the issue. Trying on my own PC locally as well as on google colab yields nothing.

The fact that you do get tensorboard logs, but none of them are captured by ClearML, means there might be something wrong with our tensorboard bindings, but it's hard to pinpoint exactly what if I can't get it to fail like yours 😅 Let me try to install exactly your environment using your packages above. Which Python version are you using?

  
  
Posted one year ago

The first image shows how it should look. In the second image, however, the model is actually training on the 7th epoch but the scalars are not updated; they are stuck on iteration 0.
image
image

  
  
Posted one year ago

@<1523701118159294464:profile|ExasperatedCrab78> Sure! Here is my train file:

from ultralytics import YOLO

# Load a model
model = YOLO(model="yolov8m.pt")  # load a pretrained model (recommended for training)

# Train the model
model.train(
    data="data.yaml",
    epochs=200,
    imgsz=640,
    label_smoothing=0.1,
    shear=0.01,
    perspective=0.0001,
    mosaic=0.5,
    mixup=0.1,
)

and here is from the source code for yolov8

# Ultralytics YOLO :rocket:, GPL-3.0 license
import re

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

from ultralytics.yolo.utils import LOGGER, TESTS_RUNNING
from ultralytics.yolo.utils.torch_utils import get_flops, get_num_params

try:
    import clearml
    from clearml import Task
    from clearml.binding.frameworks.pytorch_bind import PatchPyTorchModelIO
    from clearml.binding.matplotlib_bind import PatchedMatplotlib

    assert hasattr(clearml, '__version__')  # verify package is not directory
    assert not TESTS_RUNNING  # do not log pytest
except (ImportError, AssertionError):
    clearml = None


def _log_debug_samples(files, title='Debug Samples'):
    """
        Log files (images) as debug samples in the ClearML task.

        arguments:
        files (List(PosixPath)) a list of file paths in PosixPath format
        title (str) A title that groups together images with the same values
        """
    task = Task.current_task()
    if task:
        for f in files:
            if f.exists():
                it = re.search(r'_batch(\d+)', f.name)
                iteration = int(it.groups()[0]) if it else 0
                task.get_logger().report_image(title=title,
                                               series=f.name.replace(it.group(), ''),
                                               local_path=str(f),
                                               iteration=iteration)


def _log_plot(title, plot_path):
    """
        Log image as plot in the plot section of ClearML

        arguments:
        title (str) Title of the plot
        plot_path (PosixPath or str) Path to the saved image file
        """
    img = mpimg.imread(plot_path)
    fig = plt.figure()
    ax = fig.add_axes([0, 0, 1, 1], frameon=False, aspect='auto', xticks=[], yticks=[])  # no ticks
    ax.imshow(img)

    Task.current_task().get_logger().report_matplotlib_figure(title, '', figure=fig, report_interactive=False)


def on_pretrain_routine_start(trainer):
    try:
        task = Task.current_task()
        if task:
            # Make sure the automatic pytorch and matplotlib bindings are disabled!
            # We are logging these plots and model files manually in the integration
            PatchPyTorchModelIO.update_current_task(None)
            PatchedMatplotlib.update_current_task(None)
        else:
            task = Task.init(project_name=trainer.args.project or 'YOLOv8',
                             task_name=trainer.args.name,
                             tags=['YOLOv8'],
                             output_uri=True,
                             reuse_last_task_id=False,
                             auto_connect_frameworks={
                                 'pytorch': False,
                                 'matplotlib': False})
            LOGGER.warning('ClearML Initialized a new task. If you want to run remotely, '
                           'please add clearml-init and connect your arguments before initializing YOLO.')
        task.connect(vars(trainer.args), name='General')
    except Exception as e:
        LOGGER.warning(f'WARNING ⚠️ ClearML installed but not initialized correctly, not logging this run. {e}')


def on_train_epoch_end(trainer):
    if trainer.epoch == 1 and Task.current_task():
        _log_debug_samples(sorted(trainer.save_dir.glob('train_batch*.jpg')), 'Mosaic')


def on_fit_epoch_end(trainer):
    task = Task.current_task()
    if task:
        # You should have access to the validation bboxes under jdict
        task.get_logger().report_scalar(title='Epoch Time',
                                        series='Epoch Time',
                                        value=trainer.epoch_time,
                                        iteration=trainer.epoch)
        if trainer.epoch == 0:
            model_info = {
                'model/parameters': get_num_params(trainer.model),
                'model/GFLOPs': round(get_flops(trainer.model), 3),
                'model/speed(ms)': round(trainer.validator.speed['inference'], 3)}
            for k, v in model_info.items():
                task.get_logger().report_single_value(k, v)


def on_val_end(validator):
    if Task.current_task():
        # Log val_labels and val_pred
        _log_debug_samples(sorted(validator.save_dir.glob('val*.jpg')), 'Validation')


def on_train_end(trainer):
    task = Task.current_task()
    if task:
        # Log final results, CM matrix + PR plots
        files = ['results.png', 'confusion_matrix.png', *(f'{x}_curve.png' for x in ('F1', 'PR', 'P', 'R'))]
        files = [(trainer.save_dir / f) for f in files if (trainer.save_dir / f).exists()]  # filter
        for f in files:
            _log_plot(title=f.stem, plot_path=f)
        # Report final metrics
        for k, v in trainer.validator.metrics.results_dict.items():
            task.get_logger().report_single_value(k, v)
        # Log the final model
        task.update_output_model(model_path=str(trainer.best), model_name=trainer.args.name, auto_delete_file=False)


callbacks = {
    'on_pretrain_routine_start': on_pretrain_routine_start,
    'on_train_epoch_end': on_train_epoch_end,
    'on_fit_epoch_end': on_fit_epoch_end,
    'on_val_end': on_val_end,
    'on_train_end': on_train_end} if clearml else {}
  
  
Posted one year ago

@<1523701118159294464:profile|ExasperatedCrab78> Hey, I deleted the virtual environment and created a new one with Python 3.9 and the necessary dependencies, and now it seems to work! 😄 Thanks for your help! Maybe some packages were conflicting, or something was off with the Python 3.8 version.

  
  
Posted one year ago

Based on the screenshot of your package versions, it does seem like tensorboard is not installed there. We depend on that, because every scalar logged to tensorboard is captured in ClearML too. My guess would be that maybe you installed tensorboard in the wrong virtualenv, for example.

However, you do say you tested it with Tensorboard and even then it didn't work. In that case, are the scalars correctly logged to tensorboard? You should be able to easily check this by doing a run, and then launching tensorboard to see if the scalars are coming in. If they are, but ClearML is not receiving them, it's probably a bug.

Would you mind checking that? In the meantime, I'll launch a version of my own with your script and see if I get the same issues
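To rule out the wrong-virtualenv case, it can help to ask the interpreter that runs the training where it would import tensorboard from. This is a minimal sketch using only the standard library; the helper names `module_location` and `in_virtualenv` are made up for illustration.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would load from.

    Returns None if the module cannot be found (or has no single file,
    as with namespace packages).
    """
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("virtualenv: ", in_virtualenv())
    print("tensorboard:", module_location("tensorboard") or "not importable")
```

If the printed interpreter path is not the one inside the expected environment, or tensorboard shows as not importable, the package was most likely installed somewhere else.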

  
  
Posted one year ago

One more thing: are you running the snippet inside a Jupyter notebook? (I'm asking because you have Jupyter in your environment.)

  
  
Posted one year ago

It actually works on my local machine too, where I am using Python 3.9. The issue happens when training on my university's GPU cluster, where I am using Python 3.8.2. I will try to create a new virtual environment with Python 3.10.4 and see if it works then 🙂

  
  
Posted one year ago

What version of ClearML are you using?

  
  
Posted one year ago

It says 1.10.3
image

  
  
Posted one year ago

Thank you for your assistance! These are the packages installed in the environment:

absl-py==1.4.0
aiofiles==22.1.0
aiosqlite==0.18.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
attrs==22.2.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
clearml==1.10.3
cmake==3.26.1
comm==0.1.3
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.16.3
filelock==3.11.0
fonttools==4.39.3
fqdn==1.5.1
furl==2.1.3
google-auth==2.17.2
google-auth-oauthlib==1.0.0
grpcio==1.53.0
htmlmin==0.1.12
idna==3.4
ImageHash==4.3.1
importlib-metadata==6.2.0
importlib-resources==5.12.0
install==1.3.5
ipykernel==6.22.0
ipython==8.12.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.8.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
jupyterlab_server==2.22.0
kiwisolver==1.4.4
lit==16.0.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.6.3
matplotlib-inline==0.1.6
mistune==2.0.5
mpmath==1.3.0
multimethod==1.9.1
nbclassic==0.5.5
nbclient==0.7.3
nbconvert==7.3.0
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==3.1
notebook==6.5.4
notebook_shim==0.2.2
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
opencv-python==4.7.0.72
orderedmultidict==1.0.1
packaging==23.0
pandas==1.5.3
pandas-profiling==3.6.6
pandocfilters==1.5.0
parso==0.8.3
pathlib2==2.3.7.post1
patsy==0.5.3
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
protobuf==4.22.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.10.7
Pygments==2.14.0
PyJWT==2.4.0
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.2
qtconsole==5.4.2
QtPy==2.3.1
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
scipy==1.9.3
seaborn==0.12.2
Send2Trash==1.8.0
sentry-sdk==1.19.1
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
stack-data==0.6.2
statsmodels==0.13.5
sympy==1.11.1
tangled-up-in-unicode==0.2.0
tensorboard==2.12.1
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
terminado==0.17.1
thop==0.1.1.post2209072238
tinycss2==1.2.1
tomli==2.0.1
torch==2.0.0
torchvision==0.15.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
triton==2.0.0
typeguard==2.13.3
typing_extensions==4.5.0
tzdata==2023.3
ultralytics==8.0.71
uri-template==1.2.0
urllib3==1.26.15
visions==0.7.5
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.2.3
widgetsnbextension==4.0.7
y-py==0.5.9
ydata-profiling==4.1.2
ypy-websocket==0.8.2
zipp==3.15.0

  
  
Posted one year ago

I am not running in notebook

  
  
Posted one year ago

However, I get these warnings:

TensorFlow installation not found - running with reduced feature set.
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.29' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/haakobh/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
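These warnings say that the bundled tensorboard_data_server binary wants newer glibc symbols than the cluster provides. A rough way to confirm which versions are missing is to compare the system glibc against the versions named in the warnings. The sketch below uses only `platform.libc_ver()` from the standard library; the helper names are made up for illustration, and whether this mismatch has anything to do with the missing scalars is not established in this thread.

```python
import platform


def version_tuple(text):
    """Turn a dotted version such as '2.34' into a comparable tuple (2, 34)."""
    return tuple(int(part) for part in text.split("."))


def missing_glibc_versions(found, required):
    """Return the required glibc versions that are newer than the one found."""
    have = version_tuple(found)
    return [v for v in required if version_tuple(v) > have]


if __name__ == "__main__":
    # Versions named in the warnings above.
    required = ["2.29", "2.32", "2.33", "2.34"]
    lib, found = platform.libc_ver()
    if lib == "glibc" and found:
        print("glibc", found, "missing:", missing_glibc_versions(found, required))
    else:
        print("Could not detect glibc on this system")
```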

  
  
Posted one year ago

Hi @<1558986839216361472:profile|FuzzyCentipede59> ! Can you share some snippets of your code, and tell us what you expect to see vs what you actually see is happening?

  
  
Posted one year ago

With tensorboard I get these plots:
image

  
  
Posted one year ago

Interesting! I'm glad to know it's working now, only I now really want to know what caused it 😄 Let me know if you ever do find out!

  
  
Posted one year ago