Interesting! I'm glad to know it's working now, only I now really want to know what caused it 😄 Let me know if you ever do find out!
@<1523701118159294464:profile|ExasperatedCrab78> Hey, I deleted the virtual environment and created a new one with Python 3.9 and the necessary dependencies, and now it seems to work! 😄 Thanks for your help! Maybe some packages were interfering with each other, or something was off with the Python 3.8 version.
It actually works on my local machine as well, where I am using Python 3.9. The issue happens when training on the GPU cluster at my university, where I am using Python 3.8.2. I will try to create a new virtual environment with Python 3.10.4 and see if it works then 🙂
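Since the same script behaves differently on the two machines, one way to narrow it down might be to save `pip freeze` from each environment and compare them. Below is a minimal sketch of that comparison, assuming plain `name==version` lines; the helper names `parse_freeze` and `diff_freeze` and the sample data are hypothetical and not part of any tool mentioned in this thread.

```python
def parse_freeze(text):
    """Parse `pip freeze` output into a {normalized-name: version} dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not simple name==version pins
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Treat jupyter_core and jupyter-core as the same package
        packages[name.strip().lower().replace("_", "-")] = version.strip()
    return packages


def diff_freeze(working, failing):
    """Return {name: (working_version, failing_version)} for every mismatch.

    A version of None means the package is absent from that environment.
    """
    a, b = parse_freeze(working), parse_freeze(failing)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    # Hypothetical sample data; real input would be the two saved freeze files.
    local = "clearml==1.10.3\ntensorboard==2.12.1\ntorch==2.0.0\n"
    cluster = "clearml==1.10.3\ntorch==2.0.0\nprotobuf==4.22.1\n"
    for name, (ours, theirs) in diff_freeze(local, cluster).items():
        print(f"{name}: local={ours} cluster={theirs}")
```

Only packages that differ are reported, so a missing or mismatched tensorboard would stand out immediately.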
I'm still struggling to reproduce the issue. Trying on my own PC locally as well as on Google Colab yields nothing.
The fact that you do get tensorboard logs, but none of them are captured by ClearML, means there might be something wrong with our tensorboard bindings, but it's hard to pinpoint exactly what if I can't get it to fail like yours 😅 Let me try to install exactly your environment using your packages above. Which Python version are you using?
One more thing: are you running the snippet inside a Jupyter notebook? (I'm wondering because you have Jupyter in your environment.)
However, I get these warnings:
TensorFlow installation not found - running with reduced feature set.
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.29' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/haakobh/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.33' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
/cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by /cluster/home/project_tdt4265/.venv/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
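These warnings say that the prebuilt `tensorboard_data_server` binary asks for glibc symbol versions that the cluster's `/lib64/libc.so.6` does not provide. A quick way to confirm which glibc the cluster actually has might look like the sketch below. It uses only the standard library; the helper names `parse_version` and `glibc_satisfies` are made up for illustration, and the list of required versions is simply copied from the warnings.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.29' into a tuple of ints: (2, 29)."""
    return tuple(int(part) for part in text.split("."))


def glibc_satisfies(installed, required):
    """True when the installed glibc version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    # platform.libc_ver() returns (lib, version); both are empty strings
    # when the C library cannot be determined.
    lib, version = platform.libc_ver()
    print(f"C library: {lib or 'unknown'} {version or 'unknown'}")
    if lib == "glibc" and version:
        # Versions named in the warnings
        for required in ("2.29", "2.32", "2.33", "2.34"):
            status = "ok" if glibc_satisfies(version, required) else "too old"
            print(f"GLIBC_{required}: {status}")
```

Comparing tuples of integers rather than strings matters here: as strings, "2.9" would sort after "2.29".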
Thank you for your assistance! These are the packages installed in the environment:
absl-py==1.4.0
aiofiles==22.1.0
aiosqlite==0.18.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
arrow==1.2.3
asttokens==2.2.1
attrs==22.2.0
Babel==2.12.1
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
cachetools==5.3.0
certifi==2022.12.7
cffi==1.15.1
charset-normalizer==3.1.0
clearml==1.10.3
cmake==3.26.1
comm==0.1.3
contourpy==1.0.7
cycler==0.11.0
debugpy==1.6.7
decorator==5.1.1
defusedxml==0.7.1
executing==1.2.0
fastjsonschema==2.16.3
filelock==3.11.0
fonttools==4.39.3
fqdn==1.5.1
furl==2.1.3
google-auth==2.17.2
google-auth-oauthlib==1.0.0
grpcio==1.53.0
htmlmin==0.1.12
idna==3.4
ImageHash==4.3.1
importlib-metadata==6.2.0
importlib-resources==5.12.0
install==1.3.5
ipykernel==6.22.0
ipython==8.12.0
ipython-genutils==0.2.0
ipywidgets==8.0.6
isoduration==20.11.0
jedi==0.18.2
Jinja2==3.1.2
joblib==1.2.0
json5==0.9.11
jsonpointer==2.3
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.6.3
jupyter-ydoc==0.2.3
jupyter_client==8.1.0
jupyter_core==5.3.0
jupyter_server==2.5.0
jupyter_server_fileid==0.8.0
jupyter_server_terminals==0.4.4
jupyter_server_ydoc==0.8.0
jupyterlab==3.6.3
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.7
jupyterlab_server==2.22.0
kiwisolver==1.4.4
lit==16.0.0
Markdown==3.4.3
MarkupSafe==2.1.2
matplotlib==3.6.3
matplotlib-inline==0.1.6
mistune==2.0.5
mpmath==1.3.0
multimethod==1.9.1
nbclassic==0.5.5
nbclient==0.7.3
nbconvert==7.3.0
nbformat==5.8.0
nest-asyncio==1.5.6
networkx==3.1
notebook==6.5.4
notebook_shim==0.2.2
numpy==1.23.5
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-cupti-cu11==11.7.101
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.2.10.91
nvidia-cusolver-cu11==11.4.0.1
nvidia-cusparse-cu11==11.7.4.91
nvidia-nccl-cu11==2.14.3
nvidia-nvtx-cu11==11.7.91
oauthlib==3.2.2
opencv-python==4.7.0.72
orderedmultidict==1.0.1
packaging==23.0
pandas==1.5.3
pandas-profiling==3.6.6
pandocfilters==1.5.0
parso==0.8.3
pathlib2==2.3.7.post1
patsy==0.5.3
pexpect==4.8.0
phik==0.12.3
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
platformdirs==3.2.0
prometheus-client==0.16.0
prompt-toolkit==3.0.38
protobuf==4.22.1
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
pydantic==1.10.7
Pygments==2.14.0
PyJWT==2.4.0
pyparsing==3.0.9
pyrsistent==0.19.3
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==25.0.2
qtconsole==5.4.2
QtPy==2.3.1
requests==2.28.2
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rsa==4.9
scipy==1.9.3
seaborn==0.12.2
Send2Trash==1.8.0
sentry-sdk==1.19.1
six==1.16.0
sniffio==1.3.0
soupsieve==2.4
stack-data==0.6.2
statsmodels==0.13.5
sympy==1.11.1
tangled-up-in-unicode==0.2.0
tensorboard==2.12.1
tensorboard-data-server==0.7.0
tensorboard-plugin-wit==1.8.1
terminado==0.17.1
thop==0.1.1.post2209072238
tinycss2==1.2.1
tomli==2.0.1
torch==2.0.0
torchvision==0.15.1
tornado==6.2
tqdm==4.64.1
traitlets==5.9.0
triton==2.0.0
typeguard==2.13.3
typing_extensions==4.5.0
tzdata==2023.3
ultralytics==8.0.71
uri-template==1.2.0
urllib3==1.26.15
visions==0.7.5
wcwidth==0.2.6
webcolors==1.13
webencodings==0.5.1
websocket-client==1.5.1
Werkzeug==2.2.3
widgetsnbextension==4.0.7
y-py==0.5.9
ydata-profiling==4.1.2
ypy-websocket==0.8.2
zipp==3.15.0
An update: using your code (the snippet above), I got no scalars when I installed only the ultralytics and clearml packages with pip, because tensorboard is indeed not installed in that case. When I do install tensorboard, the metrics come in as normal, so I can't reproduce the issue when tensorboard is correctly installed. That said, maybe we should look at not having this dependency 🤔
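Because the scalars depend on tensorboard being importable by the same interpreter that runs the training, it may help to check this from inside that interpreter rather than from the shell. The following is a rough sketch using only the standard library; `describe_package` is a hypothetical helper, and the package names are the ones discussed in this thread.

```python
import importlib.util
import sys
from importlib import metadata


def describe_package(name):
    """Report whether `name` is importable from the running interpreter."""
    spec = importlib.util.find_spec(name)
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # No installed distribution with this name
        version = None
    return {
        "importable": spec is not None,
        "version": version,
        "location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    # Shows which Python and which virtual environment are really in use
    print("interpreter:", sys.executable)
    print("environment:", sys.prefix)
    for package in ("tensorboard", "clearml", "ultralytics"):
        print(package, describe_package(package))
```

Running this with the same command used to launch training (for example inside the cluster job script) shows whether the job picks up the virtual environment you expect.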
Would you mind posting a pip freeze of your environment that you're using to run yolo?
Based on the screenshot of your package versions, it does seem like tensorboard is not installed there. We depend on that, because every scalar logged to tensorboard is captured in ClearML too. My guess would be that you installed tensorboard in the wrong virtualenv, for example.
However, you do say you tested it with Tensorboard and even then it didn't work. In that case, are the scalars correctly logged to tensorboard? You should be able to easily check this by doing a run, and then launching tensorboard to see if the scalars are coming in. If they are, but ClearML is not receiving them, it's probably a bug.
Would you mind checking that? In the meantime, I'll launch a version of my own with your script and see if I get the same issues
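As a first check before launching tensorboard itself, one could look for the event files it writes and see whether they exist and are recent. Here is a minimal sketch, assuming the run output lives under a directory such as `runs` (an assumption; use your actual save directory). `find_event_files` is a hypothetical helper, and it only inspects file names, sizes, and modification times, not the scalar values inside.

```python
import time
from pathlib import Path


def find_event_files(root):
    """Return (path, size_in_bytes, seconds_since_modified) for each
    TensorBoard event file found under `root`, most recently modified first."""
    now = time.time()
    found = []
    # TensorBoard writers name their files events.out.tfevents.<suffix>
    for path in Path(root).rglob("events.out.tfevents.*"):
        if path.is_file():
            stat = path.stat()
            found.append((path, stat.st_size, now - stat.st_mtime))
    return sorted(found, key=lambda item: item[2])


if __name__ == "__main__":
    # "runs" is an assumed output directory; pass your actual save directory.
    for path, size, age in find_event_files("runs"):
        print(f"{path}  {size} bytes  modified {age:.0f}s ago")
```

If no files show up, or the newest one stops being modified while training continues, the scalars are not reaching tensorboard either, which points away from the ClearML bindings.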
@<1523701118159294464:profile|ExasperatedCrab78> Sure! Here is my train file:
from ultralytics import YOLO

# Load a model
model = YOLO(model="yolov8m.pt")  # load a pretrained model (recommended for training)

# Train the model
model.train(
    data="data.yaml",
    epochs=200,
    imgsz=640,
    label_smoothing=0.1,
    shear=0.01,
    perspective=0.0001,
    mosaic=0.5,
    mixup=0.1,
)
and here is the ClearML callback from the YOLOv8 source code:
# Ultralytics YOLO 🚀, GPL-3.0 license
import re

import matplotlib.image as mpimg
import matplotlib.pyplot as plt

from ultralytics.yolo.utils import LOGGER, TESTS_RUNNING
from ultralytics.yolo.utils.torch_utils import get_flops, get_num_params

try:
    import clearml
    from clearml import Task
    from clearml.binding.frameworks.pytorch_bind import PatchPyTorchModelIO
    from clearml.binding.matplotlib_bind import PatchedMatplotlib

    assert hasattr(clearml, '__version__')  # verify package is not directory
    assert not TESTS_RUNNING  # do not log pytest
except (ImportError, AssertionError):
    clearml = None


def _log_debug_samples(files, title='Debug Samples'):
    """
    Log files (images) as debug samples in the ClearML task.

    arguments:
    files (List(PosixPath)) a list of file paths in PosixPath format
    title (str) A title that groups together images with the same values
    """
    task = Task.current_task()
    if task:
        for f in files:
            if f.exists():
                it = re.search(r'_batch(\d+)', f.name)
                iteration = int(it.groups()[0]) if it else 0
                task.get_logger().report_image(title=title,
                                               series=f.name.replace(it.group(), ''),
                                               local_path=str(f),
                                               iteration=iteration)


def _log_plot(title, plot_path):
    """
    Log image as plot in the plot section of ClearML

    arguments:
    title (str) Title of the plot
    plot_path (PosixPath or str) Path to the saved image file
    """
    img = mpimg.imread(plot_path)
    fig = plt.figure()
    ax = fig.add_axes([0, 0, 1, 1], frameon=False, aspect='auto', xticks=[], yticks=[])  # no ticks
    ax.imshow(img)

    Task.current_task().get_logger().report_matplotlib_figure(title, '', figure=fig, report_interactive=False)


def on_pretrain_routine_start(trainer):
    try:
        task = Task.current_task()
        if task:
            # Make sure the automatic pytorch and matplotlib bindings are disabled!
            # We are logging these plots and model files manually in the integration
            PatchPyTorchModelIO.update_current_task(None)
            PatchedMatplotlib.update_current_task(None)
        else:
            task = Task.init(project_name=trainer.args.project or 'YOLOv8',
                             task_name=trainer.args.name,
                             tags=['YOLOv8'],
                             output_uri=True,
                             reuse_last_task_id=False,
                             auto_connect_frameworks={
                                 'pytorch': False,
                                 'matplotlib': False})
            LOGGER.warning('ClearML Initialized a new task. If you want to run remotely, '
                           'please add clearml-init and connect your arguments before initializing YOLO.')
        task.connect(vars(trainer.args), name='General')
    except Exception as e:
        LOGGER.warning(f'WARNING ⚠️ ClearML installed but not initialized correctly, not logging this run. {e}')


def on_train_epoch_end(trainer):
    if trainer.epoch == 1 and Task.current_task():
        _log_debug_samples(sorted(trainer.save_dir.glob('train_batch*.jpg')), 'Mosaic')


def on_fit_epoch_end(trainer):
    task = Task.current_task()
    if task:
        # You should have access to the validation bboxes under jdict
        task.get_logger().report_scalar(title='Epoch Time',
                                        series='Epoch Time',
                                        value=trainer.epoch_time,
                                        iteration=trainer.epoch)
        if trainer.epoch == 0:
            model_info = {
                'model/parameters': get_num_params(trainer.model),
                'model/GFLOPs': round(get_flops(trainer.model), 3),
                'model/speed(ms)': round(trainer.validator.speed['inference'], 3)}
            for k, v in model_info.items():
                task.get_logger().report_single_value(k, v)


def on_val_end(validator):
    if Task.current_task():
        # Log val_labels and val_pred
        _log_debug_samples(sorted(validator.save_dir.glob('val*.jpg')), 'Validation')


def on_train_end(trainer):
    task = Task.current_task()
    if task:
        # Log final results, CM matrix + PR plots
        files = ['results.png', 'confusion_matrix.png', *(f'{x}_curve.png' for x in ('F1', 'PR', 'P', 'R'))]
        files = [(trainer.save_dir / f) for f in files if (trainer.save_dir / f).exists()]  # filter
        for f in files:
            _log_plot(title=f.stem, plot_path=f)
        # Report final metrics
        for k, v in trainer.validator.metrics.results_dict.items():
            task.get_logger().report_single_value(k, v)
        # Log the final model
        task.update_output_model(model_path=str(trainer.best), model_name=trainer.args.name, auto_delete_file=False)


callbacks = {
    'on_pretrain_routine_start': on_pretrain_routine_start,
    'on_train_epoch_end': on_train_epoch_end,
    'on_fit_epoch_end': on_fit_epoch_end,
    'on_val_end': on_val_end,
    'on_train_end': on_train_end} if clearml else {}
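One thing worth noticing in the module above is its last statement: the callbacks are only exported when the `clearml` import and both assertions succeed, otherwise the dictionary is empty and nothing is reported, without any error. The pattern can be illustrated in isolation with the standard-library sketch below. All names here (`load_optional`, `build_callbacks`, `run_callbacks`, and the placeholder module names) are hypothetical and only mimic the structure; this is not how Ultralytics dispatches its callbacks internally.

```python
import importlib


def load_optional(module_name):
    """Import a module if available; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def build_callbacks(backend, handlers):
    """Register handlers only when the optional backend was imported."""
    return dict(handlers) if backend is not None else {}


def run_callbacks(callbacks, event, *args):
    """Call the handler for `event` if one is registered; report whether it ran."""
    handler = callbacks.get(event)
    if handler is None:
        return False
    handler(*args)
    return True


if __name__ == "__main__":
    seen = []
    handlers = {"on_train_end": seen.append}
    # "json" stands in for an installed backend, the second name for a missing one.
    for name in ("json", "hypothetical_missing_backend"):
        callbacks = build_callbacks(load_optional(name), handlers)
        print(name, "->", run_callbacks(callbacks, "on_train_end", name))
```

With the missing backend the handler is silently skipped, which is why a broken or partial environment can look like "training works but nothing is logged".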
@<1558986839216361472:profile|FuzzyCentipede59> Would you mind sharing how you're running the training? i.e. a minimal code example so we can reproduce the issue?
@<1523701070390366208:profile|CostlyOstrich36> Do you know what the problem might be?
I use the implementation of yolov8:
And the menu says it is at iteration 2, even though the console log on the web page says it is at epoch 8.
The first image shows how it should look. In the second image, however, the model is actually training on the 7th epoch, but the scalars are not updated; they are stuck on iteration 0.
Hi @<1558986839216361472:profile|FuzzyCentipede59> ! Can you share some snippets of your code, and tell us what you expect to see vs what you actually see is happening?