Okay, I found the issue (I think).
If the images are reported very quickly, it will "decide" you are about to overwrite the previous one (i.e. 101 -> overwriting 0, which makes sense; the bug was that it would disable the 101 from uploading and not the 0 🙂 )
Test fix:
In /backend_interface/metrics/events.py, line 292, change:
    last_count = self._get_metric_count(self.metric, self.variant, next=False)
    if abs(self._count - last_count) > int(self._file_history_size):
        output = None
    elif isinstance(self._image_data, (six.StringIO, six.BytesIO)):
to:
    if isinstance(self._image_data, (six.StringIO, six.BytesIO)):
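To make the "101 -> overwriting 0" remark concrete, here is a minimal sketch of round-robin slot reuse. It is not ClearML code: `slot_for` and `HISTORY_SIZE = 100` are hypothetical names and values chosen only to match the example in this thread.

```python
# Toy model of round-robin slot reuse; not taken from the ClearML sources.
HISTORY_SIZE = 100  # assumed history size, chosen to match the "101 -> 0" example


def slot_for(report_index: int, history_size: int = HISTORY_SIZE) -> int:
    """Return the storage slot used by the report with this 0-based index."""
    return report_index % history_size


# The 101st report (0-based index 100) lands in the same slot as the 1st one.
first_slot = slot_for(0)
slot_of_101st = slot_for(100)
```

Under these assumptions the 101st image legitimately replaces the first one, which is the overwrite that "makes sense" above.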
I'm plotting the confusion matrices the regular way, plot, then read figure from buffer to create the tensor, and save the tensor
I'm creating them for tensorboard yes, and they appear under the debug samples tab
don't think so, I'm saving the model at the end of each epoch
I was using clearml == 0.17.5 and I also had this issue
I think it was introduced when we moved to subprocess reporting, with 0.17.5
You can disable it with the following in clearml.conf: sdk.development.report_use_subprocess = false
how quick is "very quickly"? we are talking about maybe 30 minutes to reach 100 epochs
From creating the event to actually sending it ... 30 min sounds like enough "time"...
why doesn't this happen on my other experiments?
same 100+ reports ?
(My new theory is that calling Task.reload() will fix it, and it might be called internally for the other experiments, like when reporting models/artifacts)
Could that be the case ?
I'm afraid I'm still having the same issue..
MuddySquid7 I might have found something, and this is very odd: it seems it will not upload any new images past the history size, which is surprising considering the number of users actively using this feature...
Do you want to try a hack to see if it solved your issue ?
Okay I found it, this is due to the fact the newer versions are sending the events/images in a subprocess (it used to be a thread).
The creation of the object is done on the main process, updating the file index (in a round-robin manner), but the check itself happens on the subprocess, which is not "aware" of the used indexes (i.e. it is always 0, hence when exceeding the history size, it skips it)
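A toy model of that explanation, for illustration only: it reuses the shape of the `abs(count - last_count) > history_size` check from the test fix, but `should_skip`, `uploaded_events`, and the history size of 100 are hypothetical and not the actual ClearML implementation. The `counter_is_shared` flag stands in for "the check sees the index updated by the main process" versus "the check runs in a subprocess where the index is always 0".

```python
# Toy model of the stale-counter check; names and values are assumptions.
def should_skip(event_count: int, last_count: int, history_size: int) -> bool:
    """Same shape as the check in the test fix: skip when the gap exceeds the history size."""
    return abs(event_count - last_count) > history_size


def uploaded_events(total: int, history_size: int, counter_is_shared: bool) -> list:
    """Return the event counts that pass the check and would be uploaded."""
    uploaded = []
    for count in range(total):
        # With a shared counter the check sees the up-to-date index;
        # in the subprocess case it only ever sees 0.
        last_count = count if counter_is_shared else 0
        if not should_skip(count, last_count, history_size):
            uploaded.append(count)
    return uploaded


with_shared_counter = uploaded_events(105, 100, counter_is_shared=True)
with_stale_counter = uploaded_events(105, 100, counter_is_shared=False)
```

In this model every event is uploaded when the counter is shared, while with the stale counter everything past the history size is dropped, which matches the symptom of new images no longer showing up after roughly 100 reports.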
I don't understand though..why doesn't this happen on my other experiments?
MuddySquid7 the fix was pushed to GitHub, you can now install directly from the repo: pip install git+
So the TB issue was reported images were not logged.
We are now talking about the caching, which is actually a UI thing. Which clearml-server version are you using?
And where are the images stored (the default files server or is it S3/GS etc.) ?
I still wonder how no one noticed ... (maybe 100 unique title/series report is relatively high threshold)
I see the correct confusion matrices in tensorboard
the issue is that the confusion matrix showing for epoch 101 is in fact the one for epoch 1.
The images are stored in the default files server
oh wait, I was using clearml == 0.17.5 and I also had this issue
oh...so is this a bug?
It was always a bug, only an elusive one 😉
Anyhow, I'll make sure we push a fix to GitHub, an RC is planned for later this week, it will contain it
Hi MuddySquid7 issue is verified, v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!
it's very odd for me too, I have another project running trainings longer than 100 epochs and I don't have this issue
MuddySquid7 you mean you are creating them with TB ? or are you uploading them as debug images ?
Specifically in the ClearML UI, do you have it under "plots" tab or "debug samples" tab ?