I still wonder how no one noticed ... (maybe 100 reports per unique title/series is a relatively high threshold)
why doesn't this happen on my other experiments?
same 100+ reports ?
(My new theory is that calling Task.reload() will fix it, and it might be called internally for the other experiments, like when reporting models/artifacts)
Could that be the case ?
Okay, I found the issue (I think).
If the images are reported very quickly, it will "decide" you are about to overwrite the previous one (i.e. 101 -> overwriting 0, which makes sense; the bug was that it would stop the 101 from uploading rather than the 0 🙂 )
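To make the "101 -> overwriting 0" part concrete, here is a minimal sketch of round-robin file slots. It assumes a history size of 100, and `slot_for` is a hypothetical helper for illustration, not part of ClearML:

```python
# Hypothetical illustration of round-robin file slots; not ClearML's actual code.
FILE_HISTORY_SIZE = 100  # assumed history size per title/series


def slot_for(report_count: int, history_size: int = FILE_HISTORY_SIZE) -> int:
    """Return the file slot used by the report with this zero-based count."""
    return report_count % history_size


# Reports 0..99 each get their own slot; report 100 (the 101st) reuses slot 0.
slots = [slot_for(i) for i in range(101)]
print(slots[0], slots[99], slots[100])
```

So once more than 100 images are reported for the same title/series, each new image is expected to replace the oldest stored one.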
Test fix:
in /backend_interface/metrics/events.py, line 292, change:
    last_count = self._get_metric_count(self.metric, self.variant, next=False)
    if abs(self._count - last_count) > int(self._file_history_size):
        output = None
    elif isinstance(self._image_data, (six.StringIO, six.BytesIO)):
to:
    if isinstance(self._image_data, (six.StringIO, six.BytesIO)):
MuddySquid7 I might have found something, and this is very odd: it seems it will not upload any new images past the history size, which is surprising considering the number of users actively using this feature...
Do you want to try a hack to see if it solved your issue ?
I'm afraid I'm still having the same issue..
I see the correct confusion matrices in tensorboard
it's very odd for me too, I have another project running trainings longer than 100 epochs and I don't have this issue
oh...so is this a bug?
It was always a bug, only an elusive one 😉
Anyhow, I'll make sure we push a fix to GitHub; an RC is planned for later this week and it will contain the fix
Okay I found it, this is due to the fact that the newer versions send the events/images in a subprocess (it used to be a thread).
The object is created on the main process, which updates the file index (in a round-robin manner), but the check itself happens in the subprocess, which is not "aware" of the used indexes (i.e. it is always 0, hence when exceeding the history size, it skips the upload)
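A simplified model of that explanation, to show why a thread worked and a subprocess did not. All names here (`Counters`, `should_upload`, `run`) are hypothetical, the history size of 100 is assumed, and the check only mirrors the condition quoted in the test fix. It is a sketch, not ClearML's implementation:

```python
# Simplified model of the stale-counter bug; names are hypothetical.
FILE_HISTORY_SIZE = 100  # assumed history size


class Counters:
    """Per-process bookkeeping of how many images were created."""

    def __init__(self):
        self.count = 0

    def next(self):
        current = self.count
        self.count += 1
        return current


def should_upload(event_count, last_count, history_size=FILE_HISTORY_SIZE):
    """Mirror of the quoted check: skip events that look too far from the latest count."""
    return abs(event_count - last_count) <= history_size


def run(num_events, shared_state):
    main = Counters()
    # A thread sees the same counters as the main process;
    # a subprocess works on its own copy, which is never advanced.
    worker = main if shared_state else Counters()
    uploaded = []
    for _ in range(num_events):
        event_count = main.next()
        if should_upload(event_count, worker.count):
            uploaded.append(event_count)
    return uploaded


print(len(run(150, shared_state=True)))   # thread-like: counters are shared
print(len(run(150, shared_state=False)))  # subprocess-like: counters are stale
```

In this model the shared-state run uploads all 150 events, while the stale-copy run stops uploading once the event count is more than the history size away from 0, which matches the symptom of no new images after roughly 100 reports.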
I'm plotting the confusion matrices the regular way, plot, then read figure from buffer to create the tensor, and save the tensor
oh wait, I was using clearml == 0.17.5 and I also had this issue
I don't understand though..why doesn't this happen on my other experiments?
don't think so, I'm saving the model at the end of each epoch
MuddySquid7 the fix was pushed to GitHub, you can now install directly from the repo: pip install git+
I was using clearml == 0.17.5 and I also had this issue
I think it was introduced when we moved to subprocess reporting, with 0.17.5
You can disable it with the following in clearml.conf: sdk.development.report_use_subprocess = false
how quick is "very quickly"? we are talking about maybe 30 minutes to reach 100 epochs
From creating the event to actually sending it ... 30 min sounds like enough "time"...
MuddySquid7 you mean you are creating them with TB ? or are you uploading them as debug images ?
Specifically in the ClearML UI, do you have it under "plots" tab or "debug samples" tab ?
the issue is that the confusion matrix showing for epoch 101 is in fact the one for epoch 1.
The images are stored in the default files server
So the TB issue was that reported images were not logged.
We are now talking about the caching, which is actually a UI thing. Which clearml-server version are you using ?
And where are the images stored (the default files server or is it S3/GS etc.) ?
I'm creating them for tensorboard yes, and they appear under the debug samples tab
Hi MuddySquid7 issue is verified, v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!