Answered

Hi! I need help debugging the following issue please. I'm training a CNN and plotting the confusion matrices for train and val in each epoch.
When I get to epoch 101, the UI kind of breaks: it starts showing me the images for epoch 1. When I right-click on the image to see what the URL is, the images for epoch 1 and epoch 101 have the same URL. Has anyone had a similar issue?

  
  
Posted 3 years ago

Answers 30


great! thank you for such a quick response!

  
  
Posted 3 years ago

I still wonder how no one noticed ... (maybe 100 reports for a unique title/series is a relatively high threshold)

  
  
Posted 3 years ago

oh...so is this a bug?

It was always a bug, only an elusive one 😉
Anyhow, I'll make sure we push a fix to GitHub. An RC is planned for later this week, and it will contain the fix.

  
  
Posted 3 years ago

I'm creating them for tensorboard yes, and they appear under the debug samples tab

  
  
Posted 3 years ago

oh...so is this a bug?

  
  
Posted 3 years ago

now it's working!

  
  
Posted 3 years ago

why doesn't this happen on my other experiments?

same 100+ reports ?
(My new theory is that calling Task.reload() will fix it, and it might be called internally for the other experiments, like when reporting models/artifacts)
Could that be the case ?

  
  
Posted 3 years ago

how quick is "very quickly"? we are talking about maybe 30 minutes to reach 100 epochs

  
  
Posted 3 years ago

Okay, I found it: this is because the newer versions send the events/images in a subprocess (it used to be a thread).
The object is created on the main process, which updates the file index (in a round-robin manner), but the check itself happens on the subprocess, which is not "aware" of the used indexes (i.e. it is always 0, hence when exceeding the history size, it skips the upload).
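To make the explanation above concrete, here is a minimal simulation of the stale-counter effect. It is only a sketch, not the actual clearml code: `MetricCounters`, `is_skipped`, and the history size of 100 are hypothetical stand-ins, and the "subprocess" is simulated by a copy of the counters that is never updated.

```python
import copy

HISTORY_SIZE = 100  # assumed value, based on the "100" discussed in this thread


class MetricCounters:
    """Hypothetical stand-in for the per-(metric, variant) report counters."""

    def __init__(self):
        self.counts = {}

    def next_count(self, metric, variant):
        # Called when an event object is created: advance and return the counter.
        key = (metric, variant)
        self.counts[key] = self.counts.get(key, -1) + 1
        return self.counts[key]

    def last_count(self, metric, variant):
        # A process that never saw any update reports 0.
        return self.counts.get((metric, variant), 0)


def is_skipped(event_count, last_count, history_size=HISTORY_SIZE):
    # Same shape as the check quoted in this thread: an event is dropped
    # when it looks older than the history window.
    return abs(event_count - last_count) > history_size


main_counters = MetricCounters()
# Simulated subprocess view: a snapshot taken at start that is never updated.
worker_counters = copy.deepcopy(main_counters)

skipped_in_main, skipped_in_worker = [], []
for _ in range(105):
    count = main_counters.next_count("confusion_matrix", "val")
    if is_skipped(count, main_counters.last_count("confusion_matrix", "val")):
        skipped_in_main.append(count)
    if is_skipped(count, worker_counters.last_count("confusion_matrix", "val")):
        skipped_in_worker.append(count)

print("skipped with up-to-date counters:", skipped_in_main)
print("skipped with stale counters:", skipped_in_worker)
```

With up-to-date counters nothing is skipped, while the stale view starts dropping every event once its count goes past the history size, which matches the symptom of new images never replacing the old ones.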

  
  
Posted 3 years ago

I see the correct confusion matrices in tensorboard

  
  
Posted 3 years ago

I'm afraid I'm still having the same issue.

  
  
Posted 3 years ago

I was using clearml == 0.17.5 and I also had this issue

I think it was introduced when we moved to subprocess reporting, with 0.17.5
You can disable it with the following in clearml.conf:
sdk.development.report_use_subprocess = false

  
  
Posted 3 years ago

Hi MuddySquid7, the issue is verified; v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!

  
  
Posted 3 years ago

don't think so, I'm saving the model at the end of each epoch

  
  
Posted 3 years ago

it's very odd for me too, I have another project running trainings longer than 100 epochs and I don't have this issue

  
  
Posted 3 years ago

Awesome! I'll let you know if it works now

  
  
Posted 3 years ago

I don't understand though. Why doesn't this happen on my other experiments?

  
  
Posted 3 years ago

So the TB issue that was reported was that images were not logged.
We are now talking about caching, which is actually a UI thing. Which clearml-server version are you using?
And where are the images stored (the default files server, or S3/GS etc.)?

  
  
Posted 3 years ago

MuddySquid7 the fix was pushed to GitHub, you can now install directly from the repo:
pip install git+

  
  
Posted 3 years ago

the issue is that the confusion matrix showing for epoch 101 is in fact the one for epoch 1.
The images are stored in the default files server

  
  
Posted 3 years ago

MuddySquid7, do you mean you are creating them with TB, or are you uploading them as debug images?
Specifically, in the ClearML UI, do you have them under the "plots" tab or the "debug samples" tab?

  
  
Posted 3 years ago

I need to wait 100 epochs 😅

  
  
Posted 3 years ago

I'm plotting the confusion matrices the regular way: plot, then read the figure from a buffer to create the tensor, and save the tensor

  
  
Posted 3 years ago

Okay, I found the issue (I think).
If the images are reported very quickly, it will "decide" you are about to overwrite the previous one (i.e. 101 overwriting 0, which makes sense; the bug was that it would disable 101 from uploading and not 0 🙂)
Test fix:
in /backend_interface/metrics/events.py, line 292, change:
    last_count = self._get_metric_count(self.metric, self.variant, next=False)
    if abs(self._count - last_count) > int(self._file_history_size):
        output = None
    elif isinstance(self._image_data, (six.StringIO, six.BytesIO)):
to:
    if isinstance(self._image_data, (six.StringIO, six.BytesIO)):
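For anyone wondering why epoch 1 and epoch 101 end up with the same URL in the first place, here is a rough sketch of round-robin file naming. It assumes a history size of 100 and a slot index that is simply the report count modulo the history size; the function names, the `files://debug` base, and the file name pattern are made up for illustration and are not the real clearml layout.

```python
FILE_HISTORY_SIZE = 100  # assumed value, based on the "100" discussed in this thread


def slot_for_report(report_count, history_size=FILE_HISTORY_SIZE):
    """Return the round-robin slot for the Nth report (0-based count)."""
    return report_count % history_size


def url_for_report(title, series, report_count, base="files://debug"):
    # Hypothetical URL layout: one file per slot, reused once the history wraps.
    slot = slot_for_report(report_count)
    return f"{base}/{title}/{series}_{slot:08d}.png"


first = url_for_report("confusion_matrix", "val", 0)    # 1st report (epoch 1)
second = url_for_report("confusion_matrix", "val", 1)   # 2nd report (epoch 2)
wrapped = url_for_report("confusion_matrix", "val", 100)  # 101st report (epoch 101)

print(first)
print(wrapped)
print("same URL after wrap-around:", first == wrapped)
```

Sharing a slot is expected with a fixed history size. The problem described in this thread is that the newer image was the one dropped, so the old file stayed behind that shared URL.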

  
  
Posted 3 years ago

From creating the event to actually sending it ... 30 min sounds like enough "time"...

  
  
Posted 3 years ago

thanks!!

  
  
Posted 3 years ago

yes please 🙂

  
  
Posted 3 years ago

MuddySquid7, I might have found something, and this is very odd: it seems it will not upload any new images past the history size, which is surprising considering the number of users actively using this feature...
Do you want to try a hack to see if it solves your issue?

  
  
Posted 3 years ago

oh wait, I was using clearml == 0.17.5 and I also had this issue

  
  
Posted 3 years ago