
Answered

Hi! I need help debugging the following issue, please. I'm training a CNN and plotting the confusion matrices for train and val in each epoch.
When I get to epoch 101, the UI kind of breaks: it starts showing me the images for epoch 1. When I right-click on the image to see what the URL is, the images for epoch 1 and epoch 101 have the same URL. Has anyone had a similar issue?

  				
Posted 3 years ago by MuddySquid7


Answers 30

I still wonder how no one noticed ... (maybe 100 unique title/series reports is a relatively high threshold)

  				
Posted 3 years ago by AgitatedDove14

I'm plotting the confusion matrices the regular way: plot, then read the figure from a buffer to create the tensor, and save the tensor.

  				
Posted 3 years ago by MuddySquid7

Awesome! I'll let you know if it works now

  				
Posted 3 years ago by MuddySquid7

I was using clearml == 0.17.5 and I also had this issue

I think it was introduced when we moved to subprocess reporting, with 0.17.5
You can disable it with the following in clearml.conf:
sdk.development.report_use_subprocess = false
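For reference, the same setting written in nested form. This assumes clearml.conf is parsed as HOCON, where a dotted key and nested blocks are equivalent; it is a sketch to merge into an existing sdk section, not a complete file:

```
sdk {
    development {
        # Assumption from this thread: false switches reporting back to a thread
        report_use_subprocess: false
    }
}
```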

  				
Posted 3 years ago by AgitatedDove14

I see the correct confusion matrices in tensorboard

  				
Posted 3 years ago by MuddySquid7

So the TB issue that was reported was that images were not logged.
We are now talking about the caching, which is actually a UI thing. Which clearml-server version are you using?
And where are the images stored (the default files server, or S3/GS etc.)?

  				
Posted 3 years ago by AgitatedDove14

great! thank you for such a quick response!

  				
Posted 3 years ago by MuddySquid7

don't think so, I'm saving the model at the end of each epoch

  				
Posted 3 years ago by MuddySquid7

MuddySquid7 the fix was pushed to GitHub, you can now install directly from the repo:
pip install git+

  				
Posted 3 years ago by AgitatedDove14

oh...so is this a bug?

It was always a bug, only an elusive one 😉
Anyhow, I'll make sure we push a fix to GitHub, an RC is planned for later this week, it will contain it

  				
Posted 3 years ago by AgitatedDove14

why doesn't this happen on my other experiments?

The same 100+ reports?
(My new theory is that calling Task.reload() will fix it, and it might be called internally for the other experiments, like when reporting models/artifacts.)
Could that be the case?

  				
Posted 3 years ago by AgitatedDove14

now it's working!

  				
Posted 3 years ago by MuddySquid7

it's very odd for me too, I have another project running trainings longer than 100 epochs and I don't have this issue

  				
Posted 3 years ago by MuddySquid7

oh wait, I was using clearml == 0.17.5 and I also had this issue

  				
Posted 3 years ago by MuddySquid7

Hi MuddySquid7 issue is verified, v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!

  				
Posted 3 years ago by AgitatedDove14

the issue is that the confusion matrix showing for epoch 101 is in fact the one for epoch 1.
The images are stored in the default files server

  				
Posted 3 years ago by MuddySquid7

From creating the event to actually sending it ... 30 min sounds like enough "time"...

  				
Posted 3 years ago by AgitatedDove14

I don't understand though. Why doesn't this happen on my other experiments?

  				
Posted 3 years ago by MuddySquid7

I need to wait 100 epochs 😅

  				
Posted 3 years ago by MuddySquid7

Released:
https://github.com/allegroai/clearml/releases/tag/1.1.1

  				
Posted 3 years ago by AgitatedDove14

Okay, I found the issue (I think).
If the images are reported very quickly, it will "decide" you are about to overwrite the previous one (i.e. 101 overwriting 0, which makes sense; the bug was that it would block 101 from uploading rather than 0 🙂)
Test fix:
in /backend_interface/metrics/events.py, line 292, change:

    last_count = self._get_metric_count(self.metric, self.variant, next=False)
    if abs(self._count - last_count) > int(self._file_history_size):
        output = None
    elif isinstance(self._image_data, (six.StringIO, six.BytesIO)):

to

    if isinstance(self._image_data, (six.StringIO, six.BytesIO)):

  				
Posted 3 years ago by AgitatedDove14

I'm creating them for tensorboard yes, and they appear under the debug samples tab

  				
Posted 3 years ago by MuddySquid7

I'm afraid I'm still having the same issue..

  				
Posted 3 years ago by MuddySquid7

MuddySquid7 I might have found something, and this is very, very odd: it seems it will not upload any new images once the history size is exceeded, which is surprising considering the number of users actively using this feature...
Do you want to try a hack to see if it solves your issue?

  				
Posted 3 years ago by AgitatedDove14

how quick is "very quickly"? we are talking about maybe 30 minutes to reach 100 epochs

  				
Posted 3 years ago by MuddySquid7

yes please 🙂

  				
Posted 3 years ago by MuddySquid7

MuddySquid7 do you mean you are creating them with TB, or are you uploading them as debug images?
Specifically, in the ClearML UI, do you see them under the "plots" tab or the "debug samples" tab?

  				
Posted 3 years ago by AgitatedDove14

thanks!!

  				
Posted 3 years ago by MuddySquid7

oh...so is this a bug?

  				
Posted 3 years ago by MuddySquid7

Okay, I found it: this is because the newer versions send the events/images from a subprocess (it used to be a thread).
The object is created in the main process, which updates the file index (in a round-robin manner), but the check itself happens in the subprocess, which is not "aware" of the used indexes (i.e. it always sees 0, so once the history size is exceeded, it skips the upload).
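To make the effect concrete, here is a toy model of that behaviour. It is only a sketch and not ClearML's actual code: slot_name, simulate, the file-name pattern and the history size of 100 are all hypothetical names and values chosen for illustration, and the "subprocess" is modelled as a function that always reports index 0.

```python
# Toy model of the behaviour described above; not ClearML's actual code.
FILE_HISTORY_SIZE = 100  # assumed value, based on the 100-report threshold in this thread


def slot_name(count, history_size=FILE_HISTORY_SIZE):
    """Round-robin file name: report `count` reuses slot `count % history_size`."""
    return "confusion_matrix_{:05d}.png".format(count % history_size)


def simulate(num_reports, last_count_seen_by_uploader):
    """Return {file name: report number whose image is stored under that name}."""
    stored = {}
    for count in range(num_reports):
        last_count = last_count_seen_by_uploader(count)
        if abs(count - last_count) > FILE_HISTORY_SIZE:
            continue  # upload skipped, the older image stays in this slot
        stored[slot_name(count)] = count
    return stored


# Uploader that never sees the updated index: it always reads 0.
buggy = simulate(102, lambda count: 0)
# Uploader that sees the same index as the code creating the event.
aware = simulate(102, lambda count: count)

print(slot_name(1) == slot_name(101))  # True: both reports share one file name / URL
print(buggy[slot_name(101)])           # 1: the old image is still there
print(aware[slot_name(101)])           # 101: the new image replaced it
```

In this model, reports 1 and 101 map to the same file name, which matches the identical URLs in the question; when the uploader always sees index 0, report 101 is skipped and the slot keeps showing the image from report 1.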

  				
Posted 3 years ago by AgitatedDove14
