
Problem: Excessive Scalar Storage from TensorBoard Integration Causing Out-of-Memory on ClearML Server

Hi team,
We’ve run into a problem with ClearML ingesting extremely large numbers of scalars from TensorBoard via auto_connect_frameworks (~800k samples per time series), apparently because every scalar value is stored rather than being reduced the way TensorBoard’s default 1000-sample reservoir sampling would do. This has eventually led to the ClearML server running out of memory.
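For reference, our training scripts follow the standard auto-logging pattern, roughly like this (simplified sketch; the project/task names and the train_step helper are placeholders):

```python
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# auto_connect_frameworks is enabled by default, so every add_scalar()
# call below is also forwarded to the ClearML server as a scalar sample.
task = Task.init(project_name="example-project", task_name="long-training-run")

writer = SummaryWriter(log_dir="./tb_logs")
for step in range(800_000):   # roughly the number of samples per series in our runs
    loss = train_step(step)   # placeholder for the actual training step
    writer.add_scalar("train/loss", loss, global_step=step)
writer.close()
```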

Could you please confirm:

  1. Does ClearML currently perform no scalar downsampling when ingesting TensorBoard scalars?
  2. Is there a recommended way to prune or downsample scalar history after the fact, ideally via an API or script? (Pointers to the DB structure or existing scripts are welcome.)
  3. Are there plans to implement per-task resource limits (e.g., max scalars per series) automatically in the backend?

If there are any best-practices or workarounds for this scenario, please let me know!
P.S. I’m a user of ClearML (not an admin of the ClearML server), but I will forward your answers to the admins.

  
  
Posted one month ago

Answers 2


Hi @<1853245764742942720:profile|DepravedKoala88> , I don't think there is any downsampling when ingesting from TensorBoard. You can always turn off the auto-logging and log only what you want, downsampling as you go. Keep in mind the trade-off: on one hand you want to avoid bloating the server, and on the other you still want high enough granularity in your scalars.
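Something along these lines, for example (a rough, untested sketch; the names and the train_step helper are placeholders, and you can tune the reporting interval to whatever granularity you need):

```python
from clearml import Task

# Disable TensorBoard auto-logging so raw scalars are not forwarded to the server.
task = Task.init(
    project_name="example-project",
    task_name="long-training-run",
    auto_connect_frameworks={"tensorboard": False},
)
logger = task.get_logger()

REPORT_EVERY = 100  # keep ~1 out of every 100 samples, adjust as needed

for step in range(800_000):
    loss = train_step(step)  # placeholder for the actual training step
    if step % REPORT_EVERY == 0:
        logger.report_scalar(title="train", series="loss", value=loss, iteration=step)
```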

What do you think?

  
  
Posted one month ago

Hi @<1523701070390366208:profile|CostlyOstrich36> , thank you for your quick response and for confirming the current behavior.
We’ve already tried the approaches you mentioned (disabling the TensorBoard auto-logging and reducing the scalar logging frequency), and they definitely help for new experiments going forward.
However, our main challenge is with existing (“legacy”) tasks that have already logged hundreds of thousands of scalars per experiment. We have temporarily alleviated the problem by adding more RAM, but this isn’t a sustainable solution.
It would be extremely helpful to have a way to downsample or prune excessive scalars for past/existing tasks directly on the server. Is there any possibility that ClearML might implement such a feature or provide an admin tool/script for this purpose? This would be very valuable for maintenance, resource management, and scalability.
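Purely to illustrate what we have in mind, something like the following admin script would already help (an untested sketch; we are assuming the events.delete_for_task endpoint clears a task's logged events and that scalars can be re-reported to a completed task, neither of which we have verified):

```python
from clearml import Task
from clearml.backend_api.session.client import APIClient

TASK_ID = "legacy-task-id"  # placeholder for an existing task
KEEP_EVERY = 100            # keep ~1 out of every 100 samples

task = Task.get_task(task_id=TASK_ID)

# 1. Pull the existing scalar history: {title: {series: {"x": [...], "y": [...]}}}.
scalars = task.get_reported_scalars(max_samples=0)

# 2. Delete the raw events on the server.
#    ASSUMPTION: events.delete_for_task exists and removes all logged events for the task.
client = APIClient()
client.events.delete_for_task(task=TASK_ID)

# 3. Re-report a thinned-out version of each series.
logger = task.get_logger()
for title, series_dict in scalars.items():
    for series, points in series_dict.items():
        for x, y in list(zip(points["x"], points["y"]))[::KEEP_EVERY]:
            logger.report_scalar(title=title, series=series, value=y, iteration=int(x))
logger.flush()
```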
Thank you again!

  
  
Posted one month ago