Answered

Hello everyone. I don't understand why my training is slower with TensorBoard connected than without it. I have some thoughts about it, but I'm not sure. My internet traffic looks weird. I think this is because TensorBoard logs too much data on each batch and ClearML sends it to the server. How can I fix it? My training speed decreased by 5-6 times.

  
  
Posted 2 years ago

Answers 20


Hi ComfortableShark77 , I suspect you are correct. Can you try turning off the TensorBoard framework connection in your Task.init() call, using the argument auto_connect_frameworks={"tensorboard": False}, to make sure this is the cause?

  
  
Posted 2 years ago

Hi SuccessfulKoala55 , I already tested it. Training is much faster without TensorBoard.

  
  
Posted 2 years ago

"My internet traffic looks weird. I think this is because TensorBoard logs too much data on each batch and ClearML sends it to the server. How can I fix it? My training speed decreased by 5-6 times."

BTW: ComfortableShark77 the network data is sent by a background process, so it should not affect the processing time, no?

  
  
Posted 2 years ago

AgitatedDove14 Well, then I have no idea why learning is so slow with TensorBoard

  
  
Posted 2 years ago

[two attached screenshots]

  
  
Posted 2 years ago

Could it be the model storing? Could it be that the peak is at the end of the epoch?

  
  
Posted 2 years ago

(this is the part that is not in the background, so if the epoch is short it might have an effect)

  
  
Posted 2 years ago

frameworks = {'tensorboard': False, 'pytorch': False}
task = Task.init(
    project_name="train_pipeline",
    task_name="test_train_python",
    task_type=TaskTypes.training,
    auto_connect_frameworks=frameworks,
)

  
  
Posted 2 years ago

model is resnet18

  
  
Posted 2 years ago

the compute time for each batch is about the same

  
  
Posted 2 years ago

Could you try this one:
frameworks = {'tensorboard': True, 'pytorch': False}
This would log the TensorBoard output (in the background), but skip model registration (which is serial)

  
  
Posted 2 years ago

With this setting I get the slow learning speed, but if I use the setting I sent earlier then the learning speed is normal

  
  
Posted 2 years ago

What's the OS / Python version?

  
  
Posted 2 years ago

OS: Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-glibc2.29 (Ubuntu 20.04 LTS)
Python version: 3.8.10

  
  
Posted 2 years ago

Hmm, I wonder, can you try with this line before?
Task._report_subprocess_enabled = False
frameworks = {'tensorboard': True, 'pytorch': False}
Task.init(...)

  
  
Posted 2 years ago

it works

  
  
Posted 2 years ago

What does this line do?

  
  
Posted 2 years ago

Okay, so the way it works is that it moves all the logging to a background process. But if you have a lot of data, pushing that data between Python processes is not very efficient. This line tells it to just use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support PyTorch workers that spawn a lot of subprocesses: we do not want to add a thread per process and increase the time it takes to spin them up
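A minimal stdlib-only sketch of the thread-vs-process trade-off described above (this is an illustration, not ClearML's actual internals): handing payloads to a consumer thread shares memory, while a multiprocessing.Queue must pickle and pipe every item between processes, which adds up when you report a lot of data per batch.

```python
import multiprocessing as mp
import queue
import threading
import time

N = 5_000
PAYLOAD = b"x" * 4096  # stand-in for one batch of logged metrics/images


def drain(q, n):
    # Consumer: receive exactly n items, like a reporter flushing to a server.
    for _ in range(n):
        q.get()


def thread_consumer(q, n):
    t = threading.Thread(target=drain, args=(q, n))
    t.start()
    return t


def process_consumer(q, n):
    p = mp.Process(target=drain, args=(q, n))
    p.start()
    return p


def timed_send(q, start_consumer, n=N):
    """Time producing n payloads and waiting until the consumer drains them."""
    worker = start_consumer(q, n)
    start = time.perf_counter()
    for _ in range(n):
        q.put(PAYLOAD)  # the "report" call on the training loop's side
    worker.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    t_thread = timed_send(queue.Queue(), thread_consumer)
    t_proc = timed_send(mp.Queue(), process_consumer)
    print(f"thread queue:  {t_thread:.3f}s")
    print(f"process queue: {t_proc:.3f}s")  # pays pickling + pipe transfer per item
```

On most machines the process-backed queue is noticeably slower for many small items, which is why falling back to a background thread can restore training speed when the reporting volume is high.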

  
  
Posted 2 years ago

Thanks for the help!

  
  
Posted 2 years ago