Answered

Hello everyone. I don't understand why my training is slower with TensorBoard connected than without it. I have some thoughts about it, but I'm not sure. My internet traffic looks weird. I think this is because TensorBoard logs too much data on each batch and ClearML sends it to the server. How can I fix it? My training speed decreased by 5-6 times.

  
  
Posted 2 years ago

Answers 20


Hi ComfortableShark77 , I suspect you are correct. Can you try turning off the TensorBoard framework connection in your Task.init() call, using the argument auto_connect_frameworks={"tensorboard": False}, to make sure this is the cause?

  
  
Posted 2 years ago

Hi SuccessfulKoala55 , I already tested it. Training is much faster without TensorBoard.

  
  
Posted 2 years ago

"My internet traffic looks weird. I think this is because TensorBoard logs too much data on each batch and ClearML sends it to the server. How can I fix it? My training speed decreased by 5-6 times."

BTW: ComfortableShark77 the network data is sent by a background process, so it should not affect the processing time, no?

  
  
Posted 2 years ago

AgitatedDove14 Well, then I have no idea why learning is so slow with TensorBoard

  
  
Posted 2 years ago

[two attached screenshots]

  
  
Posted 2 years ago

Could it be the model storing? Could it be that the peak is at the end of the epoch?

  
  
Posted 2 years ago

(this is the part that is not in the background, so if the epoch is short it might have an effect)

  
  
Posted 2 years ago

frameworks = {'tensorboard': False, 'pytorch': False}
task = Task.init(
    project_name="train_pipeline",
    task_name="test_train_python",
    task_type=TaskTypes.training,
    auto_connect_frameworks=frameworks,
)

  
  
Posted 2 years ago

model is resnet18

  
  
Posted 2 years ago

the compute time for each batch is about the same

  
  
Posted 2 years ago

Could you try this one:
frameworks = {'tensorboard': True, 'pytorch': False}
This would log the TensorBoard output (in the background), but skip model registration (which is serial)

  
  
Posted 2 years ago

With this setting I get the slow learning speed, but if I use the setting I sent earlier then the learning speed is normal

  
  
Posted 2 years ago

What's the OS / Python version?

  
  
Posted 2 years ago

OS: Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-glibc2.29 (Ubuntu 20.04 LTS)
Python version: 3.8.10

  
  
Posted 2 years ago

Hmm, I wonder, can you try with this line before?
Task._report_subprocess_enabled = False
frameworks = {'tensorboard': True, 'pytorch': False}
Task.init(...)

  
  
Posted 2 years ago

it works

  
  
Posted 2 years ago

What does this line do?

  
  
Posted 2 years ago

Okay, so the way it works is that it moves all the logging to a background process. But if you have a lot of data, pushing that data between Python processes is not very efficient. This line tells it to just use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support PyTorch workers that spawn a lot of subprocesses: we do not want to add a thread per process and increase the time it takes to spin them up
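A minimal stdlib-only sketch of the thread-vs-process trade-off described above (this is an illustration, not ClearML's actual internals): handing payloads to a consumer thread shares memory, while a multiprocessing.Queue must pickle and pipe every item between processes, which adds up when you report a lot of data per batch.

```python
import multiprocessing as mp
import queue
import threading
import time

N = 5_000
PAYLOAD = b"x" * 4096  # stand-in for one batch of logged metrics/images


def drain(q, n):
    # Consumer: receive exactly n items, like a reporter flushing to a server.
    for _ in range(n):
        q.get()


def thread_consumer(q, n):
    t = threading.Thread(target=drain, args=(q, n))
    t.start()
    return t


def process_consumer(q, n):
    p = mp.Process(target=drain, args=(q, n))
    p.start()
    return p


def timed_send(q, start_consumer, n=N):
    """Time producing n payloads and waiting until the consumer drains them."""
    worker = start_consumer(q, n)
    start = time.perf_counter()
    for _ in range(n):
        q.put(PAYLOAD)  # the "report" call on the training loop's side
    worker.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    t_thread = timed_send(queue.Queue(), thread_consumer)
    t_proc = timed_send(mp.Queue(), process_consumer)
    print(f"thread queue:  {t_thread:.3f}s")
    print(f"process queue: {t_proc:.3f}s")  # pays pickling + pipe transfer per item
```

On most machines the process-backed queue is noticeably slower for many small items, which is why falling back to a background thread can restore training speed when the reporting volume is high.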

  
  
Posted 2 years ago

Thanks for the help!

  
  
Posted 2 years ago