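from clearml import Task, TaskTypes

# Turn off the automatic TensorBoard and PyTorch framework connections
frameworks = {'tensorboard': False, 'pytorch': False}
task = Task.init(
    project_name="train_pipeline",
    task_name="test_train_python",
    task_type=TaskTypes.training,
    auto_connect_frameworks=frameworks,
)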
Hmm, I wonder, can you try adding this line before the Task.init call?
Task._report_subprocess_enabled = False
frameworks = {'tensorboard': True, 'pytorch': False}
Task.init(...)
Okay, so the way it works is that it moves all the logging to a background process. But if you have a lot of data, actually pushing the data between Python processes is not very efficient. This line basically tells it to use a background thread (instead of a background process) for sending the data to the server.
The idea behind using a background process in the first place is to better support PyTorch workers that spin up a lot of subprocesses; we do not want to add a thread per process and increase the time it takes to spin them up.
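To illustrate why handing data to a background process costs more than handing it to a background thread, here is a minimal, generic sketch using only the standard library. It is not ClearML's actual implementation; the `payload` dictionary is a hypothetical stand-in for one batch of reported values, and `pickle` stands in for the serialization that `multiprocessing` queues and pipes perform internally.

```python
import pickle
import queue

# Hypothetical stand-in for one batch worth of values reported to the logger.
payload = {"step": 1, "values": list(range(100_000))}

# Background thread: queue.Queue hands over a reference, nothing is copied.
thread_queue = queue.Queue()
thread_queue.put(payload)
received_in_thread = thread_queue.get()
same_object = received_in_thread is payload

# Background process: the data must be serialized before it crosses the
# process boundary and rebuilt on the other side, which produces a copy.
serialized = pickle.dumps(payload)
received_in_process = pickle.loads(serialized)
copied_object = received_in_process is not payload

print("thread hand-off shares the object:", same_object)
print("process hand-off copies the object:", copied_object)
print("bytes serialized per report:", len(serialized))
```

The serialization cost is paid on every report, so it grows with the amount of data logged per batch, while the thread hand-off stays constant.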
My internet traffic looks weird. I think this is because TensorBoard logs too much data on each batch and ClearML sends it to the server. How can I fix it? My training speed decreased by 5-6 times.
BTW: ComfortableShark77 the network traffic is sent in a background process, so it should not affect the processing time, no?
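One possible mitigation, which is an assumption and not something confirmed in this thread, is to report to TensorBoard only every N batches instead of on every batch. The sketch below is self-contained: `ThrottledScalarLogger` and `CountingWriter` are hypothetical names, and `CountingWriter` only stands in for a real writer exposing `add_scalar(tag, value, step)`.

```python
class ThrottledScalarLogger:
    """Forward add_scalar calls to a writer only every `every_n` steps."""

    def __init__(self, writer, every_n=50):
        if every_n < 1:
            raise ValueError("every_n must be >= 1")
        self.writer = writer
        self.every_n = every_n

    def add_scalar(self, tag, value, step):
        # Report only on steps that are multiples of every_n; skip the rest.
        if step % self.every_n == 0:
            self.writer.add_scalar(tag, value, step)
            return True
        return False


class CountingWriter:
    """Hypothetical stand-in for a TensorBoard writer; records each call."""

    def __init__(self):
        self.calls = []

    def add_scalar(self, tag, value, step):
        self.calls.append((tag, value, step))


writer = CountingWriter()
logger = ThrottledScalarLogger(writer, every_n=50)
for step in range(200):
    logger.add_scalar("train/loss", 1.0 / (step + 1), step)

print("reports sent:", len(writer.calls))
```

With 200 batches and `every_n=50`, only steps 0, 50, 100 and 150 reach the writer, so the volume of data to upload drops accordingly.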
OS: Linux-5.10.60.1-microsoft-standard-WSL2-x86_64-with-glibc2.29 Ubuntu 20.04 LTS
python_version: 3.8.10
With this setting the learning speed is slow, but with the setting I sent earlier the learning speed is normal.
Could you try this one: frameworks = {'tensorboard': True, 'pytorch': False}
This would log TensorBoard (in the background), but without model registration (i.e. the part that runs serially).
Hi ComfortableShark77, I suspect you are correct. Can you try turning off the TensorBoard framework connection in your Task.init() call using the argument auto_connect_frameworks={"tensorboard": False} to make sure this is the cause?
Could it be the model storing? Could it be that the peak is at the end of the epoch?
Hi SuccessfulKoala55, I already tested it. Training is much faster without TensorBoard.
(this is the part that is not in the background, so if the epoch is short it might have an effect)
AgitatedDove14 Well, then I have no idea why learning is so slow with TensorBoard.
the compute time for each batch is about the same