Oh, that makes sense. So basically you were bombarding the server with requests and ending up with a kind of denial of service.
How much free RAM / disk do you have there now? How's the CPU utilization? How many Tasks are working with this machine at the same time?
Currently running it on a t3.xlarge
which has 4 CPUs, 16GB RAM and 300GB SSD
For an update 🙂
I think we identified that when moving from a training dataset to a fine-tuning dataset (which was 1/1000th the size), our training script was still set to upload every epoch. It seems this resulted in a torrent of metrics being uploaded.
Since modifying this to be less frequent, we have seen the index latency drop dramatically.
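For anyone hitting the same thing, here is a minimal sketch of the throttling idea, not our actual training script. `make_throttled_reporter`, `report_fn` and `every_n_epochs` are hypothetical names for illustration; `report_fn` stands in for whatever call sends the metric to the server.

```python
from typing import Callable, List, Tuple


def make_throttled_reporter(
    report_fn: Callable[[int, float], None],
    every_n_epochs: int,
) -> Callable[[int, float], bool]:
    """Wrap report_fn so it only fires on every n-th epoch.

    report_fn is a placeholder for the real upload call (assumption).
    """
    if every_n_epochs < 1:
        raise ValueError("every_n_epochs must be >= 1")

    def report(epoch: int, value: float) -> bool:
        # Forward the metric only when the epoch is a multiple of n.
        if epoch % every_n_epochs == 0:
            report_fn(epoch, value)
            return True
        return False

    return report


# Stand-in for the server: collect what would have been uploaded.
sent: List[Tuple[int, float]] = []
report = make_throttled_reporter(
    lambda epoch, value: sent.append((epoch, value)),
    every_n_epochs=100,
)

# With a tiny dataset, epochs are short, so 1000 of them go by quickly.
for epoch in range(1000):
    fake_loss = 1.0 / (epoch + 1)  # made-up value, for illustration only
    report(epoch, fake_loss)

print(f"uploaded {len(sent)} of 1000 metric points")
```

Picking `every_n_epochs` roughly in proportion to how much smaller the new dataset is should keep the upload rate close to what it was before the switch.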
Hi @<1590152178218045440:profile|HarebrainedToad56>
Yes, you are correct: all TB logs are stored in the ELK in the clearml backend. This scales really well and rarely has issues, as long as the clearml-server is running on a strong enough machine, of course. How much RAM / HD do you have on the clearml-server?