OK, I will look into agents and think about this. One pain point we have is that TensorBoard logs are stuck on the machine used for training, so we can’t compare models trained on two different machines in one TensorBoard (unless both machines mount the same network filesystem). It is also important to be able to see TB both during training and after it has finished (and even though the log files are large, storage is cheap, so it would probably be OK to keep them around). I need to think about the best way to organize this though. For instance, maybe we should write the logs directly to S3? We would still need some system for keeping track of where exactly they are and for launching TensorBoard instances to show a given set of logs.
Hi LivelyLion31, I missed your S3 question, apologies. What did you guys end up doing?
BTW, you could always upload the entire TB log folder as an artifact. It's simple: task.upload_artifact('tensorboard', './tblogsfolder')
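For reference, here is a minimal sketch of that call wrapped in a small helper. The helper name upload_tb_logs and its artifact_name parameter are made up for illustration; the only Trains call it relies on is upload_artifact(name, artifact_object). As far as I know, Trains zips a folder before uploading it, but treat that as an assumption and check it against your version.

```python
import os


def upload_tb_logs(task, log_dir, artifact_name="tensorboard"):
    """Upload a TensorBoard log folder as an artifact of `task`.

    `task` is expected to be a Trains Task (e.g. Task.current_task()).
    Any object exposing upload_artifact(name, artifact_object) also works,
    which keeps this helper easy to try out without a server.
    """
    if not os.path.isdir(log_dir):
        # Fail early instead of uploading a path that does not exist.
        raise FileNotFoundError("TensorBoard log folder not found: " + log_dir)
    return task.upload_artifact(name=artifact_name, artifact_object=log_dir)


# Possible use at the end of a training script (needs the trains package;
# project and task names below are placeholders):
#   from trains import Task
#   task = Task.init(project_name="my project", task_name="my experiment")
#   ... train, writing TB event files to ./tblogsfolder ...
#   upload_tb_logs(task, "./tblogsfolder")
```

Calling it once after training ends keeps the upload to a single archive per experiment, which matters because the event files can get large.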
We haven’t done anything about it yet. But we are planning to try out a few experiment management systems soon, including Trains.
Hi LivelyLion31
Yes, we designed Trains with an automagic integration for exactly that reason: users do not need to learn another package, and with almost no effort you get most of the benefits.
Regarding the TB files: from experience, most users look at the TB files shortly after they execute the experiment, usually for debugging and for the in-depth capabilities (like the network debugger, profile, etc.). Viewing metrics is much easier to do on a centralized server (like the Trains-Server).
So we could not find good use cases for constantly storing the TB protobuf files on the backend (they are extremely large!).
That said, you can always upload the TB protobuf as an artifact at the end of the experiment: Task.current_task().upload_artifact('tensorboard', '/tmp/my.tensorboard_file/pb')
If you guys feel that spinning up a TB instance serving all the TensorBoard logs is something you will use, you can quickly write a script that does just that and launch it with trains-agent. There is a nice example of using trains-agent to spin up a Jupyter notebook server that can serve as a good reference:
https://github.com/allegroai/trains/blob/master/examples/execute_jupyter_notebook_server.py
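To make the TB-serving idea a bit more concrete, here is a rough sketch of what such a script could do once the uploaded logs are available locally. It is not taken from the linked example. The function names collect_runs and tensorboard_command are hypothetical, and the sketch assumes each experiment's logs arrive either as a zip archive or as an already extracted folder. It puts every experiment in its own subfolder of one parent directory, because TensorBoard shows each subfolder of --logdir as a separate run, which is what allows comparing models trained on different machines.

```python
import os
import shutil
import zipfile


def collect_runs(sources, parent_dir):
    """Gather several TB log sets under one parent directory.

    `sources` maps a run name to a local path that is either a zip archive
    of a TB log folder or an already extracted folder. Each run ends up in
    parent_dir/<run name>, which must not exist yet for folder sources.
    """
    for run_name, src in sources.items():
        run_dir = os.path.join(parent_dir, run_name)
        if os.path.isdir(src):
            shutil.copytree(src, run_dir)
        else:
            os.makedirs(run_dir, exist_ok=True)
            with zipfile.ZipFile(src) as zf:
                zf.extractall(run_dir)
    return parent_dir


def tensorboard_command(logdir, port=6006):
    """Build the command line that serves `logdir` on all interfaces."""
    return ["tensorboard", "--logdir", logdir,
            "--port", str(port), "--host", "0.0.0.0"]


# Possible use (paths and run names are placeholders):
#   import subprocess
#   logdir = collect_runs({"machine_a": "/tmp/a.zip",
#                          "machine_b": "/tmp/b.zip"}, "/tmp/tb_runs")
#   subprocess.call(tensorboard_command(logdir))
```

How the archives get to the local disk is left open here on purpose; with Trains it would presumably go through the artifacts of each experiment, but the exact calls depend on the version you are running.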