however if I want multiple machines syncing with the optimizer, for pulling the sampled hyper parameters and reporting results, I can't see how it would work
I have to admit, this is where I'm loosing you.
I thought you wanted to avoid the agent, since you wanted to run everything locally, wasn't that the issue ?
Maybe there is some background missing here, let me see if I can explain how the optimizer works.
In your actual training code you have something like:params = {'lr': 0.3, 'key': 'option1'} task.connect(params) ... Logger.report_scalars(title='loss', series='l1', value=...)
The values could also be coming from argparser, but the concept is the same. Or TB reporting instead of using the report_scalars.
2. When running the optimizer you have to provide two things:
a. The scalar we are trying to optimize. In this example title='loss', series='l1'
b. The arguments we will change and the sampling range. For example General/lr
[0.01, 1.0, 0.02]
3. The optimizer (assuming active one and not randome/grid) Optuna for example, will sample new General/lr
values based on the reported
title='loss', series='l1' ` of the training code for us.
This is done automagically! Meaning:
The optimizer clones a Task, and changes the configuration/hyper-parameters (the effect is that task.connect when executed by the agent is now not storing the dict, but updating the dict from the backend). Then the optimizer launches the Task and actively in realtime pulls the scalars your training code reports (via the logger or TB). Finally the optimize can shutdown the training on the remote machine automatically and launch a new one.
Make sense ?