So I have an HPO pipeline like this, with many modules to be optimized.
And after some time I get a picture like this, where the same hyperparameters get trained repeatedly.
My thought on a fix is to add code in each training script that fetches the parent's HPO artifact table and looks for the same hyperparameters; if they are already there, abort the task. This would fix the wasted-compute issue, but I wonder if it can be done better, like spending this compute on other hyperparameters that would otherwise be left untried.
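Roughly this sketch is what I mean (I'm assuming the HPO controller is the trial's parent task and that it publishes the tried combinations as a pandas-readable artifact named 'summary' with per-parameter columns; both the artifact name and the column names are guesses on my part):

```python
import sys

from clearml import Task

task = Task.current_task()
params = task.get_parameters()  # flat dict, e.g. {'General/x': '10', ...}

# Assumption: the HPO controller is this trial's parent and publishes
# the already-tried combinations as a pandas-readable artifact table.
parent = Task.get_task(task_id=task.parent)
summary = parent.artifacts.get('summary')  # hypothetical artifact name

if summary is not None:
    df = summary.get()  # assumed to be a pandas DataFrame of past trials
    # Hypothetical column names matching the hyperparameter section
    tried = set(map(tuple, df[['General/x', 'General/y']].astype(str).values))
    current = (str(params.get('General/x')), str(params.get('General/y')))
    if current in tried:
        print('These hyperparameters were already tried, aborting')
        task.mark_stopped()
        sys.exit(0)
```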
Also, a totally separate issue: I wonder if there is early stopping for when it's obvious that the suggested hyperparameters are suboptimal; I couldn't find anything in the docs. I know there is max_iteration_per_job, but I couldn't understand its usage from the docs either.
Hi @<1623491856241266688:profile|TenseCrab59> , can you elaborate on what you mean by spending this compute on other hparams? I think you could, in theory, check if a previous artifact file exists, and then also change the parameters & task name from within the code
@<1523701070390366208:profile|CostlyOstrich36> I mean that Optuna suggests {x=10, y=20}, for example. Then it becomes the next best result in the HPO process, and Optuna tends to suggest the very same hyperparameters again, even though the parameter space hasn't been fully explored. If I cancel trials with the same hparams, it's likely that a major part of the defined total_max_jobs will be cancelled, which renders that parameter hardly usable
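If I were driving Optuna directly, I could prune duplicates myself; a sketch of the usual pattern (plain Optuna study, not the ClearML wrapper; train_and_eval is a stand-in for the real training):

```python
import optuna


def train_and_eval(x: int, y: int) -> float:
    # Stand-in for the real training/evaluation
    return -((x - 42) ** 2 + (y - 7) ** 2)


def objective(trial: optuna.Trial) -> float:
    x = trial.suggest_int('x', 0, 100)
    y = trial.suggest_int('y', 0, 100)

    # Prune this trial if a finished trial already used identical params
    completed = trial.study.get_trials(
        deepcopy=False, states=(optuna.trial.TrialState.COMPLETE,))
    if any(t.params == trial.params for t in completed):
        raise optuna.TrialPruned()

    return train_and_eval(x, y)


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
```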
I understand. In that case you could implement some code to check if the same parameters were used before and then 'switch' to parameters that haven't been checked yet. I think it's a bit 'hacky', so I would suggest waiting for a fix from Optuna
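A rough sketch of the 'switch' idea (the sampling bounds, parameter names, and the already_tried lookup are placeholders; the lookup would go against the parent's artifact table as in the earlier sketch):

```python
import random

from clearml import Task

task = Task.current_task()


def already_tried(x: int, y: int) -> bool:
    # Placeholder: look the pair up in the parent's artifact table,
    # as in the duplicate-check sketch above
    return False


params = task.get_parameters()
x = int(params.get('General/x', 0))
y = int(params.get('General/y', 0))

while already_tried(x, y):
    # Resample until an untried combination comes up (bounds assumed)
    x, y = random.randint(0, 100), random.randint(0, 100)

# Record what was actually trained so the task reflects reality
task.set_parameters({'General/x': x, 'General/y': y})
task.set_name('{} (switched to x={}, y={})'.format(task.name, x, y))
```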
Thanks, and by the way, can you say anything about early stopping? I asked about it here. I guess it can also only be done through 'hacky' solutions?
In the HPO application I see the following explanation:
'Maximum iterations per experiment after which it will be stopped. Iterations are based on the experiments' own reporting (for example, if experiments report every epoch, then iterations=epochs)'
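So for that limit to kick in, the experiment has to report scalars with an increasing iteration number; a minimal sketch of per-epoch reporting (values are illustrative, and this assumes the script runs under a ClearML Task):

```python
from clearml import Logger

logger = Logger.current_logger()

num_epochs = 10  # illustrative

for epoch in range(num_epochs):
    loss = 1.0 / (epoch + 1)  # stand-in for a real training step
    # max_iteration_per_job counts these reported iterations, so
    # reporting once per epoch makes iterations == epochs here
    logger.report_scalar(title='loss', series='train',
                         value=loss, iteration=epoch)
```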
Well, that just didn't work for me; I set it to 1 and the experiments ran their full time anyway)