CostlyOstrich36 Here is the HyperParameterOptimizer setup:
hpo = HyperParameterOptimizer(
# Base experiment to optimize
base_task_id=base_task_id,
# Hyperparameters to tune
hyper_parameters=param_ranges,
# Objective metric
    objective_metric_title=list(opt_conf.hpo_params.objective_metric_title),
    objective_metric_series=list(opt_conf.hpo_params.objective_metric_series),
    objective_metric_sign=list(opt_conf.hpo_params.objective_metric_sign),
# Optimization strategy
optimizer_class=OptimizerOptuna,
# Execution configuration
execution_queue=opt_conf.hpo_params.execution_queue,
save_top_k_tasks_only=-1,
spawn_project=f"{opt_conf.task_params.project_name}/opt",
min_iteration_per_job=opt_conf.hpo_params.min_iteration_per_job,
max_iteration_per_job=opt_conf.hpo_params.max_iteration_per_job,
# pool_period_min=40,
# time_limit_per_job=120,
    # limit the number of concurrent experiments so we don't bombard
    # the scheduler; with an auto-scaler connected, this by proxy also
    # limits the number of machines spun up
max_number_of_concurrent_tasks=opt_conf.hpo_params.max_number_of_concurrent_tasks,
    # maximum number of jobs to launch for the optimization; default (None) is unlimited.
    # If OptimizerBOHB is used, it defines the maximum budget in terms of full jobs,
    # i.e. the cumulative number of iterations will not exceed total_max_jobs * max_iteration_per_job
total_max_jobs=opt_conf.hpo_params.total_max_jobs,
# optuna_pruner=pruner_dict.get(
# opt_conf.hpo_params.pruner
# ), # HyperbandPruner(min_resource=5, max_resource=80),
# optuna_sampler=sampler_dict.get(opt_conf.hpo_params.sampler),
)
and here are the hpo_params used:
hpo_params:
  objective-metric-title: ["HBT-KPI --- 2024-12-26 to 2025-01-12"]
  objective-metric-series: ["SR"]
  objective-metric-sign: ["max"]
  time-limit: 72000.0
  execution-queue: hpo_mmd
  min-iteration-per-job: 50
  max-iteration-per-job: 10000
  max-number-of-concurrent-tasks: 100
  total-max-jobs: 2000
  pruner: none
  sampler: none  # random
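For context, the cumulative iteration ceiling implied by these values (per the comment in the snippet above about the BOHB-style budget) works out as follows; this is just the arithmetic, not a claim about the observed stopping behavior:

```python
# Values copied from the hpo_params config above
total_max_jobs = 2000
max_iteration_per_job = 10000

# Upper bound on cumulative iterations across all child tasks
# (total_max_jobs * max_iteration_per_job, as noted in the code comment)
cumulative_iteration_budget = total_max_jobs * max_iteration_per_job
print(cumulative_iteration_budget)  # 20000000
```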
Do you see any reason for the optimization to finish before total-max-jobs is reached?
I am not even sure the issue is only that of not getting the metric. If that happens, I would expect the HPO to inject new parameters in the next iteration (since the max was not reached), but instead it stops running and completes the optimization...
Here, for instance, we had only two cases of `TypeError: 'NoneType' object is not subscriptable` (one is on line 9846 of the log), but as you can see in the picture the workers are going down.
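As a defensive sketch around that TypeError (the helper name and the nested-dict layout are assumptions, based on ClearML's `Task.get_last_scalar_metrics()` returning a `{title: {series: {"last": value}}}` dict that can be None or missing keys when the child task never logged the metric):

```python
def safe_get_last_metric(task, title, series):
    """Return the last reported value for (title, series), or None.

    Guards against subscripting None: get_last_scalar_metrics() may
    return None, or lack the title/series keys, when nothing was
    reported by the child task.
    """
    metrics = task.get_last_scalar_metrics() or {}
    return metrics.get(title, {}).get(series, {}).get("last")
```

Whether a None here should merely skip the sample or end the optimization is exactly the behavior in question.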
Hi UpsetPanda50 , are you running them on the same machine/agent? Can you please provide a full log of one run that worked and one that didn't on the same machine?
And here is how the error appears: it is trying to fetch a metric that was never logged.
Hi CostlyOstrich36 , it's a pool of machines. I have attached two logs.