And here is how the error appears: it is trying to get a metric that was not logged.
Hi @<1790190274475986944:profile|UpsetPanda50>, are you running them on the same machine/agent? Can you please provide a full log of one run that worked and one that didn't on the same machine?
Here, for instance, we had only two cases of TypeError: 'NoneType' object is not subscriptable; one is on line 9846. But as you can see in the pic, the workers are going down.
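In case it helps narrow this down: that NoneType error usually means the optimizer looked up a scalar the task never reported. A minimal pre-flight check, assuming the base_task_id from the setup below and the title/series strings from the config (get_last_scalar_metrics is the ClearML accessor for a task's reported scalars):
from clearml import Task

base_task = Task.get_task(task_id=base_task_id)  # base_task_id assumed from the HPO setup below
scalars = base_task.get_last_scalar_metrics()  # {title: {series: {"last": ..., "min": ..., "max": ...}}}

title = "HBT-KPI --- 2024-12-26 to 2025-01-12"
series = "SR"
if series not in scalars.get(title, {}):
    raise RuntimeError(f"Base task never logged '{title}/{series}'; the optimizer has nothing to read")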
@<1523701070390366208:profile|CostlyOstrich36> Here is the HyperParameterOptimizer setup
from clearml.automation import HyperParameterOptimizer
from clearml.automation.optuna import OptimizerOptuna

hpo = HyperParameterOptimizer(
    # Base experiment to optimize
    base_task_id=base_task_id,
    # Hyperparameters to tune
    hyper_parameters=param_ranges,
    # Objective metric(s)
    objective_metric_title=list(opt_conf.hpo_params.objective_metric_title),
    objective_metric_series=list(opt_conf.hpo_params.objective_metric_series),
    objective_metric_sign=list(opt_conf.hpo_params.objective_metric_sign),
    # Optimization strategy
    optimizer_class=OptimizerOptuna,
    # Execution configuration
    execution_queue=opt_conf.hpo_params.execution_queue,
    save_top_k_tasks_only=-1,
    spawn_project=f"{opt_conf.task_params.project_name}/opt",
    min_iteration_per_job=opt_conf.hpo_params.min_iteration_per_job,
    max_iteration_per_job=opt_conf.hpo_params.max_iteration_per_job,
    # pool_period_min=40,
    # time_limit_per_job=120,
    # Limit the number of concurrent experiments; this in turn makes sure we
    # don't bombard the scheduler with experiments. If an auto-scaler is
    # connected, this, by proxy, will also limit the number of machines.
    max_number_of_concurrent_tasks=opt_conf.hpo_params.max_number_of_concurrent_tasks,
    # Maximum number of jobs to launch for the optimization, default (None) unlimited.
    # If OptimizerBOHB is used, it defines the maximum budget in terms of full jobs,
    # i.e. the cumulative number of iterations will not exceed
    # total_max_jobs * max_iteration_per_job.
    total_max_jobs=opt_conf.hpo_params.total_max_jobs,
    # optuna_pruner=pruner_dict.get(
    #     opt_conf.hpo_params.pruner
    # ),  # e.g. HyperbandPruner(min_resource=5, max_resource=80)
    # optuna_sampler=sampler_dict.get(opt_conf.hpo_params.sampler),
)
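For reference, this is roughly how it is then launched and monitored (a sketch; set_report_period, start, wait, get_top_experiments, and stop are standard HyperParameterOptimizer methods, while the callback body and the top_k value are illustrative):
def on_job_complete(job_id, objective_value, objective_iteration, job_parameters, top_performance_job_id):
    # Called once per finished trial; handy for spotting trials that ended without a metric
    print(f"trial {job_id} finished, objective={objective_value}")

hpo.set_report_period(10)  # status report every 10 minutes (illustrative value)
hpo.start(job_complete_callback=on_job_complete)
hpo.wait()  # blocks until the optimizer decides it is done (or total_max_jobs is exhausted)
top_tasks = hpo.get_top_experiments(top_k=5)
hpo.stop()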
and here are the hpo_params used
hpo_params:
objective-metric-title: ["HBT-KPI --- 2024-12-26 to 2025-01-12"]
objective-metric-series: ["SR"]
objective-metric-sign: ["max"]
time-limit: 72000.0
execution-queue: hpo_mmd
min-iteration-per-job: 50
max-iteration-per-job: 10000
max-number-of-concurrent-tasks: 100
total-max-jobs: 2000
pruner: none
sampler: none #random
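On the training side, each trial has to report a scalar whose title and series match these strings exactly, otherwise the optimizer finds nothing to read. A minimal sketch, where sr_value and iteration are hypothetical stand-ins for whatever the simulation computes:
from clearml import Logger

# Title/series must match objective-metric-title / objective-metric-series above
Logger.current_logger().report_scalar(
    title="HBT-KPI --- 2024-12-26 to 2025-01-12",
    series="SR",
    value=sr_value,       # hypothetical: the KPI computed by the simulation
    iteration=iteration,  # hypothetical: the current iteration counter
)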
Do you see any reason for the optimization to finish before total-max-jobs is reached?
I am not even sure the issue is only the missing metric. If that happens, I would expect the HPO to inject new parameters in the next iteration (since the max was not reached), but instead it stops running and completes the optimization...
Hi Nathan. The error was caused by an internal bug in our simulation; after that fix everything was OK. But we still have the problem that when any trial fails, it brings down the whole simulation: the pipeline stops creating new trials and the simulation stops. The same happens when even a single worker crashes (maybe from running out of free space or a network problem): the main pipeline receives a failed trial, after which it does not create any more trials, and all the simulations start to die because there are no new trials.
I've recently run into this error myself. Did you find any resolution?