Hi GhastlySeaurchin98, how are you running the experiments? On which type of machines, local or cloud? Are you running your own server or using the community one?
It looks like you're on a self-hosted server. The community server is app.clear.ml, where you can just sign up and don't have to maintain your own server 🙂
GhastlySeaurchin98, I think this is more related to how Optuna works: it aborts the experiment. I think you would need to modify something in order for it to run the way you want
CostlyOstrich36 yeah, then we are definitely on self-hosted.
During our first large hyperparameter run, we noticed that some tasks get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algo are you using?
CostlyOstrich36 thanks for your insight. then we will probably check Optuna's code and see if we can eliminate this behaviour, if not we are gonna rethink our workflow
We are using the community version, I believe, since we are not paying.
We are running it locally on a self-hosted server with a single 3080Ti. The ClearML server and worker are on the same machine
did you mean this by "own server or community"?
AgitatedDove14 yeah, just checked my code and here's a snippet which uses EarlyStopping:
history = regressor.fit(
    x=train_generator,
    validation_data=test_generator,
    epochs=epochs,
    callbacks=[
        board,
        LearningRateScheduler(scheduler),
        EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
    ],
)
But does it mean it will abort the whole experiment? It's a problem since we are fitting multiple models in a single experiment.
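On that last question: as far as I understand it, the Keras EarlyStopping callback only ends the fit() call it was passed to, so on its own it should not abort the whole experiment. The aborted tasks are more likely coming from the HPO side. Here is a rough plain-Python sketch of the patience idea, just to illustrate that stopping one training loop leaves the next one free to run. It is a simplified stand-in, not the Keras implementation; the function name fit_with_early_stopping and the loss values are made up for the example.

```python
def fit_with_early_stopping(losses, patience):
    """Simulate one training run over a list of per-epoch losses.

    Returns the number of epochs actually run. Simplified stand-in for
    patience-based early stopping; this is NOT the Keras code.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            # Loss improved: remember it and reset the patience counter.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                # Stop only this training run; the caller keeps going.
                return epoch
    return len(losses)


# Hypothetical loss curves for two models fitted in the same script.
runs = {
    "model_a": [1.0, 0.9, 0.95, 0.96, 0.97, 0.5],  # stalls after epoch 2
    "model_b": [1.0, 0.8, 0.6, 0.4],               # keeps improving
}

epochs_run = {
    name: fit_with_early_stopping(losses, patience=3)
    for name, losses in runs.items()
}
print(epochs_run)
```

With these made-up numbers the first run stops early and the second still runs to completion, which is the behaviour you would expect when fitting several models one after another in a single script.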