We are running it locally on a self-hosted server with a single 3080 Ti. The ClearML server and worker are on the same machine.
It looks like you're on a self-hosted server. The community server is app.clear.ml, where you can just sign up and don't have to maintain your own server 🙂
@<1541592204353474560:profile|GhastlySeaurchin98> , I think this is more related to how Optuna works: it aborts the experiment. I think you would need to modify something in order for it to run the way you want.
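For context, here is a rough sketch of the kind of rule that makes an HPO algorithm abort a task early. It is a simplified median rule in plain Python, similar in spirit to Optuna's `MedianPruner` but not Optuna's or ClearML's actual code. The names `should_prune`, `history`, and `min_trials` are made up for illustration, and it assumes the objective is being minimized.

```python
from statistics import median


def should_prune(history, current_value, step, min_trials=2):
    """Return True if a minimized metric is worse than the peer median at `step`.

    history: per-trial lists of intermediate values reported by earlier trials.
    (Hypothetical helper for illustration, not an Optuna or ClearML API.)
    """
    # Collect what earlier trials reported at the same step.
    peers = [values[step] for values in history if len(values) > step]
    if len(peers) < min_trials:
        return False  # not enough evidence yet, let the trial keep running
    return current_value > median(peers)


finished = [[1.0, 0.6, 0.4], [0.9, 0.5, 0.3], [1.2, 0.8, 0.7]]
print(should_prune(finished, current_value=0.95, step=1))  # True: worse than median 0.6
print(should_prune(finished, current_value=0.55, step=1))  # False: better than median 0.6
```

A trial that looks worse than its peers at the same step gets stopped before it finishes, which would show up as an aborted task rather than a failed one.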
@<1523701070390366208:profile|CostlyOstrich36> yeah, then we are definitely on self-hosted.
@<1541954614918647808:profile|EmaciatedCormorant16> FYI
@<1523701205467926528:profile|AgitatedDove14> yeah, just checked my code and here's a snippet which uses EarlyStopping:
history = regressor.fit(
    x=train_generator,
    validation_data=test_generator,
    epochs=epochs,
    callbacks=[
        board,
        LearningRateScheduler(scheduler),
        EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
    ],
)
But does it mean it will abort the whole experiment? It's a problem since we are fitting multiple models in a single experiment.
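As far as I know, Keras' `EarlyStopping` only ends the `fit` call it is attached to (it sets `model.stop_training`), so the next model in the same script still gets trained. A minimal plain-Python sketch of that behaviour, with a hypothetical `PatienceStopper` class and `fit` function standing in for the Keras callback and training loop (this is not the Keras implementation, and the loss values are invented):

```python
class PatienceStopper:
    """Hypothetical stand-in for a patience-based early-stopping callback."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def should_stop(self, loss):
        # An improvement resets the counter; otherwise count stalled epochs.
        if loss < self.best:
            self.best = loss
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def fit(losses, patience):
    """Run one training loop and return the number of epochs actually run."""
    stopper = PatienceStopper(patience)  # fresh state for every model
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch  # only this loop ends, the script keeps going
    return len(losses)


model_a = [1.0, 0.9, 0.95, 0.96, 0.97, 0.5]  # stalls after epoch 2
model_b = [1.0, 0.8, 0.6, 0.4, 0.3]          # keeps improving
epochs_run = [fit(model_a, patience=2), fit(model_b, patience=2)]
print(epochs_run)  # [4, 5]
```

The first loop stops early at epoch 4 and the second one still runs all 5 epochs. Under that assumption, an "aborted" status on the whole task would have to come from something outside the training loop, such as the HPO controller.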
Hi @<1541592204353474560:profile|GhastlySeaurchin98>
During our first large hyperparameter run, we noticed that some tasks get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algorithm are you using?
We are using the community version, I believe, since we are not paying.
@<1523701070390366208:profile|CostlyOstrich36> thanks for your insight. Then we will probably check Optuna's code and see if we can eliminate this behaviour. If not, we are gonna rethink our workflow.
Did you mean this by "own server or community"?
Hi @<1541592204353474560:profile|GhastlySeaurchin98> , how are you running the experiments - which type of machines - local or cloud? Are you running your own server or using the community?