Hi @<1541592204353474560:profile|GhastlySeaurchin98> , how are you running the experiments - which type of machines - local or cloud? Are you running your own server or using the community?
We are running it locally on a self-hosted server with a single 3080 Ti. The ClearML server and worker are on the same machine.
We are using the community version, I believe, since we are not paying.
Is that what you meant by "own server or community"?
It looks like you're on a self-hosted server; the community server is app.clear.ml, where you can just sign up and don't have to maintain your own server 🙂
Hi @<1541592204353474560:profile|GhastlySeaurchin98>
During our first large hyperparameter run, we have noticed that some tasks get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algo are you using?
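For context, in a ClearML HPO run it is the optimizer (the controller task) that decides to abort under-performing trials, not your training script. A minimal sketch of such a controller, assuming the Optuna backend; the base task ID, parameter names, metric names and queue below are placeholders:

# Sketch of a ClearML HPO controller using the Optuna backend (placeholder values).
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer,
    UniformParameterRange,
    DiscreteParameterRange,
)
from clearml.automation.optuna import OptimizerOptuna

# Controller task that orchestrates the search
task = Task.init(
    project_name="HPO",
    task_name="hpo controller",
    task_type=Task.TaskTypes.optimizer,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<base-training-task-id>",  # placeholder: the task to clone per trial
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-2),
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="validation",   # scalar the optimizer monitors
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=OptimizerOptuna,       # Optuna decides when to abort (prune) trials
    execution_queue="default",
    max_number_of_concurrent_tasks=1,
    max_iteration_per_job=100,
    total_max_jobs=50,
)

optimizer.start()
optimizer.wait()
optimizer.stop()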
@<1523701070390366208:profile|CostlyOstrich36> yeah, then we are definitely on self-hosted.
@<1523701205467926528:profile|AgitatedDove14> yeah, just checked my code and here's a snippet which uses EarlyStopping:
# Keras callbacks used during training (assuming tf.keras imports)
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler

history = regressor.fit(
    x=train_generator,
    validation_data=test_generator,
    epochs=epochs,
    callbacks=[
        board,
        LearningRateScheduler(scheduler),
        # stop training when the training loss plateaus for 20 epochs
        EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
    ],
)
But does it mean it will abort the whole experiment? It's a problem since we are fitting multiple models in a single experiment.
@<1541954614918647808:profile|EmaciatedCormorant16> FYI
@<1541592204353474560:profile|GhastlySeaurchin98> , I think this is more related to how Optuna works: it is Optuna that aborts the experiment. You would probably need to modify something for it to run the way you want.
@<1523701070390366208:profile|CostlyOstrich36> thanks for your insight. Then we will probably check Optuna's code and see if we can eliminate this behaviour; if not, we are gonna rethink our workflow.
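For reference, the abort comes from Optuna's trial pruning. If you end up driving Optuna directly rather than through the ClearML wrapper, that behaviour can be switched off by passing a no-op pruner when the study is created; a minimal sketch, where train_and_evaluate is a hypothetical stand-in for your training loop:

# Sketch: disabling Optuna's trial pruning with a no-op pruner.
# Only applies if you create the study yourself; train_and_evaluate is hypothetical.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)
    # ... train the model and return the validation loss ...
    return train_and_evaluate(lr)  # hypothetical helper

study = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.NopPruner(),  # never prunes, so every trial runs to completion
)
study.optimize(objective, n_trials=50)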