Hi @<1541592204353474560:profile|GhastlySeaurchin98> , how are you running the experiments - which type of machines - local or cloud? Are you running your own server or using the community?
We are running it locally on a self-hosted server with a single 3080 Ti. The ClearML server and worker are on the same machine.
We are using the community version, I believe, since we are not paying.
Is that what you meant by "own server or community"?
During our first large hyperparameter run, we noticed that some tasks get aborted with the following console log:
This looks like the HPO algorithm doing early stopping. Which algorithm are you using?
It looks like you're on a self-hosted server. The community server is app.clear.ml, where you can just sign up and don't have to maintain your own server 🙂
@<1523701070390366208:profile|CostlyOstrich36> yeah, then we are definitely on self-hosted.
I am using
@<1523701205467926528:profile|AgitatedDove14> yeah, just checked my code and here's a snippet which uses
history = regressor.fit(
    x=train_generator,
    validation_data=test_generator,
    epochs=epochs,
    callbacks=[
        board,
        LearningRateScheduler(scheduler),
        EarlyStopping(monitor="loss", patience=20, restore_best_weights=True),
    ],
)
But does that mean it will abort the whole experiment? That's a problem, since we are fitting multiple models in a single experiment.
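For reference, a Keras `EarlyStopping` callback only ends the `fit()` call it is attached to; the script then moves on to the next model. Here is a minimal pure-Python sketch of that patience logic, assuming a minimized loss. The names `fit_with_early_stopping` and `loss_curves` are hypothetical and the loss values are made up for illustration; this is not Keras code.

```python
def fit_with_early_stopping(losses, patience):
    """Return how many epochs run before patience-based stopping kicks in.

    `losses` is a hypothetical per-epoch loss curve standing in for training.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # only this "fit" ends early
    return len(losses)


# Made-up loss curves for two models fitted in the same script.
loss_curves = {
    "model_a": [1.0, 0.8, 0.9, 0.9, 0.9, 0.5],
    "model_b": [1.0, 0.9, 0.8, 0.7, 0.6],
}

# Each model is fitted in turn; stopping model_a early does not skip model_b.
epochs_run = {
    name: fit_with_early_stopping(curve, patience=2)
    for name, curve in loss_curves.items()
}
print(epochs_run)
```

So if the whole task shows up as aborted, the trigger is more likely outside the training loop, for example the optimizer that launched the task.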
@<1541592204353474560:profile|GhastlySeaurchin98> , I think this is more related to how Optuna works: it aborts experiments that it decides to stop early. I think you would need to modify something in order for it to run the way you want.
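To illustrate why an optimizer might abort a running task, here is a rough sketch of a median-style stopping rule, similar in spirit to the median pruner that Optuna offers. It is an assumption that a rule like this is what triggered the aborts in this thread, and it is not Optuna's or ClearML's actual implementation. `should_abort` and the metric histories are hypothetical.

```python
from statistics import median


def should_abort(current_value, step, peer_histories):
    """Median-style rule for a minimized metric (illustrative only).

    Abort when the running task's value at `step` is worse than the median
    of what the other tasks reported at that same step.
    """
    peers = [history[step] for history in peer_histories if len(history) > step]
    if not peers:
        return False  # nothing to compare against yet
    return current_value > median(peers)


# Made-up validation losses reported by earlier tasks, one list per task.
peer_histories = [
    [1.0, 0.6, 0.4],
    [0.9, 0.5, 0.3],
    [1.1, 0.7, 0.5],
]

abort_slow_task = should_abort(0.8, step=1, peer_histories=peer_histories)
abort_good_task = should_abort(0.55, step=1, peer_histories=peer_histories)
print(abort_slow_task, abort_good_task)
```

Under a rule like this, a task that fits several models and reports one shared metric would be stopped as a whole, which matches the concern raised above.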
@<1523701070390366208:profile|CostlyOstrich36> thanks for your insight. We will probably check Optuna's code and see if we can eliminate this behaviour; if not, we are going to rethink our workflow.