I run everything self-hosted and without Docker, using pipe.start_locally(run_pipeline_steps_locally=True).
It worked OK on my local machine but failed on the remote one. The only difference I can think of is a huge git diff, which ClearML is complaining about. Can this be the reason?
Also, a totally separate issue: I wonder if there is early stopping for when it's obvious that the suggested hparams are suboptimal. I couldn't find anything in the docs. I know there is max_iteration_per_job, but I couldn't understand how to use it from the docs either.
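To make the early-stopping question concrete, here is a minimal sketch of one common rule (a median stopping rule), written in plain Python. It is not ClearML's or Optuna's implementation; the function name, its arguments, and the assumption that lower metric values are better (for example, validation loss) are all hypothetical.

```python
from statistics import median


def should_stop_early(current_curve, finished_curves, min_steps=3):
    """Median stopping rule (hypothetical helper, lower metric is better).

    Stop a running trial when its best value so far is still worse than the
    median of the running averages of other trials at the same step.
    """
    step = len(current_curve)
    if step < min_steps:
        # Too early to judge the trial.
        return False
    # Running average of every other trial that has reached this step.
    peers = [sum(curve[:step]) / step for curve in finished_curves if len(curve) >= step]
    if not peers:
        # Nothing to compare against yet.
        return False
    return min(current_curve) > median(peers)


finished = [[1.0, 0.8, 0.6], [1.0, 0.7, 0.5], [1.2, 0.9, 0.7]]
stop_bad = should_stop_early([2.0, 1.9, 1.8], finished)
stop_good = should_stop_early([0.9, 0.6, 0.4], finished)
```

A training script could call a check like this after each reported iteration and abort itself when it returns True. How the other trials' curves are fetched from the HPO parent is left out, since that depends on the setup.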
At the moment I somehow made it work, at least with a debug sample.
I think it's actually a bug anyway, since there isn't even a task creation log in the console, and it's still marked as successfully completed.
I installed ClearML Server following the tutorial, and yes, I did it through Docker. As I understand it, I need to look for the Elastic logs in /var/log/ecs/ecs-agent.log, but I have no such file, or even an ecs folder, in this location on my ClearML Server machine.
Since I don't use Docker and run with pipe.start_locally(run_pipeline_steps_locally=True) on both local and server, do I still have some kind of containers? How can I view their logs?
So I have an HPO pipeline like this, with many modules to be optimized.
And after some time I get a picture like this, where the same hparams are trained repeatedly.
Well, that just didn't work for me: I set it to 1, and the experiments ran for the full time anyway.
My idea for a fix is to add code to each training script that fetches the parent HPO's artifact table and looks for the same hparams; if they already exist, the task aborts. This would fix the wasted compute issue, but I wonder if it can be done better, like spending this compute on other hparams that would otherwise be left untried.
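The duplicate check described above could look roughly like the sketch below. It covers only the comparison logic in plain Python; how the set of already-tried hparams is loaded from the parent HPO task's artifact table is not shown, and the names params_key and is_duplicate are hypothetical helpers, not part of ClearML or Optuna.

```python
def params_key(params, ndigits=6):
    """Build an order-independent, hashable key for a hyperparameter dict.

    Floats are rounded so that tiny numeric differences do not hide duplicates.
    """
    return tuple(sorted(
        (name, round(value, ndigits) if isinstance(value, float) else value)
        for name, value in params.items()
    ))


def is_duplicate(params, tried_keys):
    """Return True if params were seen before; otherwise record them."""
    key = params_key(params)
    if key in tried_keys:
        return True
    tried_keys.add(key)
    return False


tried = set()
first = is_duplicate({"x": 10, "y": 20}, tried)       # new combination
repeat = is_duplicate({"y": 20, "x": 10}, tried)      # same values, different order
different = is_duplicate({"x": 10, "y": 21}, tried)   # new combination
```

A training script would run this check right after reading its hparams and abort if it returns True.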
@<1523701070390366208:profile|CostlyOstrich36> I mean that Optuna suggests {x=10, y=20}, for example. That becomes the next best result in the HPO process, and then Optuna tends to suggest the very same hyperparameters while the parameter space hasn't been fully explored. If I cancel trials with the same hparams, it's likely that a major part of the defined total_max_jobs will be cancelled, which makes this parameter hardly usable.
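One way to avoid burning the job budget on cancelled repeats is to swap a repeated suggestion for a combination that has not been tried yet. The sketch below assumes a small, fully discrete search space that can be enumerated; next_untried and the example space are hypothetical, and this is plain random selection, not something Optuna or ClearML does out of the box.

```python
import itertools
import random


def next_untried(space, tried, rng=None):
    """Pick a random untried combination from a discrete search space.

    space: dict mapping parameter name -> list of allowed values.
    tried: set of value tuples (ordered by sorted parameter name) already used.
    Returns a params dict, or None when every combination has been tried.
    """
    rng = rng or random.Random()
    names = sorted(space)
    remaining = [
        combo
        for combo in itertools.product(*(space[name] for name in names))
        if combo not in tried
    ]
    if not remaining:
        return None
    choice = rng.choice(remaining)
    tried.add(choice)
    return dict(zip(names, choice))


space = {"x": [10, 20], "y": [20, 30]}
tried = {(10, 20)}  # {x=10, y=20} was already trained
suggestions = [next_untried(space, tried, random.Random(0)) for _ in range(3)]
exhausted = next_untried(space, tried)
```

For continuous parameters this enumeration does not apply; there the duplicate check would need a tolerance and the replacement would have to be re-sampled instead.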
Thanks. By the way, can you say anything about early stopping? I asked about it here. I guess it can also only be done through 'hacky' solutions?
OK, so it will only work for new ones?