I see! Then the command clearml-agent execute --id <task_id here> should reload the reported scalars, and the task only needs to reload the last checkpoints, right?
That's a good question too! We haven't figured out the best way of continuing for either the grid search or Optuna. Can you suggest something?
AgitatedDove14 Let me clarify; I think you have misunderstood me.
The main reason we need the above-mentioned functionality is that some experiments need to run for a long time. Let's say weeks.
However, the importance of such an experiment is low, so when other, more important experiments appear, we need to temporarily pause (kill or something else) the running HPO task and reassign the resource to other needs.
Later, when the more important experiments have been completed, we can conti...
Thanks for the answers, AgitatedDove14.
I will look through the GH issues and open one if there isn't a related one.
Hi AgitatedDove14, I get the reported scalars from the web using
model_task = Task.get_task(task_id=model_task_id)
scalars = model_task.get_reported_scalars()
then register each of the scalars with something like
logger.report_scalar(title=metric_key, series=series_val['name'], value=y, iteration=x)
That gives me reported scalars to which I can append the rest of the model training reports.
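For anyone following along, the replay step described above could be sketched roughly as below. This is only a sketch under assumptions: the helper names replay_scalars and resume_scalars_from are hypothetical, the nested layout (title, then series, then a dict with 'x' and 'y' lists) is what get_reported_scalars() appeared to return for me, and the server may downsample long histories, so the replayed curve may not be exact.

```python
from typing import Any, Dict


def replay_scalars(scalars: Dict[str, Dict[str, Dict[str, Any]]], logger: Any) -> int:
    """Re-report every (iteration, value) point through logger.report_scalar.

    `scalars` is assumed to be nested as {title: {series: {'x': [...], 'y': [...]}}}.
    `logger` is any object exposing report_scalar(title, series, value, iteration).
    Returns the number of points that were re-reported.
    """
    count = 0
    for title, series_map in scalars.items():
        for series_key, series_val in series_map.items():
            # Fall back to the dict key if the entry carries no 'name' field.
            series_name = series_val.get("name", series_key)
            for x, y in zip(series_val["x"], series_val["y"]):
                logger.report_scalar(
                    title=title, series=series_name, value=y, iteration=int(x)
                )
                count += 1
    return count


def resume_scalars_from(model_task_id: str) -> int:
    """Hypothetical wrapper: copy scalars of a previous task into the current one.

    Assumes Task.init() has already been called in this process, so that
    Task.current_task() returns the new task.
    """
    from clearml import Task  # imported lazily so replay_scalars has no hard dependency

    previous_task = Task.get_task(task_id=model_task_id)
    logger = Task.current_task().get_logger()
    return replay_scalars(previous_task.get_reported_scalars(), logger)
```

After the replay, training can keep calling logger.report_scalar with iterations that continue from the last replayed x value, so the old and new points end up on the same plots.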
Workers are running across multiple machines and you can monitor if a task is dead by looking...
AgitatedDove14 I am not restarting the agent itself; I just need to be able to continue the experiment from the same progress point. It can be a different agent. In fact, I am just loading the progress onto another agent within the available queue.
Quick question: when you say the HPO Task, do you mean the HPO controller logic Task (i.e. the one launching the training jobs), or the actual training job itself (i.e. running with a specific set of parameters decided by the HPO controlling task)?
AgitatedDove14 Sorry, my bad! By HPO task I mean the actual training job itself.
We run the HPO controller logic Task on a separate CPU-only machine, so we can assume that this task is always on. Only the training jobs can go ...
Is there a way we can update the docs webpage?
Oh, I see! Apparently, there is no tags setter in the documentation, even though it is in the source code itself. Thanks!
SuccessfulKoala55 The Model object does not seem to have the update method. Is there a different version of the SDK you are looking at?