Great, looking forward!
Ah ok cool! Good to know, thanks for clarifying. Wasn't clear if it was a bug or expected behaviour.
Good news: new `best_model` is saved, add a tag `best`,
Already supported, (you just can't see the tag, but it is there :))
Interesting! Could you point me to where the tagging happens? Also, by "see" do you mean the UI? I tried `task.models["outputs"][-1].tag`, but there's no property tag (`AttributeError: 'Model' object has no attribute 'tag'`). It seems that only `OutputModel`s have the tag property?
I'll answer specifically for ignite on y...
It doesn't work for me.
I'm using Firefox 76.0.1 (64-bit), btw.
Ah nice, didn't notice that thread, will add.
I'd prefer to use config_dict; I think it's cleaner (as a workaround for metadata). However, since I'm using ignite, I think I have no way to actually do that, or at least I'm not aware of it.
But I think that whatever way one chooses, you will have to go through N best models right after training and find the best one (because of the issue we're discussing on ignite).
I think the ideal would be one of the two:
Only store a single `model_best`. After training you just find the model with t...
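A minimal sketch of the "go through the N best models after training and find the best one" step mentioned above. It assumes each checkpoint filename ends in `<score_name>=<value>.<ext>` and that a higher score means a better model; that naming pattern and the `pick_best` helper are assumptions made for illustration, not something confirmed in this thread.

```python
import re
from typing import Iterable, Optional

# Assumed filename shape: anything ending in "=<number>.<ext>",
# e.g. "best_model_7_accuracy=0.9034.pt" (hypothetical names).
_SCORE_RE = re.compile(r"=(-?\d+(?:\.\d+)?)\.\w+$")


def pick_best(filenames: Iterable[str]) -> Optional[str]:
    """Return the filename with the highest embedded score, or None if none match."""
    best_name, best_score = None, float("-inf")
    for name in filenames:
        match = _SCORE_RE.search(name)
        if match is None:
            continue  # skip files that carry no score in their name
        score = float(match.group(1))
        if score > best_score:
            best_name, best_score = name, score
    return best_name


checkpoints = [
    "best_model_3_accuracy=0.8120.pt",
    "best_model_7_accuracy=0.9034.pt",
    "best_model_5_accuracy=0.8877.pt",
    "notes.txt",
]
best = pick_best(checkpoints)
print(best)
```

With only a single stored `model_best`, this pass would not be needed at all, which is the appeal of the first option.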
Ah, I see it. Did you add it just now, or was I just blind? Either way, thanks for pointing it out.
AgitatedDove14 we have switched to an 8-core, 16 GB RAM machine and haven't faced the issue since. We'll let you know if it happens again. But I'm pretty confident it was the size of the machine that caused it (as I mentioned, it was a 1 CPU, 1.5 GB RAM machine).
AgitatedDove14 the funniest thing is that a train service called Allegro exists:
https://en.wikipedia.org/wiki/Allegro_(train)
Anytime I google - first result :D
So, using ignite I do the following:
` task.phases['valid'].add_event_handler(
    Events.EPOCH_COMPLETED(every=1),
    Checkpoint(to_save, TrainsSaver(output_uri=self.checkpoint_path), 'best', n_saved=1,
               score_function=lambda x: task.phases['valid'].state.metrics[self.monitor]
               if self.monitor_mode == 'max' else
               -task.phases['valid'].state.metrics[self.monitor],
               score_name=sel...
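The sign flip inside the `score_function` lambda above can be isolated as a small helper. This is only a sketch in plain Python: `make_score_function` and `get_metrics` are hypothetical names introduced here, and a plain dict stands in for the validation engine's `state.metrics`. The idea is that the checkpoint handler keeps the highest score, so a metric monitored in `'min'` mode is negated.

```python
from typing import Callable, Mapping


def make_score_function(
    get_metrics: Callable[[], Mapping[str, float]],
    monitor: str,
    mode: str,
) -> Callable[[object], float]:
    """Build a score function where a higher value always means a better model."""
    if mode not in ("max", "min"):
        raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")
    sign = 1.0 if mode == "max" else -1.0

    def score_function(engine: object) -> float:
        # The engine argument is ignored, as in the lambda above, which reads
        # the metrics from the validation phase instead.
        return sign * get_metrics()[monitor]

    return score_function


# A dict standing in for the validation metrics (illustrative values).
metrics = {"loss": 0.25, "accuracy": 0.9}
score_loss = make_score_function(lambda: metrics, "loss", "min")
score_acc = make_score_function(lambda: metrics, "accuracy", "max")
print(score_loss(None), score_acc(None))
```

Naming the helper also makes the `'max'`/`'min'` branch easier to test than the inline conditional expression.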
AgitatedDove14 Do we know what the `User aborted: stopping task (3)` means? It's different than when you actually abort a task yourself: `User aborted: stopping task (1)`.
I think that the problem happened because the VM we were using for the service queue workers was quite small (1 CPU, 1.5 GB RAM), and the error message above might point to that.
We switched to a bigger one and will let you know if that was the problem.