Still investigating, task.data.last_iteration
is correct (equal to engine.state["iteration"]
) when I resume the training
From our side everything work ok i would say if we resume from an epoch : https://demoapp.demo.clear.ml/projects/5678e22221984581b089fd110c8db1ea/compare-experiments;ids=da10aacccc97459ea13df27b9cd44561,8276746945f6424781309513cee21cf8/scalars/graph?scalars=graph
AgitatedDove14 any chance you found something interesting? 🙂
The problems comes from ClearML that thinks it starts from iteration 420, and then adds again the iteration number (421), so it starts logging from 420+421=841
JitteryCoyote63 Is this the issue ?
Trying now your code… should take a couple of mins
JitteryCoyote63 that makes total sense!!
The reporting subprocess is not being updated with the new value! Let me check how we can pass it along...
So I shouldn’t even need to call the
task.set_initial_iteration
function
I think just removing this call should solve it, I think that what's going on is that this is called twice (once internal once manually by your code)
I also tried task.set_initial_iteration(-task.data.last_iteration)
, hoping it would counteract the bug, didn’t work
JitteryCoyote63 no I think this is all controlled from the python side.
Let me check something
JitteryCoyote63
somehow the previous iterations, not sure yet if it’s coming from my code, ignite or clearml
ClearML will automatically continue reporting from the previous iteration (i.e. if before continuing the Task the last iteration was 100, then the next report with iteration =0 will actually be 101)
task.set_initial_iteration(engine.state.iteration)
Basically it is called automatically by ClearML (obviously only when you continue an aborted Task)
Is there any logic on the server side that could change the iteration number?
I might gave an idea, could you test with:
` from clearml import Task
Task._report_subprocess_enabled = False
...
real code here `
AgitatedDove14 I made some progress:
In clearml.conf of the agent, I set: sdk.development.report_use_subprocess = false
(because I had the feeling that Task._report_subprocess_enabled = False
wasn’t taken into account) I’ve set task.set_initial_iteration(0)
Now I was able to get the followin graph after resuming -
Although task.data.last_iteration
is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞
Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing Task.init
with Task.get_task
so that Task.current_task
is the same task as the output of Task.get_task
btw I monkey patched ignite’s function global_step_from_engine
to print the iteration and passed the modified function to the ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine))
. It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__ call__ .
` def call(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:
if not isinstance(logger, ClearMLLogger):
raise RuntimeError("Handler OutputHandler works only with ClearMLLogger")
metrics = self._setup_output_metrics(engine)
global_step = self.global_step_transform(engine, event_name) # type: ignore[misc]
if not isinstance(global_step, int):
raise TypeError(
f"global_step must be int, got {type(global_step)}."
" Please check the output of global_step_transform."
)
for key, value in metrics.items():
if isinstance(value, numbers.Number) or isinstance(value, torch.Tensor) and value.ndimension() == 0:
logger.clearml_logger.report_scalar(title=self.tag, series=key, iteration=global_step, value=value)
elif isinstance(value, torch.Tensor) and value.ndimension() == 1:
for i, v in enumerate(value):
logger.clearml_logger.report_scalar(
title=f"{self.tag}/{key}", series=str(i), iteration=global_step, value=v.item()
)
else:
warnings.warn(f"ClearMLLogger output_handler can not log metrics value type {type(value)}") `I don’t understand how it can log a wrong iteration if the ` global_step ` var has the right value in this function
self.clearml_task.get_initial_iteration()
also gives me the correct number
ClearML has a task.set_initial_iteration
, I used it as such:checkpoint = torch.load(checkpoint_fp, map_location="cuda:0") Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint) task.set_initial_iteration(engine.state.iteration)
But still the same issue, I am not sure whether I use it correctly and if it’s a bug or not, AgitatedDove14 ? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)
AgitatedDove14 I do continue an aborted Task yes - So I shouldn’t even need to call the task.set_initial_iteration
function, interesting! Do you have any ideas what could be a reason of the behavior I am observing? I am trying to find ways to debug it
After you call task.set_initial_iteration(0)
what do you get with task.get_initial_iteration()
, is it 0 ?
AgitatedDove14 Unfortunately no, I already had the problem before using the function, I added it hoping it would fix the issue but it didn’t
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false
I need to call task.set_initial_iteration(0)
If the reporting is done on a subprocess, I can imagine that the task.set_initial_iteration(0)
call is only effective in the main process, not in the subprocess used for reporting. Could it be the case?
Hmm so I guess the actual code adds it into the reporting itself ...
How about we call:task.set_initial_iteration(0)
Hi JitteryCoyote63 , I cannot reproduce it... when I call set initial iteration 0, it does what I'm expecting, and resend the scalar. I tested with the clearml ignite example, any thoughts on how I can reproduce?