Answered

Hello, I Would Like To Use Spot Instances Together With The Aws Autoscaler To Train Models With Pytorch/Ignite And I Am Wondering How To Support Interruptions During The Training (In Case The Instance Is Terminated By Aws). Is There Anything Already Built

Hello, I would like to use spot instances together with the AWS autoscaler to train models with pytorch/ignite and I am wondering how to support interruptions during the training (in case the instance is terminated by AWS). Is there anything already built in ClearML that supports this use case already?

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Votes Newest

Answers 30

I also tried task.set_initial_iteration(-task.data.last_iteration) , hoping it would counteract the bug, didn’t work

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Hmm so I guess the actual code adds it into the reporting itself ...
How about we call:
task.set_initial_iteration(0)

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

AgitatedDove14 yes 🙂

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Still investigating, task.data.last_iteration is correct (equal to engine.state["iteration"] ) when I resume the training

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

JitteryCoyote63 that makes total sense!!
The reporting subprocess is not being updated with the new value! Let me check how we can pass it along...

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

self.clearml_task.get_initial_iteration() also gives me the correct number

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

AgitatedDove14 any chance you found something interesting? 🙂

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Still the same problem 😞

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Let me see if I can reproduce something

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Is there any logic on the server side that could change the iteration number?

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Although task.data.last_iteration is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

JitteryCoyote63

somehow the previous iterations, not sure yet if it’s coming from my code, ignite or clearml

ClearML will automatically continue reporting from the previous iteration (i.e. if before continuing the Task the last iteration was 100, then the next report with iteration =0 will actually be 101)

task.set_initial_iteration(engine.state.iteration)

Basically it is called automatically by ClearML (obviously only when you continue an aborted Task)

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

From our side everything work ok i would say if we resume from an epoch : https://demoapp.demo.clear.ml/projects/5678e22221984581b089fd110c8db1ea/compare-experiments;ids=da10aacccc97459ea13df27b9cd44561,8276746945f6424781309513cee21cf8/scalars/graph?scalars=graph

  				
Posted 
	3 years ago

					More  		
  Report
		
					TightElk12
				
					0
					 × 1

Hi JitteryCoyote63 , I cannot reproduce it... when I call set initial iteration 0, it does what I'm expecting, and resend the scalar. I tested with the clearml ignite example, any thoughts on how I can reproduce?

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I might gave an idea, could you test with:
` from clearml import Task
Task._report_subprocess_enabled = False

...

real code here `

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

If the reporting is done on a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could it be the case?

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Trying now your code… should take a couple of mins

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

ClearML has a task.set_initial_iteration , I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0") Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint) task.set_initial_iteration(engine.state.iteration)But still the same issue, I am not sure whether I use it correctly and if it’s a bug or not, AgitatedDove14 ? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

The problems comes from ClearML that thinks it starts from iteration 420, and then adds again the iteration number (421), so it starts logging from 420+421=841

JitteryCoyote63 Is this the issue ?

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

btw I monkey patched ignite’s function global_step_from_engine to print the iteration and passed the modified function to the ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)) . It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__ call__ .
` def call(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:

    if not isinstance(logger, ClearMLLogger):
        raise RuntimeError("Handler OutputHandler works only with ClearMLLogger")

    metrics = self._setup_output_metrics(engine)

    global_step = self.global_step_transform(engine, event_name)  # type: ignore[misc]

    if not isinstance(global_step, int):
        raise TypeError(
            f"global_step must be int, got {type(global_step)}."
            " Please check the output of global_step_transform."
        )

    for key, value in metrics.items():
        if isinstance(value, numbers.Number) or isinstance(value, torch.Tensor) and value.ndimension() == 0:
            logger.clearml_logger.report_scalar(title=self.tag, series=key, iteration=global_step, value=value)
        elif isinstance(value, torch.Tensor) and value.ndimension() == 1:
            for i, v in enumerate(value):
                logger.clearml_logger.report_scalar(
                    title=f"{self.tag}/{key}", series=str(i), iteration=global_step, value=v.item()
                )
        else:
            warnings.warn(f"ClearMLLogger output_handler can not log metrics value type {type(value)}") `I don’t understand how it can log a wrong iteration if the  ` global_step `  var has the right value in this function

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

AgitatedDove14 I made some progress:
In clearml.conf of the agent, I set: sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn’t taken into account) I’ve set task.set_initial_iteration(0) Now I was able to get the followin graph after resuming -

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Yes, I get 0 afterwards

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

JitteryCoyote63 no I think this is all controlled from the python side.
Let me check something

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

After you call task.set_initial_iteration(0) what do you get with task.get_initial_iteration() , is it 0 ?

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

So I shouldn’t even need to call the

task.set_initial_iteration

function

I think just removing this call should solve it, I think that what's going on is that this is called twice (once internal once manually by your code)

  				
Posted 
	3 years ago

					More  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

AgitatedDove14 Unfortunately no, I already had the problem before using the function, I added it hoping it would fix the issue but it didn’t

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

AgitatedDove14 Didn’t work 😞

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false I need to call task.set_initial_iteration(0)

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

AgitatedDove14 I do continue an aborted Task yes - So I shouldn’t even need to call the task.set_initial_iteration function, interesting! Do you have any ideas what could be a reason of the behavior I am observing? I am trying to find ways to debug it

  				
Posted 
	3 years ago

					More  		
  Report
		
					JitteryCoyote63
				
					0
					 × 1

Write your answer

1K Views

30 Answers

3 years ago

2 years ago