Examples: query, "exact match", wildcard*, wild?ard, wild*rd
Fuzzy search: cake~ (finds cakes, bake)
Term boost: "red velvet"^4, chocolate^2
Field grouping: tags:(+work -"fun-stuff")
Escaping: Escape characters +-&|!(){}[]^"~*?:\ with \, e.g. \+
Range search: properties.timestamp:[1587729413488 TO *] (inclusive), properties.title:{A TO Z}(excluding A and Z)
Combinations: chocolate AND vanilla, chocolate OR vanilla, (chocolate OR vanilla) NOT "vanilla pudding"
Field search: properties.title:"The Title" AND text
Answered
Hello, I Would Like To Use Spot Instances Together With The Aws Autoscaler To Train Models With Pytorch/Ignite And I Am Wondering How To Support Interruptions During The Training (In Case The Instance Is Terminated By Aws). Is There Anything Already Built

Hello, I would like to use spot instances together with the AWS autoscaler to train models with pytorch/ignite and I am wondering how to support interruptions during the training (in case the instance is terminated by AWS). Is there anything already built in ClearML that supports this use case already?

  
  
Posted 3 years ago
Votes Newest

Answers 30


I also tried task.set_initial_iteration(-task.data.last_iteration) , hoping it would counteract the bug, didn’t work

  
  
Posted 3 years ago

self.clearml_task.get_initial_iteration() also gives me the correct number

  
  
Posted 3 years ago

JitteryCoyote63 no I think this is all controlled from the python side.
Let me check something

  
  
Posted 3 years ago

After you call task.set_initial_iteration(0) what do you get with task.get_initial_iteration() , is it 0 ?

  
  
Posted 3 years ago

AgitatedDove14 Didn’t work 😞

  
  
Posted 3 years ago

The problems comes from ClearML that thinks it starts from iteration 420, and then adds again the iteration number (421), so it starts logging from 420+421=841

JitteryCoyote63 Is this the issue ?

  
  
Posted 3 years ago

AgitatedDove14 yes 🙂

  
  
Posted 3 years ago

So I shouldn’t even need to call the 

task.set_initial_iteration

  function

I think just removing this call should solve it, I think that what's going on is that this is called twice (once internal once manually by your code)

  
  
Posted 3 years ago

Still the same problem 😞

  
  
Posted 3 years ago

AgitatedDove14 any chance you found something interesting? 🙂

  
  
Posted 3 years ago

JitteryCoyote63 that makes total sense!!
The reporting subprocess is not being updated with the new value! Let me check how we can pass it along...

  
  
Posted 3 years ago

ClearML has a task.set_initial_iteration , I used it as such:
checkpoint = torch.load(checkpoint_fp, map_location="cuda:0") Checkpoint.load_objects(to_load=self.to_save, checkpoint=checkpoint) task.set_initial_iteration(engine.state.iteration)But still the same issue, I am not sure whether I use it correctly and if it’s a bug or not, AgitatedDove14 ? (I am using clearml 1.0.4rc1, clearml-agent 1.0.0)

  
  
Posted 3 years ago

I might gave an idea, could you test with:
` from clearml import Task
Task._report_subprocess_enabled = False

...

real code here `

  
  
Posted 3 years ago

Let me see if I can reproduce something

  
  
Posted 3 years ago

AgitatedDove14 Unfortunately no, I already had the problem before using the function, I added it hoping it would fix the issue but it didn’t

  
  
Posted 3 years ago

If the reporting is done on a subprocess, I can imagine that the task.set_initial_iteration(0) call is only effective in the main process, not in the subprocess used for reporting. Could it be the case?

  
  
Posted 3 years ago

Is there any logic on the server side that could change the iteration number?

  
  
Posted 3 years ago

Still investigating, task.data.last_iteration is correct (equal to engine.state["iteration"] ) when I resume the training

  
  
Posted 3 years ago

Yes, I get 0 afterwards

  
  
Posted 3 years ago

Although task.data.last_iteration  is correct when resuming, there is still this doubling effect when logging metrics after resuming 😞

  
  
Posted 3 years ago

Hmm so I guess the actual code adds it into the reporting itself ...
How about we call:
task.set_initial_iteration(0)

  
  
Posted 3 years ago

JitteryCoyote63

somehow the previous iterations, not sure yet if it’s coming from my code, ignite or clearml

ClearML will automatically continue reporting from the previous iteration (i.e. if before continuing the Task the last iteration was 100, then the next report with iteration =0 will actually be 101)

task.set_initial_iteration(engine.state.iteration)

Basically it is called automatically by ClearML (obviously only when you continue an aborted Task)

  
  
Posted 3 years ago

Hi JitteryCoyote63 , I cannot reproduce it... when I call set initial iteration 0, it does what I'm expecting, and resend the scalar. I tested with the clearml ignite example, any thoughts on how I can reproduce?

  
  
Posted 3 years ago

The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false I need to call task.set_initial_iteration(0)

  
  
Posted 3 years ago

Mmmh unfortunately not easily… I will try to debug deeper today, is there a way to resume a task from code to debug locally?
Something like replacing Task.init with Task.get_task so that Task.current_task is the same task as the output of Task.get_task

  
  
Posted 3 years ago

AgitatedDove14 I do continue an aborted Task yes - So I shouldn’t even need to call the task.set_initial_iteration function, interesting! Do you have any ideas what could be a reason of the behavior I am observing? I am trying to find ways to debug it

  
  
Posted 3 years ago

btw I monkey patched ignite’s function global_step_from_engine to print the iteration and passed the modified function to the ClearMLLogger.attach_output_handler(…, global_step_transform=patched_global_step_from_engine(engine)) . It prints the correct iteration number when calling ClearMLLogger.OutputHandler.__ call__ .
` def call(self, engine: Engine, logger: ClearMLLogger, event_name: Union[str, Events]) -> None:

    if not isinstance(logger, ClearMLLogger):
        raise RuntimeError("Handler OutputHandler works only with ClearMLLogger")

    metrics = self._setup_output_metrics(engine)

    global_step = self.global_step_transform(engine, event_name)  # type: ignore[misc]

    if not isinstance(global_step, int):
        raise TypeError(
            f"global_step must be int, got {type(global_step)}."
            " Please check the output of global_step_transform."
        )

    for key, value in metrics.items():
        if isinstance(value, numbers.Number) or isinstance(value, torch.Tensor) and value.ndimension() == 0:
            logger.clearml_logger.report_scalar(title=self.tag, series=key, iteration=global_step, value=value)
        elif isinstance(value, torch.Tensor) and value.ndimension() == 1:
            for i, v in enumerate(value):
                logger.clearml_logger.report_scalar(
                    title=f"{self.tag}/{key}", series=str(i), iteration=global_step, value=v.item()
                )
        else:
            warnings.warn(f"ClearMLLogger output_handler can not log metrics value type {type(value)}") `I don’t understand how it can log a wrong iteration if the  ` global_step `  var has the right value in this function
  
  
Posted 3 years ago

Trying now your code… should take a couple of mins

  
  
Posted 3 years ago

AgitatedDove14 I made some progress:
In clearml.conf of the agent, I set: sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn’t taken into account) I’ve set task.set_initial_iteration(0) Now I was able to get the followin graph after resuming -

  
  
Posted 3 years ago
1K Views
30 Answers
3 years ago
one year ago
Tags