
Hi,

I'm using Task.register_abort_callback to store the latest model checkpoint, but the ergonomics of the callback feel weird to me. I have to use workarounds to get anything uploaded to the task from inside the callback, which feels counter-intuitive. See, for example, this snippet, which uses a PyTorch Lightning trainer to save the last model checkpoint on task abort:

def on_abort_callback() -> None:
    if not self.save_last:
        return

    task: Task = Task.current_task()
    # we have to mark the task started to get the checkpoint uploaded
    if task.get_status() == "stopped":
        logger.info("Marking task as `in_progress`")
        task.started()

    logger.info("Saving last checkpoint")
    trainer.save_checkpoint(
        self.last_filepath,
        weights_only=self.save_weights_only,
    )

    # Ensure that the trainer stops gracefully
    trainer.should_stop = True

    # reset the status to stopped
    if task.get_status() == "in_progress":
        logger.info("Marking task as `stopped` again")
        task.stopped()

logger.info("Registering model checkpoint abort callback")
Task.current_task().register_abort_callback(on_abort_callback)

Apparently, the task has already been marked as stopped by the time the callback is triggered, so the callback cannot modify it. Marking it as in_progress during the callback fixes this, but it is super wonky.
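To make the workaround a bit less scattered, the status toggling could be pulled into a small context manager. This is only a sketch: `temporarily_in_progress` is a hypothetical helper, not part of ClearML, and it relies solely on the three calls already used in the snippet above (`get_status()`, `started()`, `stopped()`). `FakeTask` is a stand-in so the example runs without a ClearML server.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_in_progress(task):
    """Hypothetical helper: keep `task` writable for the duration of the block.

    Uses only get_status(), started() and stopped(), as in the snippet above.
    """
    was_stopped = task.get_status() == "stopped"
    if was_stopped:
        task.started()  # reopen the task so updates are accepted
    try:
        yield task
    finally:
        # Restore the original status even if the body raised.
        if was_stopped and task.get_status() == "in_progress":
            task.stopped()


class FakeTask:
    """Stand-in for a ClearML task (assumption: only these three methods matter)."""

    def __init__(self, status):
        self.status = status
        self.calls = []

    def get_status(self):
        return self.status

    def started(self):
        self.calls.append("started")
        self.status = "in_progress"

    def stopped(self):
        self.calls.append("stopped")
        self.status = "stopped"


fake = FakeTask("stopped")
with temporarily_in_progress(fake) as t:
    status_inside = t.get_status()  # the checkpoint would be saved here
```

Inside the real callback, the body of the `with` block would be the `trainer.save_checkpoint(...)` call, with `Task.current_task()` passed in place of `FakeTask`. The `try`/`finally` means the task is set back to stopped even if saving the checkpoint fails.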

Am I using the method wrong? Isn't it a common use case to want to do something to the task on task abort?

  
  
Posted 3 months ago

Answers 7


This is on clearml v1.16.4

  
  
Posted 3 months ago

This is an example of the console output of a task aborted via the web UI:

Epoch 1/29 ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 699/16945 0:04:53 • 1:55:25 2.35it/s v_num: 0.000
2024-09-16 12:52:57,263 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - LAUNCHING CALLBACK (timeout 30.0 sec) ###
[2024-09-16 12:52:57,284][core.callbacks.model_checkpoint][INFO] - Marking task as `in_progress`
[2024-09-16 12:52:57,309][core.callbacks.model_checkpoint][INFO] - Saving last checkpoint
Epoch 1/29 ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 701/16945 0:04:54 • 1:55:03 2.35it/s v_num: 0.000
[2024-09-16 12:52:58,214][core.callbacks.model_checkpoint][INFO] - Marking task as `stopped` again
2024-09-16 12:52:58,260 - clearml.storage - INFO - Uploading: 49.56MB to /tmp/.clearml.upload_model_0zr9bxdd.tmp
                                           0% | 0.00/49.56 MB [00:00]:
Epoch 1/29 ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 701/16945 0:04:54 • 1:55:03 2.35it/s v_num: 0.000
2024-09-16 12:52:58,330 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - CALLBACK COMPLETED (1.07 sec) ###
2024-09-16 12:52:58,330 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
#########                       30% | 15.00/49.56 MB [00:00<00:00, 138.05MB/s]:
##################7              61% | 30.00/49.56 MB [00:00<00:00, 48.89MB/s]:
############################1    91% | 45.00/49.56 MB [00:00<00:00, 65.85MB/s]:
############################### 100% | 49.56/49.56 MB [00:00<00:00, 55.91MB/s]:
Epoch 1/29 ━╸━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 702/16945 0:04:54 • 1:55:13 2.35it/s v_num: 0.000
2024-09-16 12:52:59,154 - clearml.Task - INFO - Completed model upload to 

If I don't mark the task in_progress, then the output looks like this:

Epoch 0/29 ━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 508/2532 0:03:34 • 0:13:55 2.42it/s v_num: 0.000
2024-09-13 12:12:26,973 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - LAUNCHING CALLBACK (timeout 30.0 sec) ###
[2024-09-13 12:12:26,974][core.callbacks.model_checkpoint][INFO] - Saving checkpoint
Epoch 0/29 ━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 509/2532 0:03:34 • 0:13:56 2.42it/s v_num: 0.000
2024-09-13 12:12:27,581 - clearml.model - WARNING - Could not update last created model in Task b281b21329e3470ebc8959e831f28ff8, Task status 'stopped' cannot be updated
Epoch 0/29 ━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 510/2532 0:03:35 • 0:13:57 2.42it/s v_num: 0.000
2024-09-13 12:12:27,678 - clearml.storage - INFO - Uploading: 49.56MB to /tmp/.clearml.upload_model_mdw9vemq.tmp
                                           0% | 0.00/49.56 MB [00:00]:
Epoch 0/29 ━━━━━━━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 510/2532 0:03:35 • 0:13:57 2.42it/s v_num: 0.000
2024-09-13 12:12:27,700 - clearml.Task - WARNING - ### TASK STOPPING - USER ABORTED - CALLBACK COMPLETED (0.73 sec) ###
2024-09-13 12:12:27,701 - clearml.Task - WARNING - ### TASK STOPPED - USER ABORTED - STATUS CHANGED ###
  
  
Posted 3 months ago

Hi @CostlyOstrich36, the task is being aborted via the web UI. I have another method that catches local interrupts (exceptions such as keyboard interrupts and crashes). The behaviour is the same whether the task runs via an agent or locally from the CLI.

  
  
Posted 3 months ago

Hi @GiganticMole91, how is the task being stopped in your case? Is it aborted via the web UI or through some other method? Is the task running via the agent?

  
  
Posted 3 months ago

But it was definitely aborted via the web UI? Is it possible that your other method is interfering with this somehow? Can you disable it and check the behaviour?

  
  
Posted 3 months ago

@CostlyOstrich36 I just opened an issue on this.

  
  
Posted 3 months ago

I just tried that and the result is the same. The other method only triggers on exceptions.

  
  
Posted 3 months ago