
Hey all,
Basically, is there a way for a pipeline task method (e.g. registered via add_function_step) to start a task but exit without completing it, with the expectation that a future process will continue the task (e.g. via Task.init(continue_last_task=task_id))?

Context:
My team is working on adopting ClearML for pipelines. We have a pre-existing system for launching long-running jobs on an external cluster, which may first need to create resources. In the most common case, we want a task to cover the full lifespan of a job in this system: a task function submits such a job with a config that includes the task id, and when the job starts it continues that task with Task.init(continue_last_task=task_id), while the original process waits for the job to complete.
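For reference, a rough sketch of that current single-step flow (cluster_client and submit_and_wait are hypothetical names standing in for our own system):

from clearml import Task


def submit_and_wait(job_config):
    task = Task.current_task()
    job_config["clearml_task_id"] = task.id   # the job continues this task on startup
    job = cluster_client.submit(job_config)   # hypothetical: our existing job-launching system
    job.wait()                                # the original process waits for the job to finish
    # meanwhile, on the cluster, the job starts with:
    # Task.init(continue_last_task=job_config["clearml_task_id"])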

However, sometimes we want another task to be able to start as soon as the job is submitted. Naively, I tried to split our "submit-job-and-wait" step into two separate tasks, "submit" and "wait", where the job process itself would try to resume the "submit" task in the manner described above. Unfortunately, this fails because the "submitting" method finishes and marks the task as completed. By the time the job starts up and attempts to continue the task, it gets a ValueError saying Task object can only be updated if created or in_progress [status=completed fields=['hyperparams']] . Because this happens at job setup, the job fails. This is doubly bad: the job didn't get done, and the task claims it was successful.

Roughly, in the same way that continue_last_task lets a new process pick up a task, I think we need a complementary way for the launching method to disassociate from a task that we expect to be continued elsewhere, so that the method returning does not change the task status... but I see nothing in the docs that suggests a way to do this.
Any suggestions?

  
  

Answers


Hi @SmoggyLion3! There are a few things I can think of:

  • If you need to continue a task that is marked as completed, you can call clearml.Task.get_task(ID).mark_stopped(force=True) to mark it as stopped. You can do this in the job that picks up the task and wants to continue it, before calling Task.init (see the short sketch after the example below), or in a post_execute_callback in the pipeline itself, so the pipeline step marks itself as aborted. For example:
from clearml import PipelineDecorator


def callback(pipeline, node):
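    # force the step's task back to "stopped" (aborted) instead of leaving it
    # "completed" once the step finishes, so another process can continue it later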
    node.job.task.mark_stopped(force=True)


@PipelineDecorator.component(post_execute_callback=callback)
def step_1():
    print("hello")
    return 1


@PipelineDecorator.component()
def step_2(arg):
    print(arg)
    

@PipelineDecorator.pipeline(name="example", project="example")
def pipe():
    ret = step_1()
    step_2(ret)


PipelineDecorator.run_locally()
pipe()
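The other option mentioned above (marking the completed task as stopped from the process that picks it up) could look roughly like this; the resume helper is just for illustration, not an existing API:

from clearml import Task


def resume(task_id):
    # the submitting step already marked the task completed, so force it back to
    # "stopped" before continuing it
    Task.get_task(task_id=task_id).mark_stopped(force=True)
    return Task.init(continue_last_task=task_id)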
  • The "submitting" task could actually either:
  • create the task you want to be continued with Task.create, enqueue it, and wait for it to be aborted (see the sketch after the pseudocode below), or
  • create the task you want to be continued in a subprocess. To do that, first remove CLEARML_PROC_MASTER_ID and CLEARML_TASK_ID from your environment. For example (pseudocode):
import argparse
import os
from multiprocessing import Process

from clearml import Task


def do_work(should_continue=None):
    if not should_continue:
        Task.init(project_name="project", task_name="name")
    else:
        # pick up the task that was started (and then handed off) elsewhere
        Task.init(continue_last_task=should_continue)
    # some work
    if condition:  # stand-in for your own "hand this off to another process" check
        # abort - to be continued elsewhere
        pass


def pipe_step(should_continue=None):
    if should_continue:
        do_work(should_continue)
        return
    # remove the variables that tie child processes to the current (pipeline) task,
    # so the subprocess creates/continues its own task instead
    os.environ.pop("CLEARML_PROC_MASTER_ID", None)
    os.environ.pop("CLEARML_TASK_ID", None)
    Process(target=do_work).start()


# parse args with argparse; can fetch remote arguments without initializing the task
parser = argparse.ArgumentParser()
parser.add_argument("--should-continue", default=None)
args = parser.parse_args()
pipe_step(args.should_continue)
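And for the first sub-option above (Task.create, enqueue, then wait for the abort), a rough sketch; the project/queue names and submit_external_job are placeholders for your own setup, not part of the original suggestion:

from clearml import Task


def submit_step(job_config):
    # create the task that the external job will later continue
    remote_task = Task.create(project_name="example", task_name="external-job")
    job_config["clearml_task_id"] = remote_task.id  # the job calls Task.init(continue_last_task=...) with this id
    submit_external_job(job_config)  # placeholder for your own cluster submission
    # enqueue the task and block until it is stopped/aborted (or completed)
    Task.enqueue(remote_task, queue_name="default")
    remote_task.wait_for_status(
        status=(Task.TaskStatusEnum.stopped, Task.TaskStatusEnum.completed)
    )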
  
  