
Thread re: Pipelines and how they're meant to be used / how long they take to orchestrate.

@<1523701205467926528:profile|AgitatedDove14> I appreciated your advice that tasks should do a sufficient amount of work, so I refactored this whole pipeline into just two steps, jamming all artifacts and metrics into one task. This required ~12 hours of refactoring (writing a lot of functions that took a task as input so I could re-use logging and keep the previous tasks backwards-compatible).
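
For illustration, the kind of task-aware helper being described might look like this (a minimal sketch; report_metrics and its arguments are hypothetical stand-ins, though Task.get_logger() and report_scalar() are standard ClearML calls):

from clearml import Task

def report_metrics(task: Task, metrics: dict, iteration: int) -> None:
    # passing the task in lets every refactored step reuse the same logging
    logger = task.get_logger()
    for name, value in metrics.items():
        logger.report_scalar(title=name, series="backtest", value=value, iteration=iteration)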

This pipeline represents all the steps required to build a dataset, train a model, and evaluate how the model impacts a downstream process, measuring performance against a control and a baseline, in order to make a ship/no-ship decision ("did this model perform well?").

The thing is, now it's time to backtest this. We want to repeat this same process for every week (time-travel to the past and battle-test the model), so I wrote a pipeline that basically runs a for-loop to repeat the process. I refactored the steps from 11 -> 2 specifically to test "how quickly can I run a backtest" (removing almost all the overhead of task communication and making the DAG easier to compute).
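
For context, a minimal sketch of the kind of backtest loop described above (run_week and the week list are hypothetical placeholders, not the actual code; add_function_step and start_locally are real PipelineController methods):

from clearml import PipelineController

def run_week(week_start: str) -> dict:
    # hypothetical step body: build the dataset, train, and evaluate for one week
    return {"week": week_start}

if __name__ == "__main__":
    pipe = PipelineController(name="backtest", project="examples", version="1.0")
    for week in ["2020-01-06", "2020-01-13", "2020-01-20"]:
        # one naively parallel step per week; no step depends on another
        pipe.add_function_step(
            name=f"backtest_{week}",
            function=run_week,
            function_kwargs={"week_start": week},
        )
    pipe.start_locally()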

But I'm finding that it STILL takes 30+ minutes just to plan this DAG. It hangs for a long time after "Starting Task Execution... results page:...", and then computing the DAG itself takes longer than the steps in the pipeline.

In my mind, something this "straightforward" (naively parallel, basically map-reduce) should take only a minute or two to set up before the tasks start executing. Even when I run the pipeline from my local machine, it's startling how long it takes to start doing "real work".

So my question is: am I doing something wrong? Are pipelines of even moderate size expected to spin up quickly? Should I try multiple ClearML versions?

  
  
Posted 2 months ago

Answers 7


If I put pipe.start earlier in the code, the pipeline fails to execute the actual steps.

pipe.start should be called after the pipeline has been constructed, and it should be the "last" call of the script.
Not sure I follow; what do you mean by "before" the code?

  
  
Posted 2 months ago

This seems like the same discussion, no?

  
  
Posted 2 months ago

Is it? I can't tell whether these delays (DAG computation) are pipeline-specific (I get that a pipeline is just a type of task), but it felt like a different question, since I'm asking "are pipelines like this appropriate?"

Is there something fundamentally slower about using pipe.start() at the end of a pipeline vs. pipe.start_locally()?

  
  
Posted 2 months ago

pipe.start_locally() will run the DAG compute part on the same machine, whereas pipe.start() will start it on a remote worker (if it is not already running on a remote worker).
Basically, "pipe.start()" executed via an agent will start the compute with no overhead.
Does that help?
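
To make the two modes concrete, a short sketch (assuming a PipelineController named pipe has already been built; the queue name is an assumption, though "services" is ClearML's default for pipeline controllers):

pipe.start_locally()            # DAG is computed and controlled from this machine
# pipe.start(queue="services")  # or: enqueue the controller itself to run on an agent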

  
  
Posted 2 months ago

from clearml import PipelineController

if __name__ == "__main__":

    pipe = PipelineController(...)

    # after instantiation, before "the code" that creates the pipeline.
    # normal tasks can handle task.execute_remotely() at this stage...
    pipe = add_steps_to_pipe(pipe)
    ...

    # after the pipeline is defined. best I can tell, this *has* to be the last thing in the code.
    pipe.start_locally()  # or just .start()
  
  
Posted 2 months ago

I understood that part, but I noticed that when I put in the code to start remotely, the consequence seems to be that the DAG computation happens twice: once on my machine as it runs, and then again remotely (this is at least part of why it's slower). If I put pipe.start earlier in the code, the pipeline fails to execute the actual steps.

This is unlike tasks, which somehow are smart enough to publish in draft form when task.execute_remotely() is up top.
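
For reference, the task pattern being contrasted here (a minimal sketch; the project, task, and queue names are hypothetical):

from clearml import Task

task = Task.init(project_name="examples", task_name="my_task")
# stops local execution here and enqueues the task as a draft for an agent
task.execute_remotely(queue_name="default")
# everything below this line only runs on the remote worker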

Do I just leave off pipe.start?

  
  
Posted 2 months ago

Mind-blowing... but somehow, later the same day, I got the same pipeline to create its DAG and start running in under a minute.

I don't know exactly what I changed. The pipeline task was run locally (which I've never done before), then cloned to run remotely in my services queue. And then it just flew through the experiment at the pace I expected.

So there's hope. I'll keep stress-testing it and see what causes the differences. I was right to suspect that such a simple DAG should not take upwards of 30-60 minutes to compute.
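
For anyone reproducing this, the run-locally-then-clone flow described above can also be done from code, roughly like this (a sketch; the task ID is a placeholder, while Task.clone and Task.enqueue are standard ClearML calls):

from clearml import Task

# clone the pipeline controller task that was run locally, then enqueue the copy
cloned = Task.clone(source_task="<pipeline_task_id>")
Task.enqueue(cloned, queue_name="services")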

  
  
Posted 2 months ago