Hi @<1523701504827985920:profile|SubstantialElk6>
I would split the first stage into two. The first one passing data to the others, the second as "monitoring". Wdyt?
The first stage is a rank0 PyTorch script. The downstream stages are rankN scripts; they wait for the IP address of the first stage. But the first stage doesn't return, it simply waits for the rankN scripts to connect to it, so in this setup the rankN scripts never start. So it's probably necessary to have just a single stage.
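To illustrate the blocking behaviour, this is roughly what the scripts do (just a minimal torch.distributed sketch, not our actual code):
```
# minimal sketch of the rank0 / rankN rendezvous (illustrative only)
import os
import torch.distributed as dist

def init_worker(rank: int, world_size: int, master_addr: str, master_port: str = "29500"):
    # every rank needs the rank0 node's IP before it can join the process group
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port
    # init_process_group blocks until all world_size ranks have connected,
    # which is why the rank0 stage never "returns" to a pipeline controller
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)
    # ... training loop ...
    dist.destroy_process_group()
```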
If I were to start a single rank0 task and subsequent rankN tasks individually, it would be rather messy on the ClearML dashboard. Best to have either a single ClearML application or ClearML pipeline to do this.
The downstream stages are rankN scripts; they wait for the IP address of the first stage.
Is this more like multi-node training, rather than a pipeline?
Yes it is! But ClearML doesn't support multi-node training out of the box in a way that streamlines the process, so we are trying to figure out a way to do it.
If we run all the rank0 and rankN tasks individually, it defeats the purpose of using ClearML.
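What I'm picturing is a single rank0/controller task that spawns the rankN workers itself, so everything stays under one ClearML project. A rough sketch, assuming a pre-created "rankN-worker" template task and a "workers" queue (both names are made up):
```
# rough sketch: the rank0 task clones + enqueues the rankN worker tasks itself
import socket
from clearml import Task

controller = Task.init(project_name="multi-node", task_name="rank0-controller")
master_addr = socket.gethostbyname(socket.gethostname())
world_size = 4  # hypothetical number of nodes

# assumed: a template worker task was created beforehand
template = Task.get_task(project_name="multi-node", task_name="rankN-worker")

for rank in range(1, world_size):
    worker = Task.clone(source_task=template, name=f"rankN-worker-{rank}")
    # pass the rendezvous info as parameters the worker script reads back
    worker.set_parameters({
        "General/master_addr": master_addr,
        "General/rank": rank,
        "General/world_size": world_size,
    })
    Task.enqueue(worker, queue_name="workers")  # assumed queue name

# rank0 then calls init_process_group itself and blocks until all workers connect
```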