for example, one notebook will be dedicated to exploring columns, spotting outliers, and creating transformations for specific column values.
This actually implies each notebook is a standalone "process", which makes a ton of sense. But this is where notebooks and proper SW design break down: in traditional SW, the "notebooks" would be Python files, and then of course you can import one from another. Unfortunately, this does not work with notebooks...
If you are really keen on using notebooks, I would just keep a git repo with multiple Python files containing the functions you use in the notebooks (like the transformation functions etc.), and use the notebooks for exploration only, importing the Python files from the notebooks with "import myfile".
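For instance, a minimal sketch (the file, function, and data names here are made up): the repo could hold a transformations.py next to the notebooks:
```python
# transformations.py -- lives in the same git repo as the notebooks
import pandas as pd

def drop_price_outliers(df: pd.DataFrame, col: str = "price") -> pd.DataFrame:
    # keep everything below the 99th percentile (toy outlier rule)
    return df[df[col] < df[col].quantile(0.99)]
```
and then a notebook cell just imports and uses it:
```python
import pandas as pd
from transformations import drop_price_outliers

df = drop_price_outliers(pd.read_csv("data.csv"))
```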
Later you can build your pipeline from the logic you have in your various Python script files. I hope this helps shed some light 🙂
Hey Martin, thank you for your reply!
In a practical sense, I don't want to have all my steps in one notebook. I use notebooks to explore different aspects of the process, and some of them produce different pipeline steps.
for example, one notebook will be dedicated to exploring columns, spotting outliers, and creating transformations for specific column values.
another notebook will be for grouping, joining, and aggregating data from multiple sources, etc.
I might be missing something here, and would love to continue the conversation live, if anyone is interested in helping out.
Thanks in advance,
Gilad
Yes, it does help.
Thanks @<1523701205467926528:profile|AgitatedDove14>
Hi @<1619505588100665344:profile|GrievingHare27>
My understanding is that initiating a task with Task.init() captures the code for the entire notebook. I'm facing difficulties when attempting to build a final training pipeline (in a separate notebook) that uses only certain functions from the other notebooks/tasks as pipeline steps.
Well, this is kind of the limit of working with Jupyter notebooks: referencing code from one to another is not really feasible (of course you can start a git repo and write Python scripts, but that is for another discussion).
That said, I think what would work for you is the pipeline decorator flow.
Basically you can have one notebook with all your functions; then when you want to build a pipeline, you just create the pipeline from those functions and they will run as steps. If you need functions to be available to different pipeline components, just add them to the helper_functions argument
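Something like this, as a rough sketch (the step logic, data, and project names are made up for illustration; PipelineDecorator and its helper_functions argument are the ClearML interface):
```python
from clearml.automation.controller import PipelineDecorator

# hypothetical helper needed by more than one step; passing it via
# helper_functions packages it with each component
def drop_price_outliers(df, col="price"):
    return df[df[col] < df[col].quantile(0.99)]

@PipelineDecorator.component(return_values=["df"], helper_functions=[drop_price_outliers])
def transform_step(csv_path):
    import pandas as pd  # imports live inside the component, it runs standalone
    df = pd.read_csv(csv_path)
    return drop_price_outliers(df)

@PipelineDecorator.component(return_values=["df_agg"], helper_functions=[drop_price_outliers])
def aggregate_step(df):
    df = drop_price_outliers(df)
    return df.groupby("category").sum()

@PipelineDecorator.pipeline(name="prep pipeline", project="examples", version="0.1")
def pipeline_logic(csv_path):
    df = transform_step(csv_path)
    return aggregate_step(df)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug run; drop this to execute on agents
    pipeline_logic(csv_path="data.csv")
```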
wdyt?
fyi: the pipeline logic itself (if created from a notebook) is the notebook's main execution, so I would clean it up a bit before making it a pipeline; basically it should look like the script linked below