for example, one notebook will be dedicated to exploring columns, spotting outliers, and creating transformations for specific column values.
This actually implies each notebook is a standalone "process", which makes a ton of sense. But this is where notebooks and proper SW design break down: in traditional SW, the "notebooks" would be Python files, and then of course you can import one from another. Unfortunately, this does not work with notebooks...
If you are really keen on using notebooks, I would just keep a git repo with multiple Python files containing the functions you use in the notebooks (like the transformation functions etc.), and use the notebooks for exploration only, importing the Python files from the notebooks with "import myfile".
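For instance, a minimal sketch (the file, function, and data names here are made up): the repo could hold a transformations.py next to the notebooks:
```python
# transformations.py -- lives in the same git repo as the notebooks
import pandas as pd

def drop_price_outliers(df: pd.DataFrame, col: str = "price") -> pd.DataFrame:
    # keep everything below the 99th percentile (toy outlier rule)
    return df[df[col] < df[col].quantile(0.99)]
```
and then a notebook cell just imports and uses it:
```python
import pandas as pd
from transformations import drop_price_outliers

df = drop_price_outliers(pd.read_csv("data.csv"))
```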
Later you can build your pipeline from the logic you have in your various Python script files. I hope this helps shed some light 🙂
Hey Martin, thank you for your reply!
In a practical sense, I don't want to have all my steps in one notebook. I use notebooks to explore different aspects of the process, and some of them produce different pipeline steps.
for example, one notebook will be dedicated to exploring columns, spotting outliers, and creating transformations for specific column values.
another notebook will be for grouping, joining, and aggregating data from multiple sources, etc.
I might be missing something here, and would love to continue the conversation live, if anyone is interested in helping out.
Thanks in advance,
Gilad
Yes, it does help.
Thanks @<1523701205467926528:profile|AgitatedDove14>
Hi @<1619505588100665344:profile|GrievingHare27>
My understanding is that initiating a task with Task.init() captures the code for the entire notebook. I'm facing difficulties when attempting to build a final training pipeline (in a separate notebook) that uses only certain functions from the other notebooks/tasks as pipeline steps.
Well, this is kind of the limit of working with Jupyter notebooks: referencing code from one to another is not really feasible (of course you can start a git repo and write Python scripts, but that is for another discussion).
That said, I think what would work for you is the pipeline decorator flow.
Basically you can have one notebook with all your functions; then when you want to build a pipeline, you just create the pipeline from those functions and they will run as steps. If you need functions to be available to different pipeline components, just add them to the helper_functions argument
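Something like this, as a rough sketch (the step logic, data, and project names are made up for illustration; PipelineDecorator and its helper_functions argument are the ClearML interface):
```python
from clearml.automation.controller import PipelineDecorator

# hypothetical helper needed by more than one step; passing it via
# helper_functions packages it with each component
def drop_price_outliers(df, col="price"):
    return df[df[col] < df[col].quantile(0.99)]

@PipelineDecorator.component(return_values=["df"], helper_functions=[drop_price_outliers])
def transform_step(csv_path):
    import pandas as pd  # imports live inside the component, it runs standalone
    df = pd.read_csv(csv_path)
    return drop_price_outliers(df)

@PipelineDecorator.component(return_values=["df_agg"], helper_functions=[drop_price_outliers])
def aggregate_step(df):
    df = drop_price_outliers(df)
    return df.groupby("category").sum()

@PipelineDecorator.pipeline(name="prep pipeline", project="examples", version="0.1")
def pipeline_logic(csv_path):
    df = transform_step(csv_path)
    return aggregate_step(df)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # debug run; drop this to execute on agents
    pipeline_logic(csv_path="data.csv")
```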
wdyt?
fyi: the pipeline logic itself (if created from a notebook) is the notebook's main execution, so I would clean it up a bit before making it a pipeline; basically it should look like the script linked below