Generally like the kedro project and pipeline setup that I have seen so far, but haven’t started using it in anger yet. Been looking at clearml as well, so wanted to check how well these two work together
AgitatedDove14 , we are also in the same boat. We tried Kedro and found the organizational aspect to be really clean and would love to stick with it. We also like how each node of the pipeline is an independent, reusable block.
ClearML is definitely more comprehensive (especially the concepts of Tasks, Data and agents) and has its special place in our project. Now, we are trying to figure out how to run our Kedro pipelines in ClearML.
After playing with both for a few days, we still can't wrap our heads around integrating them.
We have the following simple use case right now:
- Download images from S3 bucket
- Do pre-processing on the images (add new objects in the foreground, etc.)
- Run existing ML models on these images to generate annotations (.txt files)
- Override labels for some of the images (since we know what kind of images these are)
- Create the required directory structure for https://github.com/ultralytics/yolov5
- Start the training (python train.py)
Each of these steps ([2], [3], [4], [5 & 6]) can be thought of as independent Kedro nodes that can be reused in the future. Now, how to integrate this with ClearML is unclear to us.
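Just to make the mapping concrete, here is a simplified sketch of how we picture these steps as a Kedro pipeline (all node and dataset names are made-up placeholders, not our actual code):
```python
# Hypothetical sketch -- node/dataset names are placeholders
from kedro.pipeline import Pipeline, node

def preprocess_images(raw_images): ...        # step [2]
def generate_annotations(images): ...         # step [3]
def override_labels(annotations): ...         # step [4]
def build_yolo_dataset(images, labels): ...   # steps [5 & 6]

def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(preprocess_images, inputs="raw_images", outputs="images"),
            node(generate_annotations, inputs="images", outputs="annotations"),
            node(override_labels, inputs="annotations", outputs="labels"),
            node(build_yolo_dataset, inputs=["images", "labels"], outputs="yolo_dataset"),
        ]
    )
```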
What we tried so far:
We found that someone in this community has already tried this. We took https://github.com/noklam/allegro_test/ and added Task.init() to each of the nodes ( https://github.com/noklam/allegro_test/blob/main/src/allegro_test/pipelines/data_engineering/nodes.py#L41 ). We also added Task.execute_remotely() so that the node would not be executed immediately. Then, we added one Task.init() to https://github.com/noklam/allegro_test/blob/main/src/allegro_test/pipelines/data_engineering/pipeline.py#L39 as well.
However, running kedro run afterwards did not run the pipeline, and we did not get the logs in the ClearML UI.
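Roughly, what we added to each node looked like this (simplified; project/queue names are just examples):
```python
# Simplified version of our attempt -- project/queue names are illustrative
from clearml import Task

def split_data(data, example_test_data_ratio):
    task = Task.init(project_name="allegro_test", task_name="split_data")
    # stop local execution here and enqueue the task for an agent instead
    task.execute_remotely(queue_name="default")
    ...
```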
Even if we fix the logging issue, we are not confident that this design approach is the right one. We also have our doubts about whether each small independent node should actually call Task.init().
Any help would be greatly appreciated!
TL;DR: We are confused about how to incorporate the "Authoring pipelines" goal of Kedro (which we really like) into ClearML.
TrickySheep9
Is there a way to see a roadmap on such things?
Hmm, I think we have some internal one, but I have to admit these things change priority all the time (so it is hard to put an actual date on them).
Generally speaking, pipelines with functions should be out in a week or so, and TaskScheduler + Task Triggers should be out at about the same time.
UI for creating pipelines directly from the web app is in the works, but I do not have a specific ETA on that
So the main difference is that Kedro pipelines are function-based steps (I might be overly simplifying, so please take it with a grain of salt), while in ClearML a pipeline step is a Job, i.e. it needs its own environment and runs longer than a few seconds (as opposed to a single function)
This actually ties well with the next version of pipelines we are working on
Is there a way to see a roadmap on such things, AgitatedDove14?
The same can be said for ClearML: each of these steps is a clearml Task (with its own repo/environment)
I think the tasks are too small to merit a separate repo/environment. One example is a node that resizes the images: this node receives a Dataset as input, iterates over each image, resizes it, and outputs a new Dataset, which is used by the next node downstream in the pipeline.
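For illustration, something along these lines (hypothetical code, not our actual node):
```python
# Hypothetical resize node -- directory layout and size are made up
import os
from PIL import Image

def resize_images(input_dir: str, output_dir: str, size=(640, 640)) -> str:
    """Kedro node: resize every image in input_dir and write the results to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(input_dir):
        img = Image.open(os.path.join(input_dir, name))
        img.resize(size).save(os.path.join(output_dir, name))
    return output_dir
```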
the main use for Kedro is the nice web UI of the pipeline
To be frank, we have not even seen the UI yet 🙂 . The main benefit of Kedro is in the "authoring of the pipeline". You can quickly browse the organization structure https://github.com/noklam/allegro_test/tree/main/src/allegro_test/pipelines/data_science and see for yourself (this pipeline has two nodes, train_model and predict).
You nicely described the features of ClearML, and that is why we are inclined to use it. It's just that we would like to use Kedro's structure with ClearML.
AgitatedDove14
That's definitely very easy. I'm still not sure how Kedro scales on clusters, though. From what I saw, and I might have missed it, it seems more like a single instance with sub-processes, but no real ability to set up a different environment for the different steps in the pipeline, is this correct?
Sub-processes is one option, but it supports much more: https://kedro.readthedocs.io/en/stable/10_deployment/01_deployment_guide.html one can containerise the whole pipeline and run it pretty much anywhere. So I don't think the single-instance view is up-to-date.
This actually ties well with the next version of pipelines we are working on. Basically, like kubeflow, you add a decorator to a function, making the function a step in the pipeline (and a Task in ClearML).
My thinking was to somehow separate short/simple steps (i.e. functions) from complicated steps (e.g. training with specific requirements).
Maybe Kedro can launch the "simple steps"? What do you think?
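To give a feel for the direction, something along these lines (the syntax is not final, just a sketch of the idea):
```python
# Sketch of the upcoming decorator-based pipelines -- the released API may differ
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["processed"])
def preprocess(raw):
    # a short/simple step: just a function
    return raw * 2

@PipelineDecorator.pipeline(name="demo", project="examples", version="0.1")
def run_pipeline(raw=1):
    processed = preprocess(raw)
    return processed
```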
I might be misunderstanding things. My thinking was that I could use one command to run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml, such that I could later go into the interface, clone any of the individual steps and run them again, completely independent of simple or hard steps. With another command I could also just pseudo-run the pipeline with kedro locally to register everything in clearml and then run it on a clearml agent. I thought that in both cases I would need to create a PipelineController Task at the end with the full pipeline included, so I could even just clone that one. The latter is not working yet, while the former (individual tasks) is already working, except for some python environment issues.
The other challenge I have come across is that using Task.init really only works if it is run in the script file itself, right? If I want to use a hook system (e.g. kedro provides hooks for running callbacks before and after nodes/tasks), I can create new tasks, but since Task.init() is not technically run in the script that contains the source code, the tracking is really challenging. Is there a way to use Task as a decorator on a function level?
All that said, I might be going too deep into how I want to integrate the two frameworks, in ways that are beyond the scope...
AgitatedDove14 , we did not do more experimentation on clearml/kedro and moved to Dagster... but I am still keeping an eye out for ClearML :)
Each of these steps ([2], [3], [4], [5 & 6]) can be thought of as independent Kedro nodes that can be reused in the future. Now, how to integrate this with ClearML is unclear to us.
The same can be said for ClearML: each of these steps is a clearml Task (with its own repo/environment)
It sounds (and I might be completely off here, so please feel free to correct me) like the main use for Kedro is the nice web UI of the pipeline (which I agree looks very cool).
The real power of clearml is the ability to very quickly generate those Tasks from code and to have the orchestration & scheduling, combined with the agent, actually run and monitor the jobs; these are features that do not exist in Kedro.
What am I missing here?
One example is a node that resizes the images: this node receives a Dataset as input, iterates over each image, resizes it, and outputs a new Dataset, which is used by the next node downstream in the pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
organization structure and see for yourself (this pipeline has two nodes, train_model and predict)
Interesting! let me dive into that and get back to you after I better understand the use case 🙂
I am going to be experimenting a bit as well, will get back on this topic in a couple of weeks 🙂
Hi JealousParrot68
clearml tracking of experiments run through kedro (similar to tracking with mlflow)
That's definitely very easy. I'm still not sure how Kedro scales on clusters, though. From what I saw, and I might have missed it, it seems more like a single instance with sub-processes, but no real ability to set up a different environment for the different steps in the pipeline, is this correct?
I think the challenge here is to pick the right abstraction matching. E.g. should a node in kedro (which usually is one function but can also be more involved) be equivalent to a task or should a pipeline be a task?
This actually ties well with the next version of pipelines we are working on. Basically, like kubeflow, you add a decorator to a function, making the function a step in the pipeline (and a Task in ClearML).
My thinking was to somehow separate short/simple steps (i.e. functions) from complicated steps (e.g. training with specific requirements).
Maybe Kedro can launch the "simple steps"? What do you think?
I am writing a small plugin for kedro/clearml atm that tries to link up kedro with clearml. Would be interesting to share experience and get input from the clearml people at some point.
YES! please share that sounds great!
Also, is it good practice to reuse task_ids when running the same job twice during debugging, or to always create a new one?
Hmm, good point. This is why you can configure the behavior in clearml.conf (or disable it altogether); currently we assume that if no artifacts/models were used and the last time you executed the Task was under 72h ago, the Task ID will be reused (assuming you are running from the same machine)
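If you want to control it per script rather than via clearml.conf, Task.init exposes a flag for it:
```python
from clearml import Task

# reuse_last_task_id=False forces a brand-new Task on every run,
# instead of the default reuse-within-72h behavior described above
task = Task.init(
    project_name="examples",
    task_name="debug run",
    reuse_last_task_id=False,
)
```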
Does that mean the entire pipeline will be running on the instance spinning up the container?
From here: this is what I understand:
Yes, I think that is the easiest case; however, I don't think it would be all that difficult to add metadata to the nodes that specifies what kind of queue or node each one should run on.
Yep, this is exactly what's coming in the next release of Pipelines (RC should be out in a week or so)
Well, if that is coming out soon, I'll wait with further development of the plugin, as anything now would probably be too hacky anyway.
Depends on what you want to do, what do you want to do ?
(Just a thought, maybe we just need to combine Kedro-Viz ?)
one can containerise the whole pipeline and run it pretty much anywhere.
Does that mean the entire pipeline will be running on the instance spinning up the container?
From here: this is what I understand:
https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
My thinking was I can use one command and run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml such that I could also then later go into the interface and clone any of the individual steps and run them again.
That is absolutely correct 🙂
With another command I could also just pseudo run the pipeline with kedro locally to register everything in clearml and then run it on a clearml agent.
Sure this will work
I thought that in both cases I would need to create a PipelineController Task at the end with the full pipeline included, so I could even just clone that one.
This is exactly how the pipeline is designed, and cloning and running the pipeline controller should work and launch the entire pipeline (usually the controller is executed on the services queue, and the pipeline Tasks are launched on a GPU or a CPU queue)
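For reference, a minimal controller sketch (step and task names are placeholders, and the exact arguments may vary between clearml versions):
```python
from clearml.automation import PipelineController

pipe = PipelineController(name="kedro-style pipeline", project="examples", version="0.1")
pipe.add_step(
    name="preprocess",
    base_task_project="examples",
    base_task_name="preprocess images",  # an existing Task that gets cloned as this step
)
pipe.add_step(
    name="train",
    parents=["preprocess"],
    base_task_project="examples",
    base_task_name="train model",
)
pipe.start(queue="services")  # the controller itself runs on the services queue
```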
If I want to use a hook system (e.g. kedro provides hooks for running callbacks before and after nodes/tasks)
Yes I'm with you I think this is the main challenge here.
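For what it's worth, a hook along these lines might work (completely untested sketch; kedro's hook specs accept a subset of the documented arguments):
```python
# Untested sketch: wrap every kedro node in its own ClearML Task via hooks
from clearml import Task
from kedro.framework.hooks import hook_impl

class ClearMLHooks:
    @hook_impl
    def before_node_run(self, node):
        # one Task per node; the previous one is closed in after_node_run
        self._task = Task.init(project_name="kedro-pipeline", task_name=node.name)

    @hook_impl
    def after_node_run(self, node):
        self._task.close()
```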
Is there a way to use Task as a decorator on a function level?
Yep, this is exactly what's coming in the next release of Pipelines (RC should be out in a week or so)
AgitatedDove14 , HollowKangaroo16 , have you two had any further success on the kedro/clearml front?
I have been looking into this as well. The impression I have so far is that clearml is similar to mlflow, just on steroids, because it provides additional capabilities around orchestration and experimentation.
AgitatedDove14
Kedro, in my opinion, is a really nice tool for keeping a clean code base when building complex data science projects (consisting of one or more pipelines). The UI is really secondary to the abstractions/separation of concerns they provide, which are the really powerful components in my opinion. From my point of view, kedro/clearml could be used together in several ways:
- clearml tracking of experiments run through kedro (similar to tracking with mlflow)
- clearml tracking and deployment of whole workflows designed in kedro
I think the challenge here is to pick the right abstraction matching. E.g. should a node in kedro (which usually is one function but can also be more involved) be equivalent to a task, or should a pipeline be a task?
Kedro projects/pipelines can already be deployed to argo workflows / airflow / databricks and some others for execution, so adding clearml would be really interesting.
I am writing a small plugin for kedro/clearml atm that tries to link up kedro with clearml. Would be interesting to share experience and get input from the clearml people at some point.
The really interesting things arise when you run part of the pipelines in kedro on a local machine or within a clearml agent and keep a good record of those.
Also, is it good practice to reuse task_ids when running the same job twice during debugging, or to always create a new one? A lot of questions 😉
If anyone is interested in exploring this more let me know!