Hi @<1523701083040387072:profile|UnevenDolphin73> , can you please elaborate?
I'm trying to build an easy SDK that would fit DS work and fit the concept of clearml pipelines.
In doing so, I'm planning to define various Step
classes, that the user can then experiment with, providing Steps as input to other steps, etc.
Then I'd like for the user to be able to run any such step, either locally or remotely. Locally is trivial. Remotely is the issue. I understand I'll need to upload additional data to the remote instance, and pull a specific artifact back to the notebook during this execution.
What I'm not sure is how can I run a specific function/class remotely within a notebook?
Any thoughts @<1523701070390366208:profile|CostlyOstrich36> ?
I wouldn’t want to run the entire notebook, just a specific part of it.
No that does not seem to work, I get
task.execute_remotely(queue_name="default")
2024-01-24 11:28:23,894 - clearml - WARNING - Calling task.execute_remotely is only supported on main Task (created with Task.init)
Defaulting to self.enqueue(queue_name=default)
Any follow-up thoughts, @<1523701070390366208:profile|CostlyOstrich36> , or maybe @<1523701087100473344:profile|SuccessfulKoala55> ? 🤔
@<1523701083040387072:profile|UnevenDolphin73> how do you compose this chink of code? Any examples?
I can elaborate in more detail if you have the time, but generally the code is just defined in some source files.
I’ve been trying to play around with pipelines for this purpose, but as suspected, it fails finding the definition for the pickled object…
Consider e.g:
# steps.py
class DataFetchingStep:
def __init__(self, source, query, locations, timestamps):
# ...
def run(self, queue=None, **kwargs):
# ...
class DataTransformationStep:
def __init__(self, inputs, transformations):
# inputs can include instances of DataFetchingStep, or local files, for example
# ...
def run(self, queue=None, **kwargs):
# ...
And then the following SDK usage in a notebook:
from steps import DataFetchingStep, DataTransformationStep
data_step = DataFetchingStep(source='some_db', query='...', locations='...', timestamps='...')
# Run options:
data_step.run(queue=None) # Runs locally, hurray
data_step.run(queue='default') # Runs remotely, polls until the Task is done, stores the saved artifact somewhere locally in the class
transform_step = DataTransformationStep(inputs=[data_step, "my_data.csv"], transformations=[func1, func2, func3])
transform_step.run() # Runs locally; if `data_step` was not run before, it will run now; will also poll until the Task is done, etc...
Then I wonder:
- How to achieve this? The pipeline controller seems to only work with functions, not classes, so running smaller steps remotely seems more difficult then I imagined. I was already prepared to upload artifacts myself etc, but now I’m not sure?
- Do I really need to recreate the pipeline everytime from scratch? Or can I remove/edit steps? It’s mostly used as a… controller for notebook-based executions and experimentations, before the actual pipeline is known. That is, it will allow experimenting with building the desired pipeline and its steps. Ideally all these
Step
classes would result in aTask
which is much easier to handle and/or navigate.
Hey @<1523701083040387072:profile|UnevenDolphin73> what you're building here sounds like a useful tool. Let me understand what you're trying to achieve here, please correct me if I'm wrong:
- You want to create a set of
Step
classes with which you can define pipelines, that will be executed either locally or remotely. - The pipeline execution is triggered from a notebook.
- The
steps
are predefined transformations, the user normally won't have to create their own steps
Did I get all of this right?
As for your two questions
- the pipeline controller expects either functions (
.add_function_step
or via@PipelineDecorator.component
) or already executed tasks (.add_step
), not classes. - You can ofc edit the steps, but by running the pipeline you will recreate it. Maybe it will help if you define what does "recreating a pipeline" mean to you and why is it a problem
Hey @<1537605940121964544:profile|EnthusiasticShrimp49> ! You’re mostly correct. The Step
classes will be predefined (of course developers are encouraged to add/modify as needed), but as in the DataTransformationStep
, there may be user-defined functions specified. That’s not a problem though, I can provide these functions with the helper_functions
argument.
- The
.add_function_step
is indeed a failing point. I can’t really create a task from the notebook because callingTask.init(…)
and thentask.execute_remotely()
closes the notebook process.- In theory it would be nice to even just create a “Draft Task” and copy that, but that, too, requires an entry point, the code to be prepared, etc. It does not take it from a class, and I cannot simply create it from within a notebook.- I was just thinking about the clutter it will create. Such experimentations can easily create many (hundreds?) of pipeline runs. Some examples would be:- We have a DAG of “A -> B”. The user wants to run Step A first, so they can examine the output in the notebook. It creates a pipeline with a single step, and the results are polled in the notebook. After they examined it, they now want to run Step B. I can add Step B to the previous pipeline, and let it run (assuming the cache would work), but I’d rather keep the logic internal and basically only run Step B. That entails recreating “the notebook pipeline” with only a single step, B. - Same DAG, but after examining results from A, the user may change that step. I cannot rely on ClearML’s caching, because this definition might be out-of-scope for the caching (i.e. a user may have installed a package in doing so). I’d like to recreate the pipeline so that both A and B are run.
- Overall, this is not an issue, just a clutter problem, that can be resolved by automatically nesting pipelines in one location and archiving as needed.
Closing thoughts, is that the ideal case would be just creating a task. I don’t need the pipeline, really, I just need to be able to create a task from a class function, define inputs/outputs, and execute it remotely. That’s why I originally triedtask.create_function_task
, but it cannot be executed remotely and/or it will close the running process…
I guess in theory I could write a run_step.py
, similarly to how the pipeline in ClearML works… 🤔 And then use Task.create()
etc?
Ah, I think I understand. To execute a pipeline remotely you need to use None pipe.start()
not task.execute_remotely
. Do note that you can run tasks remotely without exiting the current process/closing the notebook, (see here the exit_process
argument None ) but you won't be able to return any values from this task.
Given my understanding, I would actually recommend you the following:
- For each remote run, create a new
task
from a function like this: None - Enqueue the created
task
: None this will not exit the current process - To be able to retrieve results from the remotely executed
task
, you will have to manually save artifacts in the function that the remotetask
is executing. Because you have thetask
object in your notebook process, you can then retrieve these artifacts upon completion None
Thanks @<1537605940121964544:profile|EnthusiasticShrimp49> ! That’s definitely the route I was hoping to go, but the create_function_task
is still a bit of a mystery, as I’d like to use an entire class with relevant logic and proper serialization for inputs, and potentially I’ll need to add more “helper functions” (as in the case of DataTransformationStep
, for example). Any thoughts on that? 🤔
I’ll give the create_function_task
one more try 🤔
I think also the script path in the created task will cause some issues, but let’s see…
I ’ m afraid serializing an entire class won’t be possible , but create_function_task
will send the entire environment for remote execution , so you can still access your code
Interesting, why won’t it be possible? Quite easy to get the source code using e.g. dill
.
At any case @<1537605940121964544:profile|EnthusiasticShrimp49> this seems like a good approach, but it’s not quite there yet. For example, even if I’d provide a simple def run_step(…)
function, I’d still need to pass the instance to the function. Passing it along in the kwargs
for create_function_task
does not seem to work, so now I need to also upload the inputs, etc — I’m bringing this up because the pipelines do already do this for you.
So maybe summarizing (sorry for the spam):
- Pipelines:- Pros: Automatic upload and serialization of input arguments
- Cons: Clutter, does not support classes, cannot inject code, does not recognize environment when run from e.g. IPython- Tasks:- Pros: Tidier and matches original idea, recognizes environment even when run from IPython
- Cons: Does not support classes, cannot inject code, does not automatically upload input arguments
More experiments @<1537605940121964544:profile|EnthusiasticShrimp49> - the core issue with the create_function_step
seems to be that the chosen executable will be e.g. IPython
or some notebook, and not e.g. python3.10
, so it fails running it as a task… 🤔
Hey @<1523701083040387072:profile|UnevenDolphin73> , sorry for late reply, I’m investigating now the issue that you mentioned that running a remote task with create_function_task
fails. I can’t quite reproduce it, can you please provide a complete runnable code snippet that fails like you just described
No worries @<1537605940121964544:profile|EnthusiasticShrimp49> ! I made some headway by using Task.create
, writing a temporary Python script, and using task.update
in a similar way to how pipeline steps are created.
I'll try and create an MVC to reproduce the issue, though I may have strayed from your original suggestion because I need to be able to use classes and not just functions.
@<1537605940121964544:profile|EnthusiasticShrimp49> It’ll take me still some time to find the MVC that generated this, but I do have the ClearML experiment page for it. I was running the thing from ipython
, and was trying to create a task from a function: