Using the PipelineController with add_function_step
Alternatively, it would be good to specify both some requirements and auto-detect 🤔
Well the individual tasks do not seem to have the expected environment.
I have no idea what’s the difference, but it does not log the internal repository 😞 If I knew why, I would be able to solve it myself… hehe
There's no decorator, just e.g.
def helper(foo: Optional[Any] = None):
return foo
def step_one(...):
# stuff
Then the type hints are not removed from helper and the code immediately crashes when being run
I’ve tracked it down further, it seems the pigar utility does not apply any smart logic there.
The case we have is the following -
- We have a monorepo, but all modules/libs share a common namespace
foo
; so e.g. working on modulemod
, we usefrom foo.mod import …
- This then looks for a module called
foo
, even though it’s just a namespace - In the dist-info requirement, it seems any hyphen, dot, etc are swapped for an underscore, so our site-packages represents this as
foo_mod-x.y.z-distinfo
- This ends showing the available package is
foo_mod
- Specifically since
foo
is not generated, it is assumed local and dropped 🤔
So from foo.mod import
"translates" to foo-mod @ git+
None ..
?
There's code that strips the type hints from the component function, just think it should be applied to the helper functions too :)
Yes. Though again, just highlighting the naming of foo-mod
is arbitrary. The actual module simply has a folder structured with an implicit namespace:
foo/
mod/
__init__.py
# stuff
FWIW, for the time being I’m just setting the packages to all the packages the pipeline tasks sees with:
packages = get_installed_pkgs_detail()
packages = [f"{name}=={version}" if version else name for name, version in packages.values()]
packages = task.data.script.requirements.get('pip', task.data.script.requirements.get('poetry')) or packages
print(f"Task requirements:\n{packages}")
tmp_requirements_file = "tmp_reqs.txt"
with open(tmp_requirements_file, "w") as f:
f.write("\n".join(packages) if isinstance(packages, list) else packages)
# ...
pipe.add_function_step(..., packages=tmp_requirements_file)
How can I ensure tasks in a pipeline have the same environment as the pipeline itself?
...
but the tasks (executed remotely) do not use that same environment?
Just verifying, we are talking about pipeline decorators?
We also wanted this, we preferred to create a docker image with all we need, and let the pipeline steps use that docker image
You can specify the docker on the decorator itself:
None
Regrading capturing the packages, if you import them inside the decorated package, they will be captured based on what is installed in the local (i.e. initial) environment. The idea is that the components are Not the same as the logic, basically the logic of the pipeline should not have any real package requirement, only the components (actually doing something), should. What am I missing ?
- This then looks for a module called
foo
, even though it’s just a namespaceI think this is the issue, are you using python package name spaces ?
(this is a PEP feature that is really rarely used, and I have seen break too many times)
Assuming you have fromfrom foo.mod import
what are you seeing in pip freeze ? I'd like to see if we can fix this, and better support namespaces
I’d like to refrain from manually specifying the dependencies, since it adds a lot of overhead to extend
I think this is the main issue, is this reproducible ? How can we test that?
We can change the project name’s of course, if there’s a suggestion/guide that will make them see past the namespace…
We’d be happy if ClearML captures that (since it uses e.g. pip, then we have the git + commit hash for reproducibility), as it claims it would 😅
Any thoughts CostlyOstrich36 ?
Still; anyone? 🥹 CostlyOstrich36 AgitatedDove14
PricklyRaven28 That would be my fallback, it would make development much slower (having to build containers with every small change)
How or why is this the issue? I great something is getting lost in translation :D
On the local machine, we have all the packages needed. The code gets sent for remote execution, and all the local packages are frozen correctly with pip.
The pipeline controller task is then generated and executed remotely, and it has all the relevant packages.
Each component it launches, however, is missing the internal packages available earlier :(
is this repo installed on the machine creating the pipeline ?
You can also manually add it here `packages={"link_to_internal_python_package",]
None
Then the type hints are not removed from helper and the code immediately crashes when being run
Oh yes I see your point, that does make sense (btw removing the type hints will solve the issue)
regardless let me make sure this is solved
not sure about this, we really like being in control of reproducibility and not depend on the invoking machine… maybe that’s not what you intend
And is this repo installed on the pipeline creating machine ?
Basically I'm asking how come it did not automatically detect it?
It is. In what format should I specify it? Would this enforce that package on various components? Would it then no longer capture import statements?
So a missing bit of information that I see I forgot to mention, is that we named our packages as foo-mod
in pyproject.toml
. That hyphen then get’s rewritten as foo_mod.x.y.z-distinfo
.
foo-mod @ git+
Exactly, it should have auto-detected the package.
… And it’s failing on typing hints for functions passed in pipe.add_function_step(…, helper_function=[…])
… I guess those aren’t being removed like the wrapped function step?
Hi UnevenDolphin73 , when you say pipeline itself you mean the controller? The controller is only in charge of handling the components. Lets say you have a pipeline with many parts. If you have a global environment then it will force a lot of redundant installations through the pipeline. What is your use case?