AgitatedDove14
How do you recommend performing this task?
I mean, have a CI/CD (e.g. GitHub Actions) that updates my “production” pipeline on the ClearML UI, so a Data Scientist can start to experiment and create jobs from the UI.
Hi there,
PanickyMoth78
I am having the same issue.
Some steps of the pipeline create huge datasets (several GBs) that I don’t want to upload or save.
Wrapping the returns in a dict could be a solution, but honestly, I don’t like it.
AgitatedDove14 Is there a better way to avoid uploading some artifacts of the pipeline steps?
The image above shows an example of the first step of a training pipeline, which queries data from a feature store.
It gets the DataFrame, zips it and uploads it (this one i...
The transformation has some parameters that we change from time to time.
I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate.
that makes sense, so why don’t you point to the feature store?
I did, the first step of the pipeline queries the feature store. I mean, I set the data version as a parameter, then this step queries the data and returns it (to be used in the next step).
I see now.
I didn’t know that each step runs in a different process.
Thus, the return data from step 2 needs to be available somewhere to be used in step 3.
Pipelines run on the same machine.
We already have the feature store to save all the data, that’s why I don’t need to save it (just a reference to the dataset version).
I understand your point.
I can have different steps of the pipeline running on different machines. But this is not my use case.
this will cause them to get serialized to the local machine’s file system, wdyt?
I am concerned about the disk space usage that may increase over time.
I just prefer not to worry about that.
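To make it concrete, this is roughly what I mean by returning only a reference instead of the data itself (just a sketch; the component names and feature-store helpers are placeholders, not my real code):

```python
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(return_values=['dataset_version'])
def resolve_data(dataset_version: str = 'latest') -> str:
    # Only the version string is returned, so the artifact ClearML serializes
    # between steps is tiny instead of a multi-GB DataFrame.
    # In real code this would ask the feature store to resolve 'latest'.
    return dataset_version


@PipelineDecorator.component(return_values=['result'])
def train(dataset_version: str) -> dict:
    # The step re-queries the feature store itself using the reference,
    # e.g. df = query_feature_store(dataset_version)  # placeholder helper
    return {'dataset_version': dataset_version}


@PipelineDecorator.pipeline(name='training', project='kgraph', version='1.0')
def main(dataset_version: str = 'latest'):
    version = resolve_data(dataset_version=dataset_version)
    train(dataset_version=version)
```

This way each step still runs in its own process, but what gets passed between them is only the reference, not the dataset itself.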
This is not a valid parameter: https://clear.ml/docs/latest/docs/references/sdk/task#taskinit
Also, I did not find any usage example of the setup_upload method.
Thanks anyway
Found the issue.
For some reason, all parameters of the main function are passed as strings.
So I have these parameters:
@PipelineDecorator.pipeline(name='Build Embeddings', project='kgraph', version='1.3')
def main(tk_list=[], ngram_size=2):
    ...
The ngram_size variable is an int when using PipelineDecorator.debug_pipeline(),
and it is a string when using PipelineDecorator.run_locally().
I’ve added Python type hints and it fixed the issue:
` def main(tk_list:list = [], ngram...
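For reference, the full hinted signature (reconstructed from the parameters shown above, so treat it as an approximation of my actual code):

```python
@PipelineDecorator.pipeline(name='Build Embeddings', project='kgraph', version='1.3')
def main(tk_list: list = [], ngram_size: int = 2):
    # With explicit type hints, ngram_size arrives as an int under
    # PipelineDecorator.run_locally() as well, instead of being cast
    # to a string as it was without the hints.
    ...
```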
I don’t think so AgitatedDove14
I’ve tested with:
PipelineDecorator.debug_pipeline()
PipelineDecorator.run_locally()
Docker
I’ve got no error
I've built a container using the same image used by the agent.
Training ran with no errors
I've also tried with clearml-1.6.5rc2 and got the same error.
I am lost 😔
Hi there,
This is exactly what I want to do.
RoughTiger69
Have you been able to do it?
Hi MotionlessCoral18
Are you running the agent inside a container?
Would you mind sharing your Dockerfile?
SubstantialElk6
I only saw your comments today (I did not get notified for some reason).
Thanks for your suggestions.
Got it!
Thanks AgitatedDove14
Thanks Martin, your suggestion solves the problem.
👍
AgitatedDove14 is that the expected behavior for Pipelines?
So, how would wrapping the returns in a dict be a solution?
Will it serialize the data in the dict? (leading to the same result, data stored somewhere)
AgitatedDove14 , thanks for the quick answer.
I think this is the easiest way, basically the CI/CD launches a pipeline (which under the hood is another type of Task), by querying the latest “Published” pipeline that is also Not archived, then cloning+pushing it to execution queue
Do you have an example?
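In the meantime, here is my rough guess at what that would look like with the SDK (the project name, filter keys and queue name below are my assumptions, not something taken from the docs):

```python
from clearml import Task

# Find the latest published, non-archived pipeline controller task.
# 'kgraph/.pipelines' and the filter keys are guesses for illustration.
candidates = Task.get_tasks(
    project_name='kgraph/.pipelines',
    task_filter={
        'status': ['published'],
        'system_tags': ['-archived'],
        'order_by': ['-last_update'],
    },
)

if candidates:
    latest = candidates[0]
    # Clone it and push the clone to an execution queue (queue name assumed).
    cloned = Task.clone(source_task=latest, name='production pipeline run')
    Task.enqueue(cloned, queue_name='services')
```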
In the UI, when you want to “upgrade” the production pipeline you just right click “Publish” on the pipeline
I didn’t see this “publish” option for pipelines, just for models, is thi...
AgitatedDove14 Worked!
But a new error is raised:
` File "kgraph/pipelines/token_join/train/pipeline.py", line 48, in main
timestamp = pd.to_datetime(data_timestamp) if data_timestamp is not None else get_latest_version(feature_view_name)
File "/root/.clearml/venvs-builds/3.8/task_repository/Data-Science/kgraph/featurestore/query_data.py", line 77, in get_latest_version
fv = store.get_feature_view(fv_name)
File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/feast/u...
AgitatedDove14 Thanks for the explanation
I got it.
How can I use force_requirements_env_freeze with PipelineDecorator, as I do not have the Task object created?

@PipelineDecorator.pipeline(name='training', project='kgraph', version='1.2')
def main(feature_view_name, data_timestamp=None, tk_list=None):
    """Pipeline to train ...
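For what it’s worth, a minimal sketch of what I’m thinking of trying, assuming force_requirements_env_freeze is a Task class method that can be called before the pipeline controller Task exists (I haven’t verified this against the docs):

```python
from clearml import Task

# Assumption: calling the class method at module level, before the
# @PipelineDecorator.pipeline-decorated main() runs, makes the freeze
# apply to the controller task created later. The requirements file
# path is also just an assumption.
Task.force_requirements_env_freeze(force=True, requirements_file='requirements.txt')

# ...the decorated main() from above would follow here unchanged.
```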