Answered
Hi!
I am trying to build and run a pipeline. I pass my dataset as a parameter of the pipeline:

pipe.add_parameter(name='dataset_df',
                   description='Initial dataset .parquet file',
                   default=dataset_df,
                   param_type="pd.DataFrame")

Then I reference this parameter and try to use it in the first step of the pipeline, which is based on another, previously generated task (one that was created with an empty dataset):

pipe.add_step(
    name=f'{step_one_name}_{n_predict}',
    base_task_id=base_task.task_id,
    execution_queue=pipeline_steps_execution_queue,
    cache_executed_step=cache,
    parameter_override={"General/dataset_df": "${pipeline.dataset_df}",
                        "General/n_predict": n_predict,
                        "General/period_size": "${pipeline.period_size}",
                        "General/preprocessing_kwargs_params": "${pipeline.preprocessing_kwargs_params}"})

But I receive an error stating that my dataset is empty, although it is not. I guess ClearML doesn't use my dataset in the task and doesn't override the parameter.

Could you please give me any ideas on how to pass my dataset into the task properly?

  
  
Posted one year ago

Answers 3


Thank you, guys. I've figured out the solution with your help! @AgitatedDove14 @EnthusiasticShrimp49

  
  
Posted one year ago

I pass my dataset as a parameter of the pipeline:

@MysteriousWalrus11 I think you were expecting the dataset_df dataframe to be automatically serialized and passed, is that correct?
If you are using add_step, all arguments must be simple types (i.e. str, int, etc.).
If you want to pass complex types, your code should upload them as artifacts, and then you can pass the artifact URL (or name) to the next step.
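
Roughly something like this (a sketch only — the General/dataset_task_id and General/dataset_artifact_name parameter names are made up for illustration, and I'm assuming the controller script runs as a ClearML task of its own):

from clearml import Task

# Controller side: upload the dataframe as an artifact instead of a parameter
controller_task = Task.current_task()
controller_task.upload_artifact(name='dataset_df', artifact_object=dataset_df)

# Pass only simple types to the step: the producing task's id and the artifact name
pipe.add_step(
    name=f'{step_one_name}_{n_predict}',
    base_task_id=base_task.task_id,
    parameter_override={"General/dataset_task_id": controller_task.id,
                        "General/dataset_artifact_name": "dataset_df"})

# Step side: fetch the artifact back into a dataframe
# (dataset_task_id / dataset_artifact_name come from the step's connected parameters)
source_task = Task.get_task(task_id=dataset_task_id)
dataset_df = source_task.artifacts[dataset_artifact_name].get()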

Another option is to use a pipeline from decorators, where the data is passed transparently between the components (as you would expect from Python code).
Check this example: the pipeline_from_decorator.py script in the clearml examples.
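
The decorator version would look roughly like this (the component, pipeline, and parameter names here are just placeholders):

from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=['dataset_df'])
def load_dataset(dataset_path: str):
    # imports live inside the component, since each component runs as a standalone task
    import pandas as pd
    return pd.read_parquet(dataset_path)

@PipelineDecorator.component(return_values=['predictions'])
def predict_step(dataset_df, n_predict: int):
    # dataset_df arrives as a real DataFrame; ClearML serializes it between steps
    return dataset_df.head(n_predict)

@PipelineDecorator.pipeline(name='df_pipeline', project='examples', version='0.1')
def run_pipeline(dataset_path: str, n_predict: int):
    df = load_dataset(dataset_path)
    return predict_step(df, n_predict)

if __name__ == '__main__':
    PipelineDecorator.run_locally()  # or set an execution queue for remote runs
    run_pipeline(dataset_path='data/dataset.parquet', n_predict=10)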

  
  
Posted one year ago

Hey @MysteriousWalrus11, given your use case, did you consider passing the path to the dataset instead? For example, an address to an S3 bucket.
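
Something along these lines (the bucket URL and the dataset_path parameter name are placeholders, just to show the idea):

# Controller: pass the dataset location as a plain string
pipe.add_parameter(name='dataset_path',
                   description='Initial dataset .parquet file location',
                   default='s3://my-bucket/datasets/dataset.parquet')

pipe.add_step(
    name=f'{step_one_name}_{n_predict}',
    base_task_id=base_task.task_id,
    parameter_override={"General/dataset_path": "${pipeline.dataset_path}"})

# Inside the step: resolve the remote path to a local copy and load the dataframe
import pandas as pd
from clearml import StorageManager

local_copy = StorageManager.get_local_copy(remote_url=dataset_path)
dataset_df = pd.read_parquet(local_copy)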

  
  
Posted one year ago