Answered
Hi, Do You Know How To Upload Pyspark Dataframes With Clearml As Artifact? For Example, I Have Code:

Hi,

do you know how to upload PySpark DataFrames with ClearML as an artifact?

For example, I have this code:

from clearml import Task
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

task = Task.init(
    project_name="Try to upload pyspark df",
    task_name="ExampleTask",
)

df = spark.read.csv(path)

task.upload_artifact("my_df_name", df)

It doesn't work, but if I call toPandas() first, everything works fine:

pdf = df.toPandas()
task.upload_artifact("my_df_name", pdf)

The problem with this is that toPandas() is very slow and overloads RAM. We use PySpark for processing big data for exactly these reasons: it lets us parallelize the calculations and speed up the pipeline (or make it computable at all).
Not being able to upload PySpark DataFrames makes ClearML unusable for larger data sets and real problems :(

Do you have any solution for that?

Note:
task.upload_artifact("my_df_name", df.coalesce(1)) also dosen't work

Thanks for the help!

  
  
Posted 10 months ago

3 Answers


Anyhow, there is a serialization_function argument you can use in upload_artifact. I imagine we don't properly serialize your artifacts by default. You could pass that argument a callback that efficiently serializes the artifact. Note that getting the artifact back then requires a matching deserialization function.
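
For illustration, here is a rough sketch of what such a callback could look like for a Spark DataFrame. This is only one possible approach, assuming a recent ClearML version that supports the serialization_function argument; the spark_df_to_bytes helper and the write-to-Parquet-then-zip strategy are just examples, not an official recipe:

import os
import shutil
import tempfile

from clearml import Task
from pyspark.sql import SparkSession


def spark_df_to_bytes(df):
    """Serialize a Spark DataFrame to bytes by writing it to Parquet and zipping the result."""
    tmp_dir = tempfile.mkdtemp()
    try:
        parquet_dir = os.path.join(tmp_dir, "df.parquet")
        # This runs as a normal Spark job driven from the driver,
        # so no SparkContext is referenced inside worker code.
        df.write.parquet(parquet_dir)
        archive_path = shutil.make_archive(os.path.join(tmp_dir, "df"), "zip", parquet_dir)
        with open(archive_path, "rb") as f:
            return f.read()
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)


spark = SparkSession.builder.getOrCreate()
task = Task.init(project_name="Try to upload pyspark df", task_name="ExampleTask")

df = spark.read.csv(path)

task.upload_artifact(
    "my_df_name",
    df,
    serialization_function=spark_df_to_bytes,
)

To read the artifact back you would pass a matching deserialization function (e.g. one that unzips the archive and loads the Parquet files with Spark) when retrieving the artifact; the exact argument name and availability depend on your ClearML version.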

  
  
Posted 10 months ago

Hi @<1547752791546531840:profile|BeefyFrog17> ! Are you getting any exception trace when you are trying to upload your artifact?

  
  
Posted 10 months ago

Hi @<1523701435869433856:profile|SmugDolphin23>! Thanks for the answer. This is my error:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Can you show me an example of your solution?

  
  
Posted 10 months ago