Answered
How To Upload PySpark DataFrames With ClearML As An Artifact?

Hi,

Do you know how to upload PySpark DataFrames with ClearML as an artifact?

For example, I have this code:

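from clearml import Task
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

task = Task.init(
	project_name="Try to upload pyspark df",
	task_name="ExampleTask",
)

df = spark.read.csv(path)  # path: location of the input CSV (not shown here)

task.upload_artifact("my_df_name", df)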

It doesn't work, but if I call toPandas() first, everything works fine:

pdf = df.toPandas()
task.upload_artifact("my_df_name", pdf)

The problem with this is that toPandas() is very slow and overloads RAM. We use PySpark for processing big data for exactly these reasons: it lets us parallelize the calculations and speed up the pipeline (or even make it feasible at all).
The inability to upload PySpark DataFrames makes ClearML unusable for larger data sets and real problems :(

Do you have any solution for that?

Note:
task.upload_artifact("my_df_name", df.coalesce(1)) also doesn't work

Thanks for the help!

  
  
Posted 3 months ago

Answers 3


Anyhow, there is a serialization_function argument you could use in upload_artifact. It may be that we don't serialize your artifact properly. You could use the argument to pass a callback that serializes the artifact efficiently. Note that getting the artifact back then requires a deserialization function.
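A minimal sketch of such a callback pair, under a few assumptions: Spark runs in local mode or writes to a path the driver can read, the zipped Parquet output fits in driver memory, and the helper names (dir_to_zip_bytes, zip_bytes_to_dir, serialize_spark_df, deserialize_spark_df) are hypothetical, not part of ClearML or PySpark. The idea is to let Spark write Parquet files itself and hand ClearML only the resulting bytes, so the DataFrame object is never pickled. This has not been verified against the exact error in this thread.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def dir_to_zip_bytes(directory: str) -> bytes:
    """Pack every file under `directory` into an in-memory zip archive."""
    buffer = io.BytesIO()
    root = Path(directory)
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(root.rglob("*")):
            if file.is_file():
                archive.write(file, file.relative_to(root).as_posix())
    return buffer.getvalue()


def zip_bytes_to_dir(data: bytes) -> str:
    """Unpack an archive produced above into a fresh temporary directory."""
    target = tempfile.mkdtemp(prefix="spark_artifact_")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        archive.extractall(target)
    return target


def serialize_spark_df(df) -> bytes:
    """Hypothetical callback: Spark writes Parquet, then the files are zipped."""
    with tempfile.TemporaryDirectory() as tmp:
        out = str(Path(tmp) / "data.parquet")
        # Assumption: this path is readable from the driver (e.g. local mode).
        df.write.mode("overwrite").parquet(out)
        return dir_to_zip_bytes(out)


def deserialize_spark_df(data: bytes):
    """Hypothetical callback: unzip, then let Spark read the Parquet files."""
    from pyspark.sql import SparkSession  # lazy import; needs pyspark installed

    spark = SparkSession.builder.getOrCreate()
    return spark.read.parquet(zip_bytes_to_dir(data))


# Usage sketch (not executed here; requires clearml and pyspark):
# task.upload_artifact("my_df_name", df, serialization_function=serialize_spark_df)
# restored = task.artifacts["my_df_name"].get(
#     deserialization_function=deserialize_spark_df)
```

The data still passes through the driver as a single bytes object, so for very large DataFrames it may be more practical to write the Parquet output to shared storage and track only its location.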

  
  
Posted 3 months ago

Hi @BeefyFrog17! Are you getting any exception trace when you try to upload your artifact?

  
  
Posted 3 months ago

Hi @SmugDolphin23! Thanks for the answer. This is my error:

RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Can you show me an example of your solution?

  
  
Posted 3 months ago