Spark - Why is it necessary to collect() to the driver node before printing an RDD? Can it not be done in parallel?






I was reading about how to print RDDs in Spark (I'm using Java), and it seems most people collect() (if the RDD is small enough) and then call something like foreach(println) on the result. Is it not possible to print in parallel? Why do we have to collect the data onto the driver node before printing?





I was thinking maybe it's because we can't use System.out in parallel, but I suspect that's not it. Furthermore, I'm not quite sure how one would even distribute the data and print it in parallel, in terms of code. One approach I was considering is a mapPartitions that does nothing useful in terms of mapping, but iterates through each partition and prints its contents.
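For what "printing in parallel" would actually look like, here is a minimal plain-Java sketch (no Spark dependency; the class name, partition count, and data are all made up for illustration). Each "partition" is printed by its own thread, the way executors would print their own partitions if you used something like rdd.foreachPartition in Spark, where the output goes to each executor's stdout rather than the driver console. The point it demonstrates: System.out works fine from multiple threads, but lines from different partitions interleave nondeterministically.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: print "partitions" of a dataset in parallel,
// one thread per partition, mimicking executor-side printing.
public class ParallelPrintDemo {

    // Split data into contiguous chunks, loosely analogous to RDD partitions.
    static List<List<Integer>> partition(List<Integer> data, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        for (int i = 0; i < data.size(); i += chunk) {
            parts.add(data.subList(i, Math.min(i + chunk, data.size())));
        }
        return parts;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 12; i++) data.add(i);

        List<List<Integer>> parts = partition(data, 4);
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < parts.size(); p++) {
            final int id = p;
            final List<Integer> part = parts.get(p);
            Thread t = new Thread(() -> {
                // Each "executor" prints its own partition; ordering
                // across partitions is not deterministic.
                for (Integer x : part) {
                    System.out.println("partition " + id + ": " + x);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
    }
}
```

Note this is exactly why collect() is the usual answer when you want readable, ordered output on your own console: in a real cluster the parallel println calls happen on remote executor JVMs, so the driver never sees them at all.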









