Spark - Why is it necessary to collect() to the driver node before printing an RDD? Can it not be done in parallel?






I was reading about how to print RDDs in Spark (I'm using Java), and it seems most people collect() (if the RDD is small enough) and then call something like foreach(println) on the result. Is it not possible to print in parallel? Why do we have to collect the data onto the driver node before printing?





I was thinking maybe it's because we can't use System.out in parallel, but I suspect that's not it. Furthermore, I'm not quite sure how one would even distribute the data and print it in parallel, in terms of code. One approach I was considering is a mapPartitions that does nothing useful in terms of mapping, but iterates through each partition and prints its contents.
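For what "printing in parallel" would actually look like, here is a minimal plain-Java sketch (no Spark dependency; the class name, partition count, and data are all made up for illustration). Each "partition" is printed by its own thread, the way executors would print their own partitions if you used something like rdd.foreachPartition in Spark, where the output goes to each executor's stdout rather than the driver console. The point it demonstrates: System.out works fine from multiple threads, but lines from different partitions interleave nondeterministically.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo: print "partitions" of a dataset in parallel,
// one thread per partition, mimicking executor-side printing.
public class ParallelPrintDemo {

    // Split data into contiguous chunks, loosely analogous to RDD partitions.
    static List<List<Integer>> partition(List<Integer> data, int numPartitions) {
        List<List<Integer>> parts = new ArrayList<>();
        int chunk = (data.size() + numPartitions - 1) / numPartitions;
        for (int i = 0; i < data.size(); i += chunk) {
            parts.add(data.subList(i, Math.min(i + chunk, data.size())));
        }
        return parts;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 12; i++) data.add(i);

        List<List<Integer>> parts = partition(data, 4);
        List<Thread> threads = new ArrayList<>();
        for (int p = 0; p < parts.size(); p++) {
            final int id = p;
            final List<Integer> part = parts.get(p);
            Thread t = new Thread(() -> {
                // Each "executor" prints its own partition; ordering
                // across partitions is not deterministic.
                for (Integer x : part) {
                    System.out.println("partition " + id + ": " + x);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) t.join();
    }
}
```

Note this is exactly why collect() is the usual answer when you want readable, ordered output on your own console: in a real cluster the parallel println calls happen on remote executor JVMs, so the driver never sees them at all.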









