Partition:
Understanding:
1. https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
2. http://dev.sortable.com/spark-repartition/ -- example of partition & repartition to avoid data-imbalance.
3. https://acadgild.com/blog/partitioning-in-spark/ -- real case on existing partitioner & self-created partitioner.
Programming guidence.
Avoid using GroupByKey https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html
Reference 1 says: Applying transformations that return RDDs with specific partitioners. Some operation on RDDs that hold to and propagate a partitioner are-
- Join
- LeftOuterJoin
- RightOuterJoin
- groupByKey
- reduceByKey
- foldByKey
- sort
- partitionBy
- foldByKey
groupByKey is one of them, My understanding is such operations may cause extra shuffle, but repartition also helps relieve data imbalance if well considered, so use head please! :)