++wythern++

X presents Y for a better Z

[Collection] Spark partition related things.

Partition:
Understanding:
1. https://www.dezyre.com/article/how-data-partitioning-in-spark-helps-achieve-more-parallelism/297
2. http://dev.sortable.com/spark-repartition/ -- example of partition & repartition to avoid data-imbalance.
3. https://acadgild.com/blog/partitioning-in-spark/ -- real case on existing partitioner & self-created partitioner.

Programming guidence.
Avoid using GroupByKey https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

Reference 1 says: Applying transformations that return RDDs with specific partitioners. Some operation on RDDs that hold to and propagate a partitioner are-

Join
LeftOuterJoin
RightOuterJoin
groupByKey
reduceByKey
foldByKey
sort
partitionBy
foldByKey

groupByKey is one of them, My understanding is such operations may cause extra shuffle, but repartition also helps relieve data imbalance if well considered, so use head please! :)

posted on 2017-05-18 14:29 wythern 阅读(135) 评论(0) 编辑收藏引用

只有注册用户登录后才能发表评论。
【推荐】100%开源！大型工业跨平台软件C++源码提供，建模，组态！



网站导航: 博客园 IT新闻 BlogJava 博问 Chat2DB 管理

++wythern++

导航

统计

常用链接

留言簿(5)

随笔档案

搜索

最新评论

阅读排行榜

评论排行榜

[Collection] Spark partition related things.