A Lab Geek's Own Little Plot


On When the map/reduce Combiner Runs

Posted on 2012-11-06 23:52 by whspecial. Category: hadoop

When exactly does the map/reduce combiner run?

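Most material online says that the combiner runs on the map side: it is applied to the map output, and its result is then passed on to the reducer. A problem I ran into at work, however, showed me that the combiner can also run on the reducer side, and that it can run more than once.
After some searching I found that this is a feature introduced in hadoop-0.18: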
"Changed policy for running combiner. The combiner may be run multiple times as the map's output is sorted and merged. Additionally, it may be run on the reduce side as data is merged. The old semantics are available in Hadoop 0.18 if the user calls: job.setCombineOnlyOnce(true)"
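Since the combiner may run zero, one, or several times, its output has to be something it can consume again without changing the final result. Here is a minimal sketch of that point in plain Java, with no Hadoop dependency. The class and method names are hypothetical, made up for illustration. It compares a sum combiner, which is safe to re-apply, with a naive averaging combiner, which is not:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; this is not Hadoop code.
final class CombinerRerunSketch {

    // Sum "combiner": collapses the values of one key into a single partial sum.
    static List<Long> sumCombine(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return Collections.singletonList(total);
    }

    static double mean(List<Double> values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total / values.size();
    }

    // Naive average "combiner": collapses values into their mean and loses the count.
    static List<Double> avgCombine(List<Double> values) {
        return Collections.singletonList(mean(values));
    }

    // Final sum for one key; the combiner is optionally applied per spill and again at merge.
    static long sumPipeline(List<List<Long>> spills, boolean combinePerSpill, boolean combineAtMerge) {
        List<Long> merged = new ArrayList<>();
        for (List<Long> spill : spills) {
            merged.addAll(combinePerSpill ? sumCombine(spill) : spill);
        }
        if (combineAtMerge) {
            merged = sumCombine(merged);
        }
        long result = 0;
        for (long v : merged) {
            result += v;
        }
        return result;
    }

    // Final average for one key; the naive combiner is optionally applied per spill.
    static double avgPipeline(List<List<Double>> spills, boolean combinePerSpill) {
        List<Double> merged = new ArrayList<>();
        for (List<Double> spill : spills) {
            merged.addAll(combinePerSpill ? avgCombine(spill) : spill);
        }
        return mean(merged);
    }

    public static void main(String[] args) {
        List<List<Long>> longs = Arrays.asList(Arrays.asList(1L, 2L, 3L), Arrays.asList(10L));
        System.out.println(sumPipeline(longs, false, false)); // 16
        System.out.println(sumPipeline(longs, true, true));   // 16, same answer

        List<List<Double>> doubles = Arrays.asList(Arrays.asList(1.0, 2.0, 3.0), Arrays.asList(10.0));
        System.out.println(avgPipeline(doubles, false)); // 4.0, the true mean
        System.out.println(avgPipeline(doubles, true));  // 6.0, wrong once combined
    }
}
```

The usual way around the averaging problem is to have the combiner emit (sum, count) pairs and divide only in the reducer.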
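In fact, the combiner runs on both the mapper side and the reducer side. Reading the code, combining happens at the following points:
1) On the mapper side, during the spill phase: when the records in the buffer exceed the threshold, they are combined.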

    if (spstart != spindex) {
        …
        combineAndSpill(kvIter, combineInputCounter);
    }
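The spill-time behaviour in 1) can be mimicked with a toy buffer. The sketch below is plain Java, not Hadoop's actual map output buffer. The class name, the `spillThreshold` record count (an assumption standing in for the real buffer accounting), and the word-count style sum combiner are all hypothetical. Records accumulate in memory; once their number exceeds the threshold they are sorted by key, combined, and written out as one spill.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Hadoop code.
final class SpillCombineSketch {
    private final int spillThreshold;
    private final List<Map.Entry<String, Long>> buffer = new ArrayList<>();
    private final List<TreeMap<String, Long>> spills = new ArrayList<>();

    SpillCombineSketch(int spillThreshold) {
        this.spillThreshold = spillThreshold;
    }

    // Buffer one map output record; spill when the buffer exceeds the threshold.
    void collect(String key, long value) {
        buffer.add(new AbstractMap.SimpleEntry<String, Long>(key, value));
        if (buffer.size() > spillThreshold) {
            sortCombineAndSpill();
        }
    }

    // Spill whatever is left when the map task finishes.
    void flush() {
        if (!buffer.isEmpty()) {
            sortCombineAndSpill();
        }
    }

    // The TreeMap keeps keys sorted; merge() with Long::sum plays the combiner.
    private void sortCombineAndSpill() {
        TreeMap<String, Long> combined = new TreeMap<>();
        for (Map.Entry<String, Long> record : buffer) {
            combined.merge(record.getKey(), record.getValue(), Long::sum);
        }
        spills.add(combined);
        buffer.clear();
    }

    List<TreeMap<String, Long>> spills() {
        return spills;
    }

    public static void main(String[] args) {
        SpillCombineSketch task = new SpillCombineSketch(3);
        for (String word : new String[] {"a", "b", "a", "a", "b", "a"}) {
            task.collect(word, 1);
        }
        task.flush();
        System.out.println(task.spills());
    }
}
```

Each spill is combined on its own, so the same key can still appear in several spills. That is why a later merge step has something left to combine.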

2) On the mapper side, during the merge phase: when the number of spill files being merged is >= 3, they are combined.

    if (null == combinerClass || numSpills < minSpillsForCombine) {
        Merger.writeFile(kvIter, writer, reporter);
    } else {
        combineCollector.setWriter(writer);
        combineAndSpill(kvIter, combineInputCounter);
    }
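The branch in 2) can be reproduced in isolation. What follows is a toy in plain Java, assuming a sum combiner and representing each spill as a sorted map. `mergeSpills`, `spill`, and the string record format are hypothetical; only the condition mirrors the excerpt above. With fewer than `minSpillsForCombine` spills the merged records are written as they are, with repeated keys kept; otherwise they are combined first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; this is not Hadoop code.
final class MergeCombineSketch {

    // Merge sorted spills into one key-ordered list of "key=value" records.
    static List<String> mergeSpills(List<TreeMap<String, Long>> spills,
                                    boolean hasCombiner,
                                    int minSpillsForCombine) {
        // Group values by key across all spills (a stand-in for a k-way merge).
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (TreeMap<String, Long> spill : spills) {
            for (Map.Entry<String, Long> e : spill.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }

        // Same shape as the condition in the excerpt: no combiner, or too few spills.
        boolean skipCombine = !hasCombiner || spills.size() < minSpillsForCombine;

        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            if (skipCombine) {
                // Plain merge: every record is written unchanged.
                for (long v : e.getValue()) {
                    out.add(e.getKey() + "=" + v);
                }
            } else {
                // Combine: one summed record per key.
                long sum = 0;
                for (long v : e.getValue()) {
                    sum += v;
                }
                out.add(e.getKey() + "=" + sum);
            }
        }
        return out;
    }

    // Helper building a spill that holds counts for the keys "a" and "b".
    static TreeMap<String, Long> spill(long aCount, long bCount) {
        TreeMap<String, Long> s = new TreeMap<>();
        s.put("a", aCount);
        s.put("b", bCount);
        return s;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Long>> spills = new ArrayList<>();
        spills.add(spill(1, 2));
        spills.add(spill(3, 4));
        System.out.println(mergeSpills(spills, true, 3)); // two spills: not combined
        spills.add(spill(5, 6));
        System.out.println(mergeSpills(spills, true, 3)); // three spills: combined
    }
}
```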

3) On the reducer side, the combiner is always run.
