Spark 2.0 Notes
The first commit of Spark's code generation feature:
https://github.com/apache/spark/commit/3c0d2365d57fc49ac9bf0d7cc9bd2ef633fb5fb6
Core class: WholeStageCodegenExec.scala
Key comment in that class:
/**
 * WholeStageCodegen compiles a subtree of plans that support codegen together into single Java
 * function.
 *
 * Here is the call graph of to generate Java source (plan A supports codegen, but plan B does not):
 *
 *   WholeStageCodegen       Plan A               FakeInput        Plan B
 * =========================================================================
 *
 * -> execute()
 *  doExecute() --------->   inputRDDs() -------> inputRDDs() ------> execute()
 *  doConsume() <--------    consume()
 *
 * SparkPlan A should override doProduce() and doConsume().
 *
 * doCodeGen() will create a CodeGenContext, which will hold a list of variables for input,
 * used to generated code for [[BoundReference]].
 */
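The produce/consume handshake described in the comment can be illustrated with a toy model. This is only a minimal sketch under stated assumptions: the `Op` interface and the `Range`, `Filter`, and `Collect` classes are hypothetical stand-ins invented for illustration, not Spark's actual classes, and the generated text is a string rather than compiled Java. It shows the general idea that `produce()` calls travel down to the leaf, which emits the driving loop, while `consume()` calls travel back up so that each parent inlines its per-row logic into that loop.

```java
final class ProduceConsumeSketch {

    /** Hypothetical operator interface; a simplified stand-in, not Spark's API. */
    interface Op {
        String produce(Op parent);      // emit code that generates rows and feeds them to parent
        String consume(String rowVar);  // emit code that handles one row coming from the child
    }

    /** Leaf operator: emits the driving loop and asks its parent how to consume each value. */
    static final class Range implements Op {
        private final int end;
        Range(int end) { this.end = end; }
        public String produce(Op parent) {
            return "for (long i = 0; i < " + end + "; i++) {\n"
                 + indent(parent.consume("i"))
                 + "}\n";
        }
        public String consume(String rowVar) {
            throw new UnsupportedOperationException("a leaf has no child to consume from");
        }
    }

    /** Middle operator: forwards produce() down, wraps the parent's code in a condition. */
    static final class Filter implements Op {
        private final Op child;
        private final String predicate;  // "$" marks where the row variable goes
        private Op parent;
        Filter(Op child, String predicate) { this.child = child; this.predicate = predicate; }
        public String produce(Op parent) {
            this.parent = parent;
            return child.produce(this);
        }
        public String consume(String rowVar) {
            return "if (" + predicate.replace("$", rowVar) + ") {\n"
                 + indent(parent.consume(rowVar))
                 + "}\n";
        }
    }

    /** Root operator: plays the role of the stage root and emits the output step. */
    static final class Collect implements Op {
        private final Op child;
        Collect(Op child) { this.child = child; }
        public String produce(Op parent) { return child.produce(this); }
        public String consume(String rowVar) { return "append(" + rowVar + ");\n"; }
    }

    /** Indent every line of a code fragment by two spaces. */
    static String indent(String code) {
        StringBuilder sb = new StringBuilder();
        for (String line : code.split("\n")) {
            sb.append("  ").append(line).append('\n');
        }
        return sb.toString();
    }

    /** Build a three-operator plan and generate one fused loop for it. */
    static String generate() {
        Op plan = new Collect(new Filter(new Range(10), "$ % 2 == 0"));
        return plan.produce(null);
    }

    public static void main(String[] args) {
        System.out.print(generate());
    }
}
```

Running `main` prints a single loop with the filter condition and the output step nested inside it, which is the point of fusing the operators: no per-operator iterator calls remain between rows.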
This commit shows how the explain method marks which stages have been merged into a single stage:
https://github.com/apache/spark/commit/0e70fd61b4bc92bd744fc44dd3cbe91443207c72
This commit shows how to take full advantage of the SIMD optimizations that the JIT triggers automatically:
https://github.com/apache/spark/commit/fcb68e0f5d49234ac4527109887ff08cd4e1c29f
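As a general illustration of what "JIT-triggered SIMD" refers to, the sketch below shows the kind of loop shape that HotSpot's C2 compiler may auto-vectorize: a simple counted loop over primitive arrays with no boxing and no virtual calls in the body. The class and method names are hypothetical and the code is not taken from the commit above; whether vectorization actually happens depends on the JVM version, its flags, and the CPU.

```java
import java.util.Arrays;

final class VectorizableLoopSketch {

    /**
     * Element-wise addition of two primitive columns.
     * The loop has a simple induction variable and straight-line arithmetic on arrays,
     * which is the pattern a JIT compiler can turn into SIMD instructions.
     */
    static void addColumns(long[] a, long[] b, long[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        long[] a = {1, 2, 3, 4};
        long[] b = {10, 20, 30, 40};
        long[] out = new long[a.length];
        addColumns(a, b, out);
        System.out.println(Arrays.toString(out)); // prints [11, 22, 33, 44]
    }
}
```

The design point is that the programmer writes no explicit SIMD code; keeping the generated loop simple and free of per-row object overhead is what leaves room for the JIT to apply the optimization on its own.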
Important related reference:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-whole-stage-codegen.html
Copyright notice: this is an original article by InfoQ author [Clarke].
Original link: [http://xie.infoq.cn/article/61be7f36d190b67e1bfc21a93]. Please contact the author before reposting.