Spark 2.0 Notes

Author: Clarke

November 26, 2021

The first commit of Spark's code generation feature:

https://github.com/apache/spark/commit/3c0d2365d57fc49ac9bf0d7cc9bd2ef633fb5fb6

Core class: WholeStageCodegenExec.scala

The key comment in that file:

/**
 * WholeStageCodegen compiles a subtree of plans that support codegen together into single Java
 * function.
 *
 * Here is the call graph of to generate Java source (plan A supports codegen, but plan B does not):
 *
 *   WholeStageCodegen       Plan A               FakeInput        Plan B
 * =========================================================================
 *
 * -> execute()
 *  doExecute() --------->   inputRDDs() -------> inputRDDs() ------> execute()
 *                           doConsume() <-------- consume()
 *
 * SparkPlan A should override doProduce() and doConsume().
 *
 * doCodeGen() will create a CodeGenContext, which will hold a list of variables for input,
 * used to generated code for [[BoundReference]].
 */
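The produce/consume handshake named in this comment can be sketched with a toy model. This is only an illustration under stated assumptions: ToyPlan, ToyRange, ToyFilter and ToyStage are hypothetical names, not Spark classes, and the inputRDDs() side of the call graph (the FakeInput and Plan B columns) is left out. Each operator contributes a fragment of Java source, and the root stitches the fragments into one function.

```java
import java.util.function.UnaryOperator;

// Toy model of the produce()/consume() protocol. All class names are
// hypothetical and much simpler than Spark's real codegen support.
abstract class ToyPlan {
    private ToyPlan parent; // recorded by produce(), used by consume()

    // Ask this operator for the source code that produces its rows.
    final String produce(ToyPlan parent) {
        this.parent = parent;
        return doProduce();
    }

    // Hand one row, held in the Java variable `var`, to the parent operator.
    final String consume(String var) {
        return parent.doConsume(var);
    }

    abstract String doProduce();
    abstract String doConsume(String var);
}

// Leaf operator: emits the loop that drives the whole stage.
final class ToyRange extends ToyPlan {
    private final long n;
    ToyRange(long n) { this.n = n; }

    @Override String doProduce() {
        return "for (long i = 0; i < " + n + "; i++) { " + consume("i") + " }";
    }
    @Override String doConsume(String var) {
        throw new UnsupportedOperationException("a leaf has no child to consume from");
    }
}

// Middle operator: produces nothing itself, wraps the parent's code in an if.
final class ToyFilter extends ToyPlan {
    private final ToyPlan child;
    private final UnaryOperator<String> predicate; // row variable -> Java expression

    ToyFilter(ToyPlan child, UnaryOperator<String> predicate) {
        this.child = child;
        this.predicate = predicate;
    }
    @Override String doProduce() { return child.produce(this); }
    @Override String doConsume(String var) {
        return "if (" + predicate.apply(var) + ") { " + consume(var) + " }";
    }
}

// Root of the fused subtree: plays the role of the WholeStageCodegen column.
final class ToyStage extends ToyPlan {
    private final ToyPlan child;
    ToyStage(ToyPlan child) { this.child = child; }

    String doCodeGen() {
        return "long run() { long sum = 0; " + child.produce(this) + " return sum; }";
    }
    @Override String doProduce() {
        throw new UnsupportedOperationException("the root only consumes");
    }
    @Override String doConsume(String var) { return "sum += " + var + ";"; }
}

final class ToyCodegenDemo {
    public static void main(String[] args) {
        ToyPlan plan = new ToyFilter(new ToyRange(10), v -> v + " % 2 == 0");
        System.out.println(new ToyStage(plan).doCodeGen());
    }
}
```

Calls to produce() travel down the tree and calls to consume() travel back up, so the filter's condition and the root's accumulation both end up inside the loop emitted by the leaf. That is the sense in which a subtree becomes a single function.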

This commit shows how the explain method marks which stages have been merged into a single stage:

https://github.com/apache/spark/commit/0e70fd61b4bc92bd744fc44dd3cbe91443207c72
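As a rough illustration of such marking, the sketch below prints a linear plan and prefixes each operator that belongs to a fused stage with an asterisk, which is the convention used in the explain output of Spark 2.x releases. The linked commit may format things differently. ToyNode and the operator names are hypothetical, and the code does not call Spark.

```java
// Toy plan printer. ToyNode is a hypothetical class, not part of Spark.
final class ToyNode {
    final String name;
    final boolean inCodegenStage; // true if the operator was fused into a generated function
    final ToyNode child;          // null for a leaf; only linear plans are modeled

    ToyNode(String name, boolean inCodegenStage, ToyNode child) {
        this.name = name;
        this.inCodegenStage = inCodegenStage;
        this.child = child;
    }

    // Render the plan top-down, one operator per line, "*" marking fused operators.
    String treeString() {
        StringBuilder sb = new StringBuilder();
        int depth = 0;
        for (ToyNode n = this; n != null; n = n.child, depth++) {
            if (depth > 0) {
                for (int i = 0; i < depth - 1; i++) {
                    sb.append("   ");
                }
                sb.append("+- ");
            }
            sb.append(n.inCodegenStage ? "*" : "").append(n.name).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyNode plan =
            new ToyNode("HashAggregate", true,
                new ToyNode("Exchange", false,
                    new ToyNode("HashAggregate", true,
                        new ToyNode("Filter", true,
                            new ToyNode("Range", true, null)))));
        System.out.print(plan.treeString());
    }
}
```

In this toy plan the unmarked Exchange splits the operators into two fused groups: the aggregate above it, and the aggregate, filter and range below it.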

This commit shows how to take full advantage of the SIMD optimizations that the JIT triggers automatically:

https://github.com/apache/spark/commit/fcb68e0f5d49234ac4527109887ff08cd4e1c29f
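The commit itself is not reproduced here. As a general illustration of why loop shape matters to the JIT, the sketch below contrasts an iterator-driven sum, which pays a call and an unboxing per row, with a counted loop over a primitive array, a shape the HotSpot JIT may unroll and, on some JVM versions and CPUs, vectorize. LoopShapes and its method names are hypothetical, and no speedup is claimed for any particular JVM.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Two ways to sum a column of longs. The class and method names are hypothetical.
final class LoopShapes {
    // Iterator style: every row costs a hasNext()/next() call pair and an unboxing.
    static long sumViaIterator(Iterator<Long> rows) {
        long sum = 0;
        while (rows.hasNext()) {
            sum += rows.next();
        }
        return sum;
    }

    // Fused style: a simple counted loop over a primitive array with no calls
    // in the body, which leaves the JIT free to unroll or vectorize it.
    static long sumTightLoop(long[] column) {
        long sum = 0;
        for (int i = 0; i < column.length; i++) {
            sum += column[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] column = new long[1000];
        List<Long> boxed = new ArrayList<>();
        for (int i = 0; i < column.length; i++) {
            column[i] = i;
            boxed.add((long) i);
        }
        System.out.println(sumTightLoop(column));
        System.out.println(sumViaIterator(boxed.iterator()));
    }
}
```

Both methods return the same result; only the shape of the loop handed to the JIT differs.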

Important related references:

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-whole-stage-codegen.html

https://janino-compiler.github.io/janino/

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html


Copyright notice: this is an original article by InfoQ author Clarke.

Original link: http://xie.infoq.cn/article/61be7f36d190b67e1bfc21a93. Please contact the author before republishing.
