Spark 2.0 Notes

Author: Clarke
  • November 26, 2021

The first commit for Spark's code generation feature:

https://github.com/apache/spark/commit/3c0d2365d57fc49ac9bf0d7cc9bd2ef633fb5fb6

Core class: WholeStageCodegenExec.scala

Important comment:

/**
 * WholeStageCodegen compiles a subtree of plans that support codegen together into single Java
 * function.
 *
 * Here is the call graph of to generate Java source (plan A supports codegen, but plan B does not):
 *
 *   WholeStageCodegen       Plan A               FakeInput        Plan B
 * =========================================================================
 *
 * -> execute()
 *
 *  doExecute() --------->   inputRDDs() -------> inputRDDs() ------> execute()
 *
 *  doConsume() <--------    consume()
 *
 * SparkPlan A should override doProduce() and doConsume().
 *
 * doCodeGen() will create a CodeGenContext, which will hold a list of variables for input,
 * used to generated code for [[BoundReference]].
 */
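To make the produce()/consume() call graph more concrete, here is a minimal, self-contained Java sketch. It is a toy model only: `Ctx`, `Op`, `Range`, `Filter`, `Project`, and `WholeStage` are hypothetical classes invented for illustration, not Spark's real API, and the `$in` placeholder is an assumption of this sketch. Each operator asks its child to produce, and each child pushes its output variable back up through consume(), so the whole chain collapses into the source text of one loop.

```java
/** Toy model of the produce()/consume() protocol. These are NOT Spark's classes. */
final class ProduceConsumeSketch {

    /** Collects generated source lines and hands out fresh variable names. */
    static final class Ctx {
        private final StringBuilder body = new StringBuilder();
        private int nextId = 0;
        String freshName(String prefix) { return prefix + "_" + (nextId++); }
        void emit(String line) { body.append(line).append('\n'); }
        String source() { return body.toString(); }
    }

    /** Hypothetical operator base class. */
    abstract static class Op {
        protected Op parent;
        /** Remember who consumes our rows, then start generating. */
        final void produce(Ctx ctx, Op parent) { this.parent = parent; doProduce(ctx); }
        /** Hand one output variable to the parent operator. */
        final void consume(Ctx ctx, String var) { parent.doConsume(ctx, var); }
        abstract void doProduce(Ctx ctx);
        abstract void doConsume(Ctx ctx, String inputVar);
    }

    /** Leaf operator: emits the loop that drives the whole stage. */
    static final class Range extends Op {
        private final long start;
        private final long end;
        Range(long start, long end) { this.start = start; this.end = end; }
        @Override void doProduce(Ctx ctx) {
            String v = ctx.freshName("v");
            ctx.emit("  for (long " + v + " = " + start + "L; " + v + " < " + end + "L; " + v + "++) {");
            consume(ctx, v);
            ctx.emit("  }");
        }
        @Override void doConsume(Ctx ctx, String inputVar) {
            throw new UnsupportedOperationException("leaf operator has no child");
        }
    }

    /** Skips rows that fail the predicate; "$in" stands for the input variable. */
    static final class Filter extends Op {
        private final Op child;
        private final String predicate;
        Filter(Op child, String predicate) { this.child = child; this.predicate = predicate; }
        @Override void doProduce(Ctx ctx) { child.produce(ctx, this); }
        @Override void doConsume(Ctx ctx, String in) {
            ctx.emit("    if (!(" + predicate.replace("$in", in) + ")) continue;");
            consume(ctx, in);
        }
    }

    /** Computes a new value from the input variable. */
    static final class Project extends Op {
        private final Op child;
        private final String expr;
        Project(Op child, String expr) { this.child = child; this.expr = expr; }
        @Override void doProduce(Ctx ctx) { child.produce(ctx, this); }
        @Override void doConsume(Ctx ctx, String in) {
            String out = ctx.freshName("v");
            ctx.emit("    long " + out + " = " + expr.replace("$in", in) + ";");
            consume(ctx, out);
        }
    }

    /** Root of the fused subtree: wraps everything into one function. */
    static final class WholeStage extends Op {
        private final Op child;
        WholeStage(Op child) { this.child = child; }
        String generate() {
            Ctx ctx = new Ctx();
            ctx.emit("void processNext(java.util.function.LongConsumer sink) {");
            produce(ctx, null);
            ctx.emit("}");
            return ctx.source();
        }
        @Override void doProduce(Ctx ctx) { child.produce(ctx, this); }
        @Override void doConsume(Ctx ctx, String in) { ctx.emit("    sink.accept(" + in + ");"); }
    }

    /** Range(0, 10) -> Filter(even) -> Project(times 3), fused into one function. */
    static String demo() {
        Op plan = new Project(new Filter(new Range(0, 10), "$in % 2 == 0"), "$in * 3");
        return new WholeStage(plan).generate();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Running `main` prints the text of a single `processNext` function whose body is one `for` loop containing the filter check, the projection, and the sink call. In the real system this generated source is then compiled at runtime (Janino, linked in the references at the end of these notes, is a compiler for that purpose); the sketch stops at producing the string.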


This commit shows how the explain method marks which stages have been merged into a single stage:

https://github.com/apache/spark/commit/0e70fd61b4bc92bd744fc44dd3cbe91443207c72
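As a rough illustration of the marking idea, the sketch below prints a plan tree and prefixes every operator that runs inside generated code with `*`. This assumes the text format of released Spark 2.x plans; whether the linked commit uses exactly this marker is not checked here. `Node` and `explain` are hypothetical names for this toy, not Spark's API.

```java
/** Toy plan printer. A leading '*' marks operators that are part of a codegen stage. */
final class ExplainSketch {

    /** Hypothetical plan node: a name, a codegen flag, and child nodes. */
    static final class Node {
        final String name;
        final boolean inCodegenStage;
        final Node[] children;
        Node(String name, boolean inCodegenStage, Node... children) {
            this.name = name;
            this.inCodegenStage = inCodegenStage;
            this.children = children;
        }
    }

    static String explain(Node root) {
        StringBuilder out = new StringBuilder();
        render(root, "", "", out);
        return out.toString();
    }

    /** firstPrefix is used on this node's line, restPrefix on its descendants' lines. */
    private static void render(Node n, String firstPrefix, String restPrefix, StringBuilder out) {
        out.append(firstPrefix).append(n.inCodegenStage ? "*" : "").append(n.name).append('\n');
        for (int i = 0; i < n.children.length; i++) {
            boolean last = i == n.children.length - 1;
            render(n.children[i],
                   restPrefix + (last ? "+- " : ":- "),
                   restPrefix + (last ? "   " : ":  "),
                   out);
        }
    }

    /** An aggregation split by a shuffle: the Exchange itself is not compiled. */
    static String demo() {
        Node plan = new Node("HashAggregate", true,
            new Node("Exchange", false,
                new Node("HashAggregate", true,
                    new Node("Range", true))));
        return explain(plan);
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In the demo plan, the unmarked `Exchange` line is where one fused stage ends and the next begins, which is the kind of boundary the marker is meant to make visible.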


This commit shows how to take full advantage of the SIMD optimizations that the JIT triggers automatically:

https://github.com/apache/spark/commit/fcb68e0f5d49234ac4527109887ff08cd4e1c29f
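The general idea can be sketched as follows; this is not the code from the commit. A plain counted loop over primitive arrays, with no virtual calls or object allocation in the body, is the shape of code that the HotSpot JIT can auto-vectorize. Whether it actually emits SIMD instructions depends on the JVM version, flags, and CPU, so treat the comments as an assumption rather than a guarantee. The class and method names are made up for this example.

```java
/** Contrasts a row-at-a-time accessor with a batch loop over primitive arrays. */
final class SimdFriendlyLoopSketch {

    /** Row-at-a-time style: every value is fetched through an interface call. */
    interface LongColumn {
        long get(int row);
    }

    static void addRowAtATime(LongColumn a, LongColumn b, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            // Calls through an interface; harder for the JIT to turn into SIMD code.
            out[i] = a.get(i) + b.get(i);
        }
    }

    /** Batch style: a simple counted loop that the JIT may vectorize on its own. */
    static void addBatch(long[] a, long[] b, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Fills two columns and adds them with the batch loop. */
    static long[] demo() {
        int n = 1024;
        long[] a = new long[n];
        long[] b = new long[n];
        long[] out = new long[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
            b[i] = 2L * i;
        }
        addBatch(a, b, out, n);
        return out;
    }

    public static void main(String[] args) {
        long[] out = demo();
        System.out.println(out[10] + " " + out[out.length - 1]);
    }
}
```

Both methods compute the same result; the difference is only in how much the JIT has to see through before it can treat the loop body as straight arithmetic on contiguous memory.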


Important related references

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-whole-stage-codegen.html

https://janino-compiler.github.io/janino/

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/293651311471490/5382278320999420/latest.html

https://databricks.com/blog/2016/05/23/apache-spark-as-a-compiler-joining-a-billion-rows-per-second-on-a-laptop.html

https://databricks.com/blog/2016/05/11/apache-spark-2-0-technical-preview-easier-faster-and-smarter.html
