Component versions
hive 2.3.7
hadoop 2.7.2
spark 2.4.3
Spark configuration
Spark on YARN configuration
Spark is installed under /Applications/bigsoft/spark-2.4.3-bin-hadoop2.7/.
On the Hadoop side, modify /Applications/bigsoft/hadoop-2.7.2/etc/hadoop/yarn-site.xml:
<property>
<description>Whether to enable log aggregation</description>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://localhost:19888/jobhistory/logs</value>
</property>
</configuration>
Modify YARN's capacity-scheduler.xml so that resources are scheduled by CPU + memory:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
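The scheduler change only takes effect after YARN is restarted. A minimal restart sequence, assuming HADOOP_HOME points at the Hadoop install path above:

```
${HADOOP_HOME}/sbin/stop-yarn.sh
${HADOOP_HOME}/sbin/start-yarn.sh
```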
Modify mapred-site.xml:
<property>
<name>mapreduce.jobhistory.address</name>
<value>master:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>master:19888</value>
</property>
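For the yarn.log.server.url configured earlier to serve aggregated logs, the MapReduce JobHistory server must be running. One way to start it, using the scripts that ship with Hadoop 2.7 (assuming HADOOP_HOME is set as above):

```
# starts the JobHistory server (RPC on master:10020, web UI on master:19888)
${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh start historyserver
```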
Add the following to spark-env.sh in Spark's conf directory:
export HADOOP_HOME=/Applications/bigsoft/hadoop-2.7.2
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export YARN_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SCALA_HOME=/Applications/bigsoft/scala-2.12.8
export SPARK_MASTER_IP=localhost
export SPARK_WORKER_MEMORY=2g
export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18018 -Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory"
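The history directory referenced in SPARK_HISTORY_OPTS must exist in HDFS before the history server starts. A sketch, assuming hdfs and the Spark sbin scripts are reachable on the PATH:

```
# create the event-log directory used by the history server
hdfs dfs -mkdir -p /user/spark/applicationHistory
# start the Spark history server (web UI on port 18018 per the options above)
${SPARK_HOME}/sbin/start-history-server.sh
```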
Modify spark-defaults.conf:
spark.eventLog.dir=hdfs:///user/spark/applicationHistory
spark.eventLog.enabled=true
spark.yarn.historyServer.address=localhost:18018
Start Spark:
${SPARK_HOME}/sbin/start-all.sh
Test run
spark-shell
// read a text file and run a word count
val text = sc.textFile("/tmp/test/hive.log")
text.flatMap(s => s.split(" ")).map(s => (s, 1)).reduceByKey((x, y) => x + y).collect().foreach(kv => println(kv))
View the job in the YARN web UI.
Configure Hive on Spark
cp /Applications/bigsoft/apache-hive-2.3.7-bin/conf/hive-site.xml /Applications/bigsoft/spark-2.4.3-bin-hadoop2.7/conf
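Copying hive-site.xml by itself only lets Spark read Hive's metastore configuration. For Hive to actually submit queries to Spark, the execution engine must also be switched in hive-site.xml. A sketch of the usual properties (the spark.home value repeats the install path from above):

```
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>yarn</value>
</property>
<property>
  <name>spark.home</name>
  <value>/Applications/bigsoft/spark-2.4.3-bin-hadoop2.7</value>
</property>
```

Alternatively, run `set hive.execution.engine=spark;` in the Hive CLI to switch engines for a single session.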