
A Spark cluster under Docker: tuning parameters to squeeze everything out of the hardware

Author: 程序员欣宸
  • August 20, 2022
    Guangdong


Welcome to my GitHub

All of 欣宸's original articles (with companion source code) are categorized and collected here: https://github.com/zq2599/blog_demos

Overview of this article


  • The Spark cluster deployed from the original docker-compose.yml has several problems:

  1. There is only one worker node, which is only suitable for small jobs; tasks over larger data sets take much longer than they should;

  2. The HDFS data directories live on the same disk as the Docker installation directory, so uploading a large number of files is likely to fail for lack of disk space;

  3. The master's port 4040 and the workers' web UI ports are not opened, so the state of jobs, stages, and executors cannot be observed;


  • Today we adjust these settings to solve the problems above.

The original docker-compose.yml

  • Before optimization, docker-compose.yml looked like this:


version: "2.2"services:  namenode:    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8    container_name: namenode    volumes:      - hadoop_namenode:/hadoop/dfs/name      - ./input_files:/input_files    environment:      - CLUSTER_NAME=test    env_file:      - ./hadoop.env    ports:      - 50070:50070    resourcemanager:    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8    container_name: resourcemanager    depends_on:      - namenode      - datanode1      - datanode2    env_file:      - ./hadoop.env    historyserver:    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8    container_name: historyserver    depends_on:      - namenode      - datanode1      - datanode2    volumes:      - hadoop_historyserver:/hadoop/yarn/timeline    env_file:      - ./hadoop.env    nodemanager1:    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8    container_name: nodemanager1    depends_on:      - namenode      - datanode1      - datanode2    env_file:      - ./hadoop.env    datanode1:    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8    container_name: datanode1    depends_on:      - namenode    volumes:      - hadoop_datanode1:/hadoop/dfs/data    env_file:      - ./hadoop.env    datanode2:    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8    container_name: datanode2    depends_on:      - namenode    volumes:      - hadoop_datanode2:/hadoop/dfs/data    env_file:      - ./hadoop.env    datanode3:    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8    container_name: datanode3    depends_on:      - namenode    volumes:      - hadoop_datanode3:/hadoop/dfs/data    env_file:      - ./hadoop.env
master: image: gettyimages/spark:2.3.0-hadoop-2.8 container_name: master command: bin/spark-class org.apache.spark.deploy.master.Master -h master hostname: master environment: MASTER: spark://master:7077 SPARK_CONF_DIR: /conf SPARK_PUBLIC_DNS: localhost links: - namenode expose: - 7001 - 7002 - 7003 - 7004 - 7005 - 7077 - 6066 ports: - 6066:6066 - 7077:7077 - 8080:8080 volumes: - ./conf/master:/conf - ./data:/tmp/data - ./jars:/root/jars
worker: image: gettyimages/spark:2.3.0-hadoop-2.8 container_name: worker command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077 hostname: worker environment: SPARK_CONF_DIR: /conf SPARK_WORKER_CORES: 2 SPARK_WORKER_MEMORY: 1g SPARK_WORKER_PORT: 8881 SPARK_WORKER_WEBUI_PORT: 8081 SPARK_PUBLIC_DNS: localhost links: - master expose: - 7012 - 7013 - 7014 - 7015 - 8881 ports: - 8081:8081 volumes: - ./conf/worker:/conf - ./data:/tmp/data
volumes: hadoop_namenode: hadoop_datanode1: hadoop_datanode2: hadoop_datanode3: hadoop_historyserver:
复制代码


  • Now let's start optimizing.

Environment

  • The machine used for this exercise is a Lenovo laptop:


  1. CPU: i5-6300HQ (4 cores, 4 threads)

  2. Memory: 16G

  3. Disk: 256G NVMe drive plus a 500G mechanical hard drive

  4. OS: Deepin 15

  5. docker: 18.09.1

  6. docker-compose: 1.17.1

  7. spark: 2.3.0

  8. hdfs: 2.7.1

Adjusting the number of worker nodes

  • With 16G of memory available, the plan is to go from 1 worker node to 6. The adjusted worker container configuration is shown below:


  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data

  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data
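
  • A quick memory budget: SPARK_WORKER_MEMORY is raised from 1g to 2g, so the six workers together claim 2g x 6 = 12G of the 16G of RAM, leaving the remainder for the Hadoop containers and the host itself (a rough estimate, not a hard guarantee).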


  • As shown above, note the volumes parameter: everything is mapped under the conf and data directories that sit alongside docker-compose.yml. Only worker1 and worker2 are listed here; worker3 through worker6 follow the same pattern.

Insufficient disk space caused by the location of the HDFS data directories

  • First, look at the HDFS data directory configuration:


    volumes:
      - hadoop_datanode1:/hadoop/dfs/data


  • The hadoop_datanode1 volume referenced above is declared at the very bottom of docker-compose.yml, using the default declaration, as follows:


volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:


  • With the containers running, execute docker inspect datanode1 to view the container's details; the volume-related part looks like this:


"Mounts": [            {                "Type": "volume",                "Name": "temp_hadoop_datanode1",                "Source": "/var/lib/docker/volumes/temp_hadoop_datanode1/_data",                "Destination": "/hadoop/dfs/data",                "Driver": "local",                "Mode": "rw",                "RW": true,                "Propagation": ""            }        ]
复制代码
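
  • If you only want the mount information, docker inspect's --format option with a Go template keeps the output short (an optional convenience, not required for the steps below):

docker inspect --format '{{ json .Mounts }}' datanode1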


  • As you can see, the HDFS container's data directory corresponds to /var/lib/docker/volumes on the host.

  • Run df -m to check the disk space. As shown below, the /dev/nvme0n1p3 device where /var/lib/docker/volumes lives has only about 29G free (29561 MB). That is clearly not enough for storing a large number of files, especially since the default HDFS replication factor is 3:


root@willzhao-deepin:/data/work/spark/temp# df -m
Filesystem     1M-blocks   Used Available Use% Mounted on
udev                7893      0      7893   0% /dev
tmpfs               1584      4      1581   1% /run
/dev/nvme0n1p3     43927  12107     29561  30% /
tmpfs               7918      0      7918   0% /dev/shm
tmpfs                  5      1         5   1% /run/lock
tmpfs               7918      0      7918   0% /sys/fs/cgroup
/dev/nvme0n1p4     87854    181     83169   1% /home
/dev/nvme0n1p1       300      7       293   3% /boot/efi
/dev/sda1         468428 109152    335430  25% /data
tmpfs               1584      1      1584   1% /run/user/108
tmpfs               1584      0      1584   0% /run/user/0
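
  • As a rough worked example: with HDFS's default replication factor of 3, even a 9G dataset consumes about 27G across the datanodes, which would nearly exhaust the ~29G free on /dev/nvme0n1p3.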


  • The disk information above shows that the /dev/sda1 device (mounted at /data) still has over 300G available, so mapping the HDFS data directories onto /dev/sda1 relieves the disk space problem. Since docker-compose.yml here lives under /data (the shell prompt above shows /data/work/spark/temp), a relative bind mount such as ./hadoop/datanode1 lands on that disk. Modify the three HDFS datanode entries in docker-compose.yml as follows:


  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env
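
  • Once the containers are later recreated, one way to confirm that HDFS data now lands on the larger disk is to watch the new bind-mount directories grow (assuming the compose project sits under /data as shown in the df output above):

du -sh hadoop/datanode1 hadoop/datanode2 hadoop/datanode3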


  • Then delete the following configuration:


volumes:
  hadoop_namenode:
  hadoop_datanode1:
  hadoop_datanode2:
  hadoop_datanode3:
  hadoop_historyserver:

Opening port 4040 on the master and the web UI ports on the workers

  • While a job is running, a UI for observing the details gives a much fuller and more intuitive picture of what is happening, so the configuration needs to be changed to open these ports.

  • As shown below, 4040 is added to the expose parameter, declaring that the container exposes port 4040, and 4040:4040 is added to the ports parameter, mapping the container's 4040 to port 4040 on the host:


  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars


  • The workers' web ports also need to be opened: a worker's web page lets you observe the worker's state and, importantly, read task logs. Because there are multiple workers, each must be mapped to a different host port. In the configuration below, worker1 sets environment.SPARK_WORKER_WEBUI_PORT to 8081, exposes 8081, and maps the container's 8081 to the host's 8081; worker2 does the same with 8082:


  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data

  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data


  • worker3 through worker6 are configured the same way; just use a different port number for each (8083 through 8086).

  • That completes the changes. The final docker-compose.yml is as follows:


version: "2.2"
services:
  namenode:
    image: bde2020/hadoop-namenode:1.1.0-hadoop2.7.1-java8
    container_name: namenode
    volumes:
      - ./hadoop/namenode:/hadoop/dfs/name
      - ./input_files:/input_files
    environment:
      - CLUSTER_NAME=test
    env_file:
      - ./hadoop.env
    ports:
      - 50070:50070

  resourcemanager:
    image: bde2020/hadoop-resourcemanager:1.1.0-hadoop2.7.1-java8
    container_name: resourcemanager
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env

  historyserver:
    image: bde2020/hadoop-historyserver:1.1.0-hadoop2.7.1-java8
    container_name: historyserver
    depends_on:
      - namenode
      - datanode1
      - datanode2
    volumes:
      - ./hadoop/historyserver:/hadoop/yarn/timeline
    env_file:
      - ./hadoop.env

  nodemanager1:
    image: bde2020/hadoop-nodemanager:1.1.0-hadoop2.7.1-java8
    container_name: nodemanager1
    depends_on:
      - namenode
      - datanode1
      - datanode2
    env_file:
      - ./hadoop.env

  datanode1:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode1
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode1:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode2:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode2
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode2:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  datanode3:
    image: bde2020/hadoop-datanode:1.1.0-hadoop2.7.1-java8
    container_name: datanode3
    depends_on:
      - namenode
    volumes:
      - ./hadoop/datanode3:/hadoop/dfs/data
    env_file:
      - ./hadoop.env

  master:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: master
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    links:
      - namenode
    expose:
      - 4040
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
      - ./jars:/root/jars

  worker1:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker1
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker1
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8081
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker1:/conf
      - ./data/worker1:/tmp/data

  worker2:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker2
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker2
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8082
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8082
    ports:
      - 8082:8082
    volumes:
      - ./conf/worker2:/conf
      - ./data/worker2:/tmp/data

  worker3:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker3
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker3
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8083
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8083
    ports:
      - 8083:8083
    volumes:
      - ./conf/worker3:/conf
      - ./data/worker3:/tmp/data

  worker4:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker4
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker4
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8084
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8084
    ports:
      - 8084:8084
    volumes:
      - ./conf/worker4:/conf
      - ./data/worker4:/tmp/data

  worker5:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker5
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker5
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8085
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8085
    ports:
      - 8085:8085
    volumes:
      - ./conf/worker5:/conf
      - ./data/worker5:/tmp/data

  worker6:
    image: gettyimages/spark:2.3.0-hadoop-2.8
    container_name: worker6
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker6
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 2g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8086
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
      - 8086
    ports:
      - 8086:8086
    volumes:
      - ./conf/worker6:/conf
      - ./data/worker6:/tmp/data


  • Next, run an example to verify the changes.

Verification

  • In the directory containing docker-compose.yml, create a hadoop.env file with the following content:


CORE_CONF_fs_defaultFS=hdfs://namenode:8020
CORE_CONF_hadoop_http_staticuser_user=root
CORE_CONF_hadoop_proxyuser_hue_hosts=*
CORE_CONF_hadoop_proxyuser_hue_groups=*

HDFS_CONF_dfs_webhdfs_enabled=true
HDFS_CONF_dfs_permissions_enabled=false

YARN_CONF_yarn_log___aggregation___enable=true
YARN_CONF_yarn_resourcemanager_recovery_enabled=true
YARN_CONF_yarn_resourcemanager_store_class=org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
YARN_CONF_yarn_resourcemanager_fs_state___store_uri=/rmstate
YARN_CONF_yarn_nodemanager_remote___app___log___dir=/app-logs
YARN_CONF_yarn_log_server_url=http://historyserver:8188/applicationhistory/logs/
YARN_CONF_yarn_timeline___service_enabled=true
YARN_CONF_yarn_timeline___service_generic___application___history_enabled=true
YARN_CONF_yarn_resourcemanager_system___metrics___publisher_enabled=true
YARN_CONF_yarn_resourcemanager_hostname=resourcemanager
YARN_CONF_yarn_timeline___service_hostname=historyserver
YARN_CONF_yarn_resourcemanager_address=resourcemanager:8032
YARN_CONF_yarn_resourcemanager_scheduler_address=resourcemanager:8030
YARN_CONF_yarn_resourcemanager_resource___tracker_address=resourcemanager:8031


  • With docker-compose.yml updated, start the containers with the following command:


docker-compose up -d
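
  • Once the containers are up, a quick sanity check is to list them with docker-compose ps; all the services defined above (namenode, datanode1-3, resourcemanager, historyserver, nodemanager1, master, and worker1 through worker6) should show an Up state:

docker-compose ps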


  • The Spark application used for this verification analyzes Wikipedia site statistics to find the most visited pages. A pre-built jar is used here, so no coding is involved; for the application's source code and development details, see 《spark实战之:分析维基百科网站统计数据(java版)》;

  • Download the pre-built Spark application jar from GitHub:


wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/sparkdemo-1.0-SNAPSHOT.jar



  • Also download the data file to be analyzed (a Wikipedia page-count dump):

wget https://raw.githubusercontent.com/zq2599/blog_demos/master/files/pagecounts-20160801-000000


  • Put the downloaded sparkdemo-1.0-SNAPSHOT.jar into the jars directory alongside docker-compose.yml;

  • Inside the input_files directory alongside docker-compose.yml, create an input directory and put the downloaded pagecounts-20160801-000000 file into it;

  • Run the following command to put the entire input directory into HDFS:


docker exec namenode hdfs dfs -put /input_files/input /
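
  • Optionally, confirm the upload by listing the directory in HDFS (hdfs dfs -ls is a standard HDFS shell command):

docker exec namenode hdfs dfs -ls /input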


  • Run the following command to submit a job, using 12 executor cores in total and 1G of memory per executor:


docker exec -it master spark-submit \
--class com.bolingcavalry.sparkdemo.app.WikiRank \
--executor-memory 1g \
--total-executor-cores 12 \
/root/jars/sparkdemo-1.0-SNAPSHOT.jar \
namenode \
8020
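
  • While the job runs, its progress can be watched through the web UIs opened earlier: the application UI at http://localhost:4040, the master UI at http://localhost:8080, and each worker's UI (including executor logs) at http://localhost:8081 through http://localhost:8086. With 6 workers of 2 cores each, the cluster offers 2 x 6 = 12 cores in total, so --total-executor-cores 12 uses the whole cluster, and 1g per executor fits within each worker's 2g of memory.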


Welcome to follow me on InfoQ: 程序员欣宸

On the road of learning you are not alone; 欣宸's original articles will keep you company all the way...

