
Best Practices for Synchronizing Data Between Hadoop Clusters

Author: 冰心的小屋 · 2023-10-11 · Beijing


Questions I ran into recently:

  1. How do you synchronize data between Hadoop clusters?

  2. How do you synchronize data when only one of the clusters has Kerberos authentication enabled?

  3. How do you synchronize data when both clusters have Kerberos authentication enabled?

Research directions:

  1. Survey the common way to synchronize data between Hadoop clusters;

  2. Build a Hadoop cluster to verify the feasibility of the approach;

  3. Understand how Kerberos works, and set up a Kerberos server and client by hand;

  4. Investigate how to enable Kerberos authentication on a Hadoop cluster;

  5. Investigate data synchronization when only one of the clusters has Kerberos authentication enabled;

  6. Investigate data synchronization when both clusters have Kerberos authentication enabled.

1. Surveying the common way to synchronize data between Hadoop clusters

The common way to synchronize data between Hadoop clusters today is the hadoop distcp command that ships with Hadoop. Let's look at the code behind the distcp command to understand its main implementation logic.

The hadoop distcp command is implemented by the DistCp class. Ignoring the non-core logic, the call chain to trace is: main -> execute -> createAndSubmitJob -> createJob, and createJob eventually leads to the class that does the actual work, CopyMapper.

CopyMapper turns out to be nothing more than a MapReduce Mapper.


Following the copy logic further: CopyMapper -> CopyMapper.map -> CopyMapper.copyFileWithRetry -> RetriableFileCopyCommand.doExecute -> RetriableFileCopyCommand.doCopy -> RetriableFileCopyCommand.copyToFile -> FileSystem.create. In the end, the data is synchronized through file streams created on the HDFS FileSystem.
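
For reference, the same FileSystem-level stream copy can be done for a single path with the plain hdfs client; distcp essentially performs that copy inside map tasks so many files move in parallel. A minimal sketch, reusing the example cluster address from later in this article (part-00000 is just a placeholder file name):

# Single-stream copy through FileSystem APIs, done client-side by hdfs dfs
hdfs dfs -cp /input/part-00000 hdfs://192.168.1.102:9000/input/

# distcp does the same stream read/write, but splits the file list across map tasks
hadoop distcp /input hdfs://192.168.1.102:9000/input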

2. Building a Hadoop cluster to verify feasibility

With distcp's implementation understood, the next step is hands-on practice. At this point you need to think about the problems that can come up during synchronization:

  1. How to set the number of mappers and the resources available to each mapper;

  2. Whether the transfer can be throttled when the network becomes congested;

  3. How to recover when the job ends up in an abnormal state.


The problems you can think of, distcp's designers already thought of. Check the command's help output and pay attention to these options:

  • -m: the number of mappers;

  • -bandwidth: the network bandwidth limit;

  • -update: fault-tolerant resume, copying only what the target is missing.


You also need to think about the resources the job itself needs, i.e. the memory each container gets:

  • mapreduce.map.memory.mb: 4096

  • yarn.app.mapreduce.am.resource.mb: 4096

How to use the distcp tool:

# Copy from the local HDFS cluster to a remote HDFS cluster
hadoop distcp /input hdfs://192.168.1.102:9000/input

# Copy from a remote HDFS cluster to the local cluster
hadoop distcp hdfs://192.168.1.102:9000/input /tmp

# The more important options:
# -bandwidth: limits the transfer bandwidth, effectively reducing cluster IO pressure
# -log:       logs the whole run
# -update:    copies only the directories or files missing on the target
# -m:         number of map tasks
hadoop distcp -bandwidth 10 -log copy.log -update hdfs://192.168.1.102:9000/input /tmp

Using distcp in different scenarios:

  1. Both sides have limited resources:

hadoop distcp \
  -D dfs.replication=$replication \
  -D mapreduce.map.memory.mb=4096 \
  -D yarn.app.mapreduce.am.resource.mb=4096 \
  -m 3 \
  -bandwidth 100 \
  -update \
  hdfs://192.168.1.102:9000/input /tmp
  2. Both sides have a moderate amount of resources:

hadoop distcp \
  -D dfs.replication=$replication \
  -D mapreduce.map.memory.mb=4096 \
  -D yarn.app.mapreduce.am.resource.mb=4096 \
  -m 10 \
  -bandwidth 500 \
  -update \
  hdfs://192.168.1.102:9000/input /tmp
  3. Both sides have ample resources:

hadoop distcp \
  -D dfs.replication=$replication \
  -D mapreduce.map.memory.mb=4096 \
  -D yarn.app.mapreduce.am.resource.mb=4096 \
  -m 30 \
  -bandwidth 2000 \
  -update \
  hdfs://192.168.1.102:9000/input /tmp

3. Understanding Kerberos and setting up the server and client

For how Kerberos works, Oracle's official documentation explains it simply and clearly: https://docs.oracle.com/cd/E19253-01/819-7061/intro-fig-39/index.html


When the theory still feels fuzzy, hands-on practice is the best way to clear it up.

3.1 Setting up the server

yum update
yum install -y krb5-libs krb5-server krb5-workstation pam_krb5

vim /etc/krb5.conf
[logging]
 default = FILE:/var/log/krb5libs.log
 kdc = FILE:/var/log/krb5kdc.log
 admin_server = FILE:/var/log/kadmind.log

[libdefaults]
 default_realm = ICE.COM
 dns_lookup_realm = true
 dns_lookup_kdc = true
 ticket_lifetime = 24h
 forwardable = yes

[realms]
 ICE.COM = {
  kdc = ice.com:88
  admin_server = ice.com:749
  default_domain = ice.com
 }

[domain_realm]
 .ice.com = ICE.COM
 ice.com = ICE.COM

[appdefaults]
 pam = {
  debug = true
  ticket_lifetime = 36000
  renew_lifetime = 36000
  forwardable = true
  krb4_convert = false
 }

vim /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
 v4_mode = nopreauth
 kdc_tcp_ports = 88

[realms]
 ICE.COM = {
  #master_key_type = des3-hmac-sha1
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3
 }

# Create the KDC database
kdb5_util create -s -r ICE.COM

echo "*/admin@ICE.COM *" > /var/kerberos/krb5kdc/kadm5.acl
systemctl restart krb5kdc
systemctl restart kadmin

# Add the administrator
kadmin.local
addprinc admin/admin
ktadd -k /var/kerberos/krb5kdc/krb5.keytab admin/admin

3.2 Setting up the client

yum -y install krb5-workstation krb5-libs
scp root@hadoop:/etc/krb5.conf /etc/krb5.conf
scp root@hadoop:/var/kerberos/krb5kdc/krb5.keytab /var/kerberos/krb5/krb5.keytab
kinit -kt /var/kerberos/krb5/krb5.keytab admin/admin
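
Before moving on, it is worth a quick check that the client can actually obtain a ticket; a minimal verification, using the admin keytab copied above:

kinit -kt /var/kerberos/krb5/krb5.keytab admin/admin
klist        # should show a TGT for admin/admin@ICE.COM
kdestroy     # drop the ticket when you are done testing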

3.3 Common Kerberos operations

# Enter the admin console
kadmin.local
# Create the database
kdb5_util create -s -r BITNEI.COM
# Add a principal with a random password
addprinc -randkey principal
# Add a principal with a custom password
addprinc -pw 1234 principal
# Generate a keytab file, regenerating the key
xst -k /some_path/principal.keytab principal
# Only export the keytab file, without regenerating the key
xst -norandkey -k /some_path/principal.keytab principal
# List all principals
listprincs
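
As a worked example combining the operations above, this is roughly how a dedicated principal for cross-cluster copies could be prepared; the principal name and keytab path are placeholders:

# Create a dedicated principal and export its keytab without touching the key afterwards
kadmin.local -q "addprinc -randkey distcp/ice.com@ICE.COM"
kadmin.local -q "xst -norandkey -k /some_path/distcp.keytab distcp/ice.com@ICE.COM"
# Authenticate with the keytab and confirm the ticket
kinit -kt /some_path/distcp.keytab distcp/ice.com@ICE.COM
klist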

4. Enabling Kerberos authentication on a Hadoop cluster

The material online varies wildly in quality; it took a fair amount of time to finally get a Kerberos-enabled Hadoop cluster up and running.

4.1 Installing Kerberos

# 1. Install the packages and export environment variables
yum install -y krb5-libs krb5-server krb5-workstation krb5-auth-dialog

vim /etc/profile
export KRB_REALM=EXAMPLE.COM
export DOMAIN_REALM=example.com
export KERBEROS_ADMIN=admin/admin
export KERBEROS_ADMIN_PASSWORD=admin
export KEYTAB_DIR=/etc/security/keytabs

# 2. Configure krb5.conf
vim /etc/krb5.conf
[logging]
 default = FILE:/var/log/kerberos/krb5libs.log
 kdc = FILE:/var/log/kerberos/krb5kdc.log
 admin_server = FILE:/var/log/kerberos/kadmind.log

[libdefaults]
 default_realm = EXAMPLE.COM
 dns_lookup_realm = false
 dns_lookup_kdc = false
 ticket_lifetime = 24h
 renew_lifetime = 7d
 forwardable = true

[realms]
 EXAMPLE.COM = {
  kdc = kdc.kerberos.com
  admin_server = kdc.kerberos.com
 }

[domain_realm]
 .kdc.kerberos.com = EXAMPLE.COM
 kdc.kerberos.com = EXAMPLE.COM

# 3. Configure kdc.conf
vim /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
 kdc_ports = 88
 kdc_tcp_ports = 88

[realms]
 EXAMPLE.COM = {
  acl_file = /var/kerberos/krb5kdc/kadm5.acl
  dict_file = /usr/share/dict/words
  admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
  supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
 }

# 4. Create the database
kdb5_util create -s -r EXAMPLE.COM

systemctl start krb5kdc
systemctl start kadmin
systemctl enable krb5kdc
systemctl enable kadmin

# 5. Add the administrator
kadmin.local
addprinc admin/admin

# 6. Add the Hadoop service principals in batch
typeset -l HOST_NAME
HOST_NAME=$(hostname -f)

kadmin.local -q "addprinc -pw admin root"
kadmin.local -q "addprinc -randkey nn/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey dn/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey HTTP/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey jhs/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey yarn/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey rm/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey nm/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "xst -k nn.service.keytab nn/$HOST_NAME"
kadmin.local -q "xst -k dn.service.keytab dn/$HOST_NAME"
kadmin.local -q "xst -k spnego.service.keytab HTTP/$HOST_NAME"
kadmin.local -q "xst -k jhs.service.keytab jhs/$HOST_NAME"
kadmin.local -q "xst -k yarn.service.keytab yarn/$HOST_NAME"
kadmin.local -q "xst -k rm.service.keytab rm/$HOST_NAME"
kadmin.local -q "xst -k nm.service.keytab nm/$HOST_NAME"

# Move the keytabs to the target directory
mkdir -p ${KEYTAB_DIR}
mv nn.service.keytab ${KEYTAB_DIR}
mv dn.service.keytab ${KEYTAB_DIR}
mv spnego.service.keytab ${KEYTAB_DIR}
mv jhs.service.keytab ${KEYTAB_DIR}
mv yarn.service.keytab ${KEYTAB_DIR}
mv rm.service.keytab ${KEYTAB_DIR}
mv nm.service.keytab ${KEYTAB_DIR}
chmod 400 ${KEYTAB_DIR}/nn.service.keytab
chmod 400 ${KEYTAB_DIR}/dn.service.keytab
chmod 400 ${KEYTAB_DIR}/spnego.service.keytab
chmod 400 ${KEYTAB_DIR}/jhs.service.keytab
chmod 400 ${KEYTAB_DIR}/yarn.service.keytab
chmod 400 ${KEYTAB_DIR}/rm.service.keytab
chmod 400 ${KEYTAB_DIR}/nm.service.keytab

4.2 Installing Hadoop

# 1. Install Java
curl -LO "https://cfdownload.adobe.com/pub/adobe/coldfusion/java/java8/java8u371/jdk/jdk-8u371-linux-x64.rpm"
rpm -i jdk-8u371-linux-x64.rpm

echo "export JAVA_HOME=/usr/java/latest" >> /etc/profile
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile
source /etc/profile
rm -f /usr/bin/java && ln -s $JAVA_HOME/bin/java /usr/bin/java

curl -LOH 'Cookie: oraclelicense=accept-securebackup-cookie' 'http://download.oracle.com/otn-pub/java/jce/8/jce_policy-8.zip'
yum install unzip -y
unzip jce_policy-8.zip
cp UnlimitedJCEPolicyJDK8/local_policy.jar UnlimitedJCEPolicyJDK8/US_export_policy.jar $JAVA_HOME/jre/lib/security

# 2. Download Hadoop
wget -O hadoop-3.3.6.tar.gz https://mirrors.aliyun.com/apache/hadoop/core/hadoop-3.3.6/hadoop-3.3.6.tar.gz?spm=a2c6h.25603864.0.0.4bfc7aa1Bgcsmt
tar -xvf hadoop-3.3.6.tar.gz -C /usr/local/
cd /usr/local
ln -s hadoop-3.3.6 hadoop
chown root:root -R hadoop

# 3. Set environment variables
vim ~/.bash_profile
export HADOOP_PREFIX="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export NM_CONTAINER_EXECUTOR_PATH=$HADOOP_PREFIX/bin/container-executor
export HADOOP_BIN_HOME=$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_BIN_HOME

export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

source ~/.bash_profile


hadoop-env.sh


# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are optional.
# When running a distributed configuration it is best to set JAVA_HOME in this
# file, so that it is correctly defined on remote nodes.

# The java implementation to use.
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/local/hadoop

# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}

export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/

# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""

# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"

# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"

# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"

# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=

# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER

# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}

###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""

###
# Advanced Users Only!
###

# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}

# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER


core-site.xml


<configuration>
    <property> <name>fs.defaultFS</name> <value>hdfs://HOSTNAME:9000</value> </property>
    <property> <name>hadoop.security.authentication</name> <value>kerberos</value> <description>Set the authentication for the cluster. Valid values are: simple or kerberos.</description> </property>
    <property> <name>hadoop.security.authorization</name> <value>true</value> <description>Enable authorization for different protocols.</description> </property>
    <property>
        <name>hadoop.security.auth_to_local</name>
        <value>
        RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/root/
        RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/root/
        RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/root/
        RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/root/
        RULE:[2:$1@$0](rm@.*EXAMPLE.COM)s/.*/root/
        RULE:[2:$1@$0](jhs@.*EXAMPLE.COM)s/.*/root/
        DEFAULT
        </value>
        <description>The mapping from kerberos principal names to local OS user names.</description>
    </property>
    <property> <name>hadoop.ssl.require.client.cert</name> <value>false</value> </property>
    <property> <name>hadoop.ssl.hostname.verifier</name> <value>DEFAULT</value> </property>
    <property> <name>hadoop.ssl.keystores.factory.class</name> <value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value> </property>
    <property> <name>hadoop.ssl.server.conf</name> <value>ssl-server.xml</value> </property>
    <property> <name>hadoop.ssl.client.conf</name> <value>ssl-client.xml</value> </property>
    <property> <name>hadoop.rpc.protection</name> <value>privacy</value> </property>
</configuration>
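
The hadoop.security.auth_to_local rules above decide which local OS user a Kerberos principal maps to. A quick way to sanity-check them, assuming this core-site.xml is on the client's classpath, is Hadoop's built-in HadoopKerberosName helper class (the principal below is just an example):

# Print how a principal is mapped by the auth_to_local rules
hadoop org.apache.hadoop.security.HadoopKerberosName nn/node1.example.com@EXAMPLE.COM
# Expected output along the lines of:
# Name: nn/node1.example.com@EXAMPLE.COM to root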


hdfs-site.xml


<configuration>
    <property> <name>dfs.replication</name> <value>1</value> </property>
    <property> <name>dfs.permissions</name> <value>true</value> <description>If "true", enable permission checking in HDFS. If "false", permission checking is turned off, but all other behavior is unchanged. Switching from one parameter value to the other does not change the mode, owner or group of files or directories.</description> </property>
<property> <name>dfs.permissions.supergroup</name> <value>root</value> <description>The name of the group of super-users.</description> </property>
<property> <name>dfs.namenode.handler.count</name> <value>100</value> <description>Added to grow Queue size so that more client connections are allowed</description> </property>
<property> <name>ipc.server.max.response.size</name> <value>5242880</value> </property>
<property> <name>dfs.block.access.token.enable</name> <value>true</value> <description> If "true", access tokens are used as capabilities for accessing datanodes. If "false", no access tokens are checked on accessing datanodes. </description> </property>
<property> <name>dfs.namenode.kerberos.principal</name> <value>nn/_HOST@EXAMPLE.COM</value> <description> Kerberos principal name for the NameNode </description> </property>
<property> <name>dfs.secondary.namenode.kerberos.principal</name> <value>nn/_HOST@EXAMPLE.COM</value> <description>Kerberos principal name for the secondary NameNode. </description> </property>
<property> <!--cluster variant --> <name>dfs.secondary.http.address</name> <value>HOSTNAME:50090</value> <description>Address of secondary namenode web server</description> </property>
<property> <name>dfs.secondary.https.port</name> <value>50490</value> <description>The https port where secondary-namenode binds</description> </property>
<property> <name>dfs.web.authentication.kerberos.principal</name> <value>HTTP/_HOST@EXAMPLE.COM</value> <description> The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. The HTTP Kerberos principal MUST start with 'HTTP/' per Kerberos HTTP SPNEGO specification. </description> </property>
<property> <name>dfs.web.authentication.kerberos.keytab</name> <value>/etc/security/keytabs/spnego.service.keytab</value> <description>The Kerberos keytab file with the credentials for the HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint. </description> </property>
<property> <name>dfs.datanode.kerberos.principal</name> <value>dn/_HOST@EXAMPLE.COM</value> <description> The Kerberos principal that the DataNode runs as. "_HOST" is replaced by the real host name. </description> </property>
<property> <name>dfs.namenode.keytab.file</name> <value>/etc/security/keytabs/nn.service.keytab</value> <description> Combined keytab file containing the namenode service and host principals. </description> </property>
<property> <name>dfs.secondary.namenode.keytab.file</name> <value>/etc/security/keytabs/nn.service.keytab</value> <description> Combined keytab file containing the namenode service and host principals. </description> </property>
<property> <name>dfs.datanode.keytab.file</name> <value>/etc/security/keytabs/dn.service.keytab</value> <description> The filename of the keytab file for the DataNode. </description> </property>
<property> <name>dfs.https.port</name> <value>50470</value> <description>The https port where namenode binds</description> </property>
<property> <name>dfs.https.address</name> <value>HOSTNAME:50470</value> <description>The https address where namenode binds</description> </property>
<property> <name>dfs.datanode.data.dir.perm</name> <value>750</value> <description>The permissions that should be there on dfs.data.dir directories. The datanode will not come up if the permissions are different on existing dfs.data.dir directories. If the directories don't exist, they will be created with this permission.</description> </property>
<property> <name>dfs.access.time.precision</name> <value>0</value> <description>The access time for HDFS file is precise upto this value.The default value is 1 hour. Setting a value of 0 disables access times for HDFS. </description> </property>
<property> <name>dfs.cluster.administrators</name> <value>root</value> <description>ACL for who all can view the default servlets in the HDFS</description> </property>
<property> <name>ipc.server.read.threadpool.size</name> <value>5</value> <description></description> </property>
<property> <name>dfs.namenode.kerberos.internal.spnego.principal</name> <value>${dfs.web.authentication.kerberos.principal}</value> </property>
<property> <name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name> <value>${dfs.web.authentication.kerberos.principal}</value> </property>
<property> <name>dfs.data.transfer.protection</name> <value>authentication</value> </property> <property> <name>dfs.encrypt.data.transfer</name> <value>true</value> </property>
<property> <name>dfs.datanode.data.dir.perm</name> <value>700</value> </property>
<property> <name>dfs.datanode.address</name> <value>0.0.0.0:50010</value> </property>
<property> <name>dfs.datanode.https.address</name> <value>0.0.0.0:50075</value> </property>
<property> <name>dfs.namenode.https-address</name> <value>HOSTNAME:50470</value> </property>
<property> <name>dfs.http.policy</name> <value>HTTPS_ONLY</value> </property>
<property> <name>dfs.client.https.need-auth</name> <value>false</value> </property>
</configuration>


ssl-client.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property> <name>ssl.client.truststore.location</name> <value>/usr/local/hadoop/lib/keystore.jks</value> <description>Truststore to be used by clients like distcp. Must be specified. </description></property>
<property> <name>ssl.client.truststore.password</name> <value>bigdata</value> <description>Optional. Default value is "". </description></property>
<property> <name>ssl.client.truststore.type</name> <value>jks</value> <description>Optional. The keystore file format, default value is "jks". </description></property>
<property> <name>ssl.client.truststore.reload.interval</name> <value>10000</value> <description>Truststore reload check interval, in milliseconds. Default value is 10000 (10 seconds). </description></property>
<property> <name>ssl.client.keystore.location</name> <value>/usr/local/hadoop/lib/keystore.jks</value> <description>Keystore to be used by clients like distcp. Must be specified. </description></property>
<property> <name>ssl.client.keystore.password</name> <value>bigdata</value> <description>Optional. Default value is "". </description></property>
<property> <name>ssl.client.keystore.keypassword</name> <value>bigdata</value> <description>Optional. Default value is "". </description></property>
<property> <name>ssl.client.keystore.type</name> <value>jks</value> <description>Optional. The keystore file format, default value is "jks". </description></property>
</configuration>


ssl-server.xml


<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property> <name>ssl.server.truststore.location</name> <value>/usr/local/hadoop/lib/keystore.jks</value> <description>Truststore to be used by NN and DN. Must be specified. </description></property>
<property> <name>ssl.server.truststore.password</name> <value>bigdata</value> <description>Optional. Default value is "". </description></property>
<property> <name>ssl.server.truststore.type</name> <value>jks</value> <description>Optional. The keystore file format, default value is "jks". </description></property>
<property> <name>ssl.server.truststore.reload.interval</name> <value>10000</value> <description>Truststore reload check interval, in milliseconds. Default value is 10000 (10 seconds). </description></property>
<property> <name>ssl.server.keystore.location</name> <value>/usr/local/hadoop/lib/keystore.jks</value> <description>Keystore to be used by NN and DN. Must be specified. </description></property>
<property> <name>ssl.server.keystore.password</name> <value>bigdata</value> <description>Must be specified</description></property>
<property> <name>ssl.server.keystore.keypassword</name> <value>bigdata</value> <description>Must be specified. </description></property>
<property> <name>ssl.server.keystore.type</name> <value>jks</value> <description>Optional. The keystore file format, default value is "jks". </description></property>
</configuration>


mapred-site.xml


<configuration>
    <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property>
    <property> <name>mapreduce.jobhistory.keytab</name> <value>/etc/security/keytabs/jhs.service.keytab</value> </property>
    <property> <name>mapreduce.jobhistory.principal</name> <value>jhs/_HOST@EXAMPLE.COM</value> </property>
    <property> <name>mapreduce.jobhistory.webapp.address</name> <value>HOSTNAME:19888</value> </property>
    <property> <name>mapreduce.jobhistory.webapp.https.address</name> <value>HOSTNAME:19889</value> </property>
    <property> <name>mapreduce.jobhistory.webapp.spnego-keytab-file</name> <value>/etc/security/keytabs/spnego.service.keytab</value> </property>
    <property> <name>mapreduce.jobhistory.webapp.spnego-principal</name> <value>HTTP/_HOST@EXAMPLE.COM</value> </property>
</configuration>


yarn-site.xml


<configuration>
    <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property>
    <property> <name>yarn.application.classpath</name> <value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value> </property>
    <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>600</value>
        <description>Number of seconds after an application finishes before the nodemanager's DeletionService will delete the application's localized file directory and log directory. To diagnose Yarn application problems, set this property's value large enough (for example, to 600 = 10 minutes) to permit examination of these directories. After changing the property's value, you must restart the nodemanager in order for it to have an effect. The roots of Yarn applications' work directories are configurable with the yarn.nodemanager.local-dirs property, and the roots of the Yarn applications' log directories are configurable with the yarn.nodemanager.log-dirs property.</description>
    </property>
    <property> <name>yarn.resourcemanager.principal</name> <value>rm/HOSTNAME@EXAMPLE.COM</value> </property>
    <property> <name>yarn.resourcemanager.keytab</name> <value>/etc/security/keytabs/rm.service.keytab</value> </property>
    <property> <name>yarn.nodemanager.principal</name> <value>nm/HOSTNAME@EXAMPLE.COM</value> </property>
    <property> <name>yarn.nodemanager.keytab</name> <value>/etc/security/keytabs/nm.service.keytab</value> </property>
    <property> <name>yarn.nodemanager.container-executor.class</name> <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value> </property>
    <property> <name>yarn.nodemanager.linux-container-executor.path</name> <value>/usr/local/hadoop/bin/container-executor</value> </property>
    <property> <name>yarn.nodemanager.linux-container-executor.group</name> <value>root</value> </property>
    <property> <name>yarn.timeline-service.principal</name> <value>yarn/HOSTNAME@EXAMPLE.COM</value> </property>
    <property> <name>yarn.timeline-service.keytab</name> <value>/etc/security/keytabs/yarn.service.keytab</value> </property>
    <property> <name>yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled</name> <value>true</value> </property>
    <property> <name>yarn.timeline-service.http-authentication.type</name> <value>kerberos</value> </property>
    <property> <name>yarn.timeline-service.http-authentication.kerberos.principal</name> <value>HTTP/HOSTNAME@EXAMPLE.COM</value> </property>
    <property> <name>yarn.timeline-service.http-authentication.kerberos.keytab</name> <value>/etc/security/keytabs/yarn.service.keytab</value> </property>
</configuration>
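
With the configuration in place, a quick smoke test confirms that HDFS now really requires Kerberos; a minimal check, assuming the keytabs generated in 4.1:

# Without a ticket, access should be rejected with an authentication error
kdestroy
hdfs dfs -ls /

# With a ticket obtained from a service keytab, the same command should succeed
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
hdfs dfs -ls /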

5. Data synchronization when only one cluster has Kerberos authentication

First, a few things need to be settled:

  1. Whose resources will be used for the transfer, i.e. which cluster the map tasks run on;

  2. Make sure the Hadoop cluster that runs the map tasks has YARN installed;

  3. The NameNode and DataNode ports must be reachable between the two clusters (a quick check follows this list).
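
For point 3, a simple connectivity check, assuming the NameNode RPC port used in this article (9000) and the DataNode port configured above (50010); the address is the example remote cluster from earlier:

# Run from a node of the cluster that will execute the map tasks
nc -vz 192.168.1.102 9000    # NameNode RPC
nc -vz 192.168.1.102 50010   # DataNode data transfer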


The recommended approach:

  1. Running distcp on the cluster that has Kerberos authentication works best: as long as the ports are reachable, nothing needs to be changed (a sketch follows this list);

  2. Create a Linux user on the Kerberos-enabled cluster so that the cluster without authentication can drive the copy remotely.
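
Concretely, running the copy from the Kerberos-enabled side might look like the sketch below. The keytab, principal and paths are placeholders; the ipc.client.fallback-to-simple-auth-allowed property lets a Kerberos client talk to the cluster that still runs simple authentication:

# On the Kerberos-enabled cluster: authenticate first
kinit -kt /some_path/distcp.keytab distcp/ice.com@ICE.COM

# Then push against the non-Kerberos cluster, allowing fallback to simple auth
hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  -m 10 \
  -bandwidth 100 \
  -update \
  /input hdfs://192.168.1.102:9000/input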

6. Data synchronization when both clusters have Kerberos authentication

Research on the two-sided Kerberos case hit a wall: both the source and the target cluster require authentication, but the test node could only carry the configuration of one of the clusters, which stalled the work until I came across this article: https://cm.bigdata.pens.ac.id/static/help/topics/cdh_admin_distcp_data_cluster_migrate.html.

Option 1: mutual Kerberos trust between the two realms

Implementation steps:

  1. Both KDCs' /etc/krb5.conf must be extended with the other side's realms, domain_realm and capaths entries (a sketch follows this list);

  2. Distribute each updated /etc/krb5.conf to the Kerberos clients of the corresponding cluster;

  3. Restart both KDCs: systemctl restart kadmin and systemctl restart krb5kdc;

  4. Add the cross-realm principals: kadmin.local -q "addprinc dc/hw.com@ICE.COM" and kadmin.local -q "addprinc dc/ice.com@HW.COM";

  5. Add the user mapping rules to core-site.xml;

  6. Restart the Hadoop clusters;

  7. Make a copy of core-site.xml that merges both clusters' NameNode and DataNode addresses, and use it as the context configuration that distcp relies on.
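
A sketch of the krb5.conf additions from step 1, using ICE.COM and HW.COM as the two realms (the KDC host names are placeholders). It is written as an append for brevity; in practice you would merge these entries into the existing [realms], [domain_realm] and [capaths] sections and mirror them on the HW.COM side:

cat >> /etc/krb5.conf <<'EOF'
[realms]
 HW.COM = {
  kdc = kdc.hw.com:88
  admin_server = kdc.hw.com:749
 }

[domain_realm]
 .hw.com = HW.COM
 hw.com = HW.COM

[capaths]
 ICE.COM = {
  HW.COM = .
 }
 HW.COM = {
  ICE.COM = .
 }
EOF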

Advantages of this option:

  1. Efficient: distcp submits a MapReduce job and moves the data in the map phase, reading and writing each block stream exactly once, which greatly speeds up the copy;

  2. Stable: distcp works with HA configurations and offers parameters to cap the number of maps, the number of threads and the network transfer rate.

Disadvantages of this option:

  1. Highly invasive: it changes a whole series of Kerberos and Hadoop cluster configurations;

  2. It hurts stability: every configuration change has to be verified with a cluster restart; building this from scratch took at least 20 restarts, which would be unacceptable in production;

  3. Tightly coupled and hard to extend: the two sides now genuinely know about each other, and it is unclear how to handle yet another Kerberos cluster that needs to be synchronized later;

  4. Harder to troubleshoot: both sides perform authentication, so an authentication failure has to be traced across two clusters at once.

Option 2: a neutral, trusted HDFS cluster


This option relies on a neutral HDFS cluster that both sides trust.

Requirements for this HDFS cluster:

  1. Hardware: CPU and memory do not need to be high-end; 100 TB of disk is plenty;

  2. Network: both sides must be able to reach the neutral cluster's HDFS; they do not need to reach each other;

  3. Software: the same HDFS version, with only the NameNode and DataNodes installed; neither YARN nor Kerberos is needed;

  4. A standardized directory layout for the transfers.

The concrete workflow (a sketch follows the list):

  1. The source cluster authenticates against its own Kerberos, then pushes the data with hadoop distcp;

  2. The neutral HDFS cluster watches the push tasks; when a push finishes, some mechanism should notify the target cluster;

  3. The target cluster receives the completion event, pulls the data from the neutral HDFS cluster, and then carries on with its own downstream processing.
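
A sketch of the two hops; the neutral cluster's address (hdfs://neutral-nn:9000), the /exchange layout, and the keytabs/principals are placeholders. Since the neutral cluster runs without Kerberos, the fallback property from section 5 applies again:

# Hop 1: on the source cluster, authenticate locally and push to the neutral cluster
kinit -kt /some_path/distcp.keytab distcp/ice.com@ICE.COM
hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  -update -bandwidth 100 \
  /warehouse/input hdfs://neutral-nn:9000/exchange/input

# Hop 2: on the target cluster, pull from the neutral cluster after being notified
kinit -kt /some_path/distcp.keytab distcp/hw.com@HW.COM
hadoop distcp \
  -D ipc.client.fallback-to-simple-auth-allowed=true \
  -update -bandwidth 100 \
  hdfs://neutral-nn:9000/exchange/input /warehouse/input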

Advantages of this option:

  1. Minimally invasive, low risk, easy to extend: neither side's Kerberos service nor Hadoop cluster needs any change, which actually lowers the risk;

  2. Better stability: because a neutral cluster sits in the middle, the two clusters' resources do not affect each other during synchronization, unlike option 1;

  3. Simple to implement: just stand up an HDFS cluster and whitelist the IPs allowed to access it.

Disadvantages of this option:

  1. Longer synchronization time: the data is copied one extra time;

  2. If the synchronization is to be automated later, publishing and subscribing to task-completion events has to be built;

  3. The neutral HDFS cluster adds hardware cost.

