Best Practices for Synchronizing Data Between Hadoop Clusters
- 2023-10-11, Beijing
Problems encountered recently:
How do you synchronize data between Hadoop clusters?
How do you synchronize data when only one of the clusters has Kerberos authentication enabled?
How do you synchronize data when both clusters have Kerberos authentication enabled?
Research directions:
Investigate the general way to synchronize data between Hadoop clusters;
Set up Hadoop clusters to verify the feasibility of the approach;
Understand how Kerberos works, and set up a Kerberos server and client by hand;
Investigate how to enable Kerberos authentication on a Hadoop cluster;
Investigate data synchronization when only one cluster has Kerberos authentication enabled;
Investigate data synchronization when both clusters have Kerberos authentication enabled.
1. Investigating the general way to synchronize data between Hadoop clusters
The standard way to synchronize data between Hadoop clusters is the hadoop distcp command that ships with Hadoop. Let's look at the code behind the distcp command to understand its main logic.
The hadoop distcp command is implemented by the DistCp class. Ignoring the non-core logic, the call chain is: main -> execute -> createAndSubmitJob -> createJob, and createJob ultimately leads to the class that does the real work, CopyMapper:
The CopyMapper class turns out to be a MapReduce Mapper:
Following the copy logic further: CopyMapper -> CopyMapper.map -> CopyMapper.copyFileWithRetry -> RetriableFileCopyCommand.doExecute -> RetriableFileCopyCommand.doCopy -> RetriableFileCopyCommand.copyToFile -> FileSystem.create; in the end the data is synchronized through a file stream created via the HDFS FileSystem.
2. Setting up Hadoop clusters to verify feasibility
With distcp's implementation understood, the next step is hands-on practice. At this point you need to think about the problems that can come up during synchronization:
How to set the number of mappers and the resources available to each mapper;
Whether the transfer can be throttled when the network becomes congested;
How to recover when the job ends up in an abnormal state.
The designers of distcp already thought of these problems. Run the help command and pay attention to the following options:
-m: number of mappers;
-bandwidth: network bandwidth per map;
-update: fault-tolerant resumption, copying only what the target is missing.
You also need to consider the resources the job requires, i.e. the memory used by each container:
mapreduce.map.memory.mb:4096
yarn.app.mapreduce.am.resource.mb:4096
How to use the distcp tool:
# Copy from the local HDFS cluster to a remote HDFS cluster
hadoop distcp /input hdfs://192.168.1.102:9000/input
# Copy from a remote HDFS cluster to the local cluster
hadoop distcp hdfs://192.168.1.102:9000/input /tmp
# The more important options:
# -bandwidth: limits the transfer bandwidth, which effectively reduces I/O pressure on the clusters
# -log: logs the whole run
# -update: copies only the directories and files missing from the target
# -m: number of map tasks
hadoop distcp -bandwidth 10 -log copy.log -update hdfs://192.168.1.102:9000/input /tmp
Using distcp in different scenarios:
Both sides have limited resources:
hadoop distcp \
-D dfs.replication=$replication \
-D mapreduce.map.memory.mb=4096 \
-D yarn.app.mapreduce.am.resource.mb=4096 \
-m 3 \
-bandwidth 100 \
-update \
hdfs://192.168.1.102:9000/input /tmp
Both sides have moderate resources:
hadoop distcp \
-D dfs.replication=$replication \
-D mapreduce.map.memory.mb=4096 \
-D yarn.app.mapreduce.am.resource.mb=4096 \
-m 10 \
-bandwidth 500 \
-update \
hdfs://192.168.1.102:9000/input /tmp
Both sides have ample resources:
hadoop distcp \
-D dfs.replication=$replication \
-D mapreduce.map.memory.mb=4096 \
-D yarn.app.mapreduce.am.resource.mb=4096 \
-m 30 \
-bandwidth 2000 \
-update \
hdfs://192.168.1.102:9000/input /tmp
3. Understanding Kerberos and setting up a Kerberos server and client
Oracle's official documentation explains how Kerberos works simply and clearly: https://docs.oracle.com/cd/E19253-01/819-7061/intro-fig-39/index.html
When the theory still feels fuzzy, hands-on practice is the best way to firm it up.
3.1 Setting up the server
yum update
yum install -y krb5-libs krb5-server krb5-workstation pam_krb5
vim /etc/krb5.conf
[logging]
default = FILE:/var/log/krb5libs.log
kdc = FILE:/var/log/krb5kdc.log
admin_server = FILE:/var/log/kadmind.log
[libdefaults]
default_realm = ICE.COM
dns_lookup_realm = true
dns_lookup_kdc = true
ticket_lifetime = 24h
forwardable = yes
[realms]
ICE.COM = {
kdc = ice.com:88
admin_server = ice.com:749
default_domain = ice.com
}
[domain_realm]
.ice.com = ICE.COM
ice.com = ICE.COM
[appdefaults]
pam = {
debug = true
ticket_lifetime = 36000
renew_lifetime = 36000
forwardable = true
krb4_convert = false
}
vim /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
v4_mode = nopreauth
kdc_tcp_ports = 88
[realms]
ICE.COM = {
#master_key_type = des3-hmac-sha1
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal des-cbc-crc:v4 des-cbc-crc:afs3
}
# Create the KDC database
kdb5_util create -s -r ICE.COM
echo "*/admin@ICE.COM *" > /var/kerberos/krb5kdc/kadm5.acl
systemctl restart krb5kdc
systemctl restart kadmin
# Add an administrator
kadmin.local
addprinc admin/admin
ktadd -k /var/kerberos/krb5kdc/krb5.keytab admin/admin
3.2 Setting up the client
yum -y install krb5-workstation krb5-libs
scp root@hadoop:/etc/krb5.conf /etc/krb5.conf
scp root@hadoop:/var/kerberos/krb5kdc/krb5.keytab /var/kerberos/krb5/krb5.keytab
kinit -kt /var/kerberos/krb5/krb5.keytab admin/admin
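A quick sanity check that the client actually obtained a ticket for the admin principal created on the server:
# The credential cache should now contain a ticket-granting ticket for admin/admin@ICE.COM
klist
# Discard the ticket once the test is done
kdestroy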
3.3 Common Kerberos operations
# Enter the admin console
kadmin.local
# Create the database
kdb5_util create -s -r BITNEI.COM
# Add a principal with a random password
addprinc -randkey principal
# Add a principal with a custom password
addprinc -pw 1234 principal
# Generate a keytab file (this regenerates the principal's password)
xst -k /some_path/principal.keytab principal
# Export the keytab file only, without regenerating the password
xst -norandkey -k /some_path/principal.keytab principal
# List all principals
listprincs
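A few more kadmin.local operations that come in handy while testing; principal here is a placeholder, as above:
# Show the details of a principal
getprinc principal
# Change a principal's password
cpw -pw 5678 principal
# Delete a principal
delprinc principal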
4. Enabling Kerberos authentication on a Hadoop cluster
The material online varies wildly; it took a fair amount of time, but I finally got a Hadoop cluster set up with Kerberos authentication enabled.
4.1 Installing Kerberos
# 1. Install the packages and export environment variables
yum install -y krb5-libs krb5-server krb5-workstation krb5-auth-dialog -y
vim /etc/profile
export KRB_REALM=EXAMPLE.COM
export DOMAIN_REALM=example.com
export KERBEROS_ADMIN=admin/admin
export KERBEROS_ADMIN_PASSWORD=admin
export KEYTAB_DIR=/etc/security/keytabs
# 2. Configure krb5.conf
vim /etc/krb5.conf
[logging]
default = FILE:/var/log/kerberos/krb5libs.log
kdc = FILE:/var/log/kerberos/krb5kdc.log
admin_server = FILE:/var/log/kerberos/kadmind.log
[libdefaults]
default_realm = EXAMPLE.COM
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true
[realms]
EXAMPLE.COM = {
kdc = kdc.kerberos.com
admin_server = kdc.kerberos.com
}
[domain_realm]
.kdc.kerberos.com = EXAMPLE.COM
kdc.kerberos.com = EXAMPLE.COM
# 3. Configure kdc.conf
vim /var/kerberos/krb5kdc/kdc.conf
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88
[realms]
EXAMPLE.COM = {
acl_file = /var/kerberos/krb5kdc/kadm5.acl
dict_file = /usr/share/dict/words
admin_keytab = /var/kerberos/krb5kdc/kadm5.keytab
supported_enctypes = aes256-cts:normal aes128-cts:normal des3-hmac-sha1:normal arcfour-hmac:normal camellia256-cts:normal camellia128-cts:normal des-hmac-sha1:normal des-cbc-md5:normal des-cbc-crc:normal
}
# 4. Create the database
kdb5_util create -s -r EXAMPLE.COM
systemctl start krb5kdc
systemctl start kadmin
systemctl enable krb5kdc
systemctl enable kadmin
# 5. Add an administrator
kadmin.local
addprinc admin/admin
# 6. Add Hadoop principals
# Add principals in batch
typeset -l HOST_NAME
HOST_NAME=$(hostname -f)
kadmin.local -q "addprinc -pw admin root"
kadmin.local -q "xst -k nn.service.keytab nn/$HOST_NAME"
kadmin.local -q "addprinc -randkey nn/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey dn/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey HTTP/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey jhs/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey yarn/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey rm/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "addprinc -randkey nm/$(hostname -f)@${KRB_REALM}"
kadmin.local -q "xst -k nn.service.keytab nn/$HOST_NAME"
kadmin.local -q "xst -k dn.service.keytab dn/$HOST_NAME"
kadmin.local -q "xst -k spnego.service.keytab HTTP/$HOST_NAME"
kadmin.local -q "xst -k jhs.service.keytab jhs/$HOST_NAME"
kadmin.local -q "xst -k yarn.service.keytab yarn/$HOST_NAME"
kadmin.local -q "xst -k rm.service.keytab rm/$HOST_NAME"
kadmin.local -q "xst -k nm.service.keytab nm/$HOST_NAME"
# Move the keytabs to the target directory
mkdir -p ${KEYTAB_DIR}
mv nn.service.keytab ${KEYTAB_DIR}
mv dn.service.keytab ${KEYTAB_DIR}
mv spnego.service.keytab ${KEYTAB_DIR}
mv jhs.service.keytab ${KEYTAB_DIR}
mv yarn.service.keytab ${KEYTAB_DIR}
mv rm.service.keytab ${KEYTAB_DIR}
mv nm.service.keytab ${KEYTAB_DIR}
chmod 400 ${KEYTAB_DIR}/nn.service.keytab
chmod 400 ${KEYTAB_DIR}/dn.service.keytab
chmod 400 ${KEYTAB_DIR}/spnego.service.keytab
chmod 400 ${KEYTAB_DIR}/jhs.service.keytab
chmod 400 ${KEYTAB_DIR}/yarn.service.keytab
chmod 400 ${KEYTAB_DIR}/rm.service.keytab
chmod 400 ${KEYTAB_DIR}/nm.service.keytab
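Before wiring the keytabs into Hadoop, it is worth checking that each keytab contains the expected principals and can actually be used to log in; a quick check using the paths created above:
# List the entries stored in the NameNode keytab
klist -kt ${KEYTAB_DIR}/nn.service.keytab
# Try a non-interactive login with the NameNode principal, then inspect the ticket
kinit -kt ${KEYTAB_DIR}/nn.service.keytab nn/$(hostname -f)@${KRB_REALM}
klist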
4.2 Installing Hadoop
# 1. Install Java
curl -LOH "https://cfdownload.adobe.com/pub/adobe/coldfusion/java/java8/java8u371/jdk/jdk-8u371-linux-x64.rpm"
rpm -i jdk-8u371-linux-x64.rpm
echo "export JAVA_HOME=/usr/java/latest" >> /etc/profile
echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile
source /etc/profile
rm -f /usr/bin/java && ln -s $JAVA_HOME/bin/java /usr/bin/java
curl -LOH 'Cookie: oraclelicense=accept-securebackup-cookie' 'http://download.oracle.com/otn-pub/java/jce/8/jce_policy-8.zip'
yum install unzip -y
unzip jce_policy-8.zip
cp UnlimitedJCEPolicyJDK8/local_policy.jar UnlimitedJCEPolicyJDK8/US_export_policy.jar $JAVA_HOME/jre/lib/security
# 2. Download Hadoop
wget -O hadoop-3.3.6.tar.gz https://mirrors.aliyun.com/apache/hadoop/core/hadoop-3.3.6/hadoop-3.3.6.tar.gz?spm=a2c6h.25603864.0.0.4bfc7aa1Bgcsmt
tar -xvf hadoop-3.3.6.tar.gz -C /usr/local/
cd /usr/local
ln -s hadoop-3.3.6 hadoop
chown root:root -R hadoop
# 3. Set environment variables
vim ~/.bash_profile
export HADOOP_PREFIX="/usr/local/hadoop"
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export YARN_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export NM_CONTAINER_EXECUTOR_PATH=$HADOOP_PREFIX/bin/container-executor
export HADOOP_BIN_HOME=$HADOOP_PREFIX/bin
export PATH=$PATH:$HADOOP_BIN_HOME
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
source ~/.bash_profile
hadoop-env.sh
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use.
export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/local/hadoop
# The jsvc implementation to use. Jsvc is required to run secure datanodes
# that bind to privileged ports to provide authentication of data transfer
# protocol. Jsvc is not required if SASL is configured for authentication of
# data transfer protocol using non-privileged ports.
#export JSVC_HOME=${JSVC_HOME}
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
# Extra Java CLASSPATH elements. Automatically insert capacity-scheduler.
for f in $HADOOP_HOME/contrib/capacity-scheduler/*.jar; do
if [ "$HADOOP_CLASSPATH" ]; then
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
else
export HADOOP_CLASSPATH=$f
fi
done
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=
#export HADOOP_NAMENODE_INIT_HEAPSIZE=""
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="$HADOOP_OPTS -Djava.net.preferIPv4Stack=true"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dhadoop.security.logger=ERROR,RFAS $HADOOP_DATANODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_NFS3_OPTS="$HADOOP_NFS3_OPTS"
export HADOOP_PORTMAP_OPTS="-Xmx512m $HADOOP_PORTMAP_OPTS"
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
export HADOOP_CLIENT_OPTS="-Xmx512m $HADOOP_CLIENT_OPTS"
#HADOOP_JAVA_PLATFORM_OPTS="-XX:-UsePerfData $HADOOP_JAVA_PLATFORM_OPTS"
# On secure datanodes, user to run the datanode as after dropping privileges.
# This **MUST** be uncommented to enable secure HDFS if using privileged ports
# to provide authentication of data transfer protocol. This **MUST NOT** be
# defined if SASL is configured for authentication of data transfer protocol
# using non-privileged ports.
export HADOOP_SECURE_DN_USER=
# Where log files are stored. $HADOOP_HOME/logs by default.
#export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}/$USER
# Where log files are stored in the secure data environment.
export HADOOP_SECURE_DN_LOG_DIR=${HADOOP_LOG_DIR}/${HADOOP_HDFS_USER}
###
# HDFS Mover specific parameters
###
# Specify the JVM options to be used when starting the HDFS Mover.
# These options will be appended to the options specified as HADOOP_OPTS
# and therefore may override any similar flags set in HADOOP_OPTS
#
# export HADOOP_MOVER_OPTS=""
###
# Advanced Users Only!
###
# The directory where pid files are stored. /tmp by default.
# NOTE: this should be set to a directory that can only be written to by
# the user that will run the hadoop daemons. Otherwise there is the
# potential for a symlink attack.
export HADOOP_PID_DIR=${HADOOP_PID_DIR}
export HADOOP_SECURE_DN_PID_DIR=${HADOOP_PID_DIR}
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://HOSTNAME:9000</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
<description> Set the authentication for the cluster.
Valid values are: simple or kerberos.</description>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
<description>Enable authorization for different protocols.</description>
</property>
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/root/
RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/root/
RULE:[2:$1@$0](hm@.*EXAMPLE.COM)s/.*/root/
RULE:[2:$1@$0](rs@.*EXAMPLE.COM)s/.*/root/
RULE:[2:$1@$0](rm@.*EXAMPLE.COM)s/.*/root/
RULE:[2:$1@$0](jhs@.*EXAMPLE.COM)s/.*/root/
DEFAULT
</value>
<description>The mapping from kerberos principal names
to local OS user names.</description>
</property>
<property>
<name>hadoop.ssl.require.client.cert</name>
<value>false</value>
</property>
<property>
<name>hadoop.ssl.hostname.verifier</name>
<value>DEFAULT</value>
</property>
<property>
<name>hadoop.ssl.keystores.factory.class</name>
<value>org.apache.hadoop.security.ssl.FileBasedKeyStoresFactory</value>
</property>
<property>
<name>hadoop.ssl.server.conf</name>
<value>ssl-server.xml</value>
</property>
<property>
<name>hadoop.ssl.client.conf</name>
<value>ssl-client.xml</value>
</property>
<property>
<name>hadoop.rpc.protection</name>
<value>privacy</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>true</value>
<description> If "true", enable permission checking in
HDFS. If "false", permission checking is turned
off, but all other behavior is
unchanged. Switching from one parameter value to the other does
not change the mode, owner or group of files or
directories. </description>
</property>
<property>
<name>dfs.permissions.supergroup</name>
<value>root</value>
<description>The name of the group of super-users.</description>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>100</value>
<description>Added to grow Queue size so that more client connections are allowed</description>
</property>
<property>
<name>ipc.server.max.response.size</name>
<value>5242880</value>
</property>
<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
<description> If "true", access tokens are used as capabilities
for accessing datanodes. If "false", no access tokens are checked on
accessing datanodes. </description>
</property>
<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/_HOST@EXAMPLE.COM</value>
<description> Kerberos principal name for the NameNode </description>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.principal</name>
<value>nn/_HOST@EXAMPLE.COM</value>
<description>Kerberos principal name for the secondary NameNode.
</description>
</property>
<property>
<!--cluster variant -->
<name>dfs.secondary.http.address</name>
<value>HOSTNAME:50090</value>
<description>Address of secondary namenode web server</description>
</property>
<property>
<name>dfs.secondary.https.port</name>
<value>50490</value>
<description>The https port where secondary-namenode binds</description>
</property>
<property>
<name>dfs.web.authentication.kerberos.principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
<description> The HTTP Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
The HTTP Kerberos principal MUST start with 'HTTP/' per Kerberos HTTP
SPNEGO specification.
</description>
</property>
<property>
<name>dfs.web.authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
<description>The Kerberos keytab file with the credentials for the HTTP
Kerberos principal used by Hadoop-Auth in the HTTP endpoint.
</description>
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@EXAMPLE.COM</value>
<description>
The Kerberos principal that the DataNode runs as. "_HOST" is replaced by the real
host name.
</description>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
<description>
Combined keytab file containing the namenode service and host
principals.
</description>
</property>
<property>
<name>dfs.secondary.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
<description>
Combined keytab file containing the namenode service and host
principals.
</description>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
<description>
The filename of the keytab file for the DataNode.
</description>
</property>
<property>
<name>dfs.https.port</name>
<value>50470</value>
<description>The https port where namenode binds</description>
</property>
<property>
<name>dfs.https.address</name>
<value>HOSTNAME:50470</value>
<description>The https address where namenode binds</description>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>750</value>
<description>The permissions that should be there on
dfs.data.dir directories. The datanode will not come up if the
permissions are different on existing dfs.data.dir directories. If
the directories don't exist, they will be created with this
permission.</description>
</property>
<property>
<name>dfs.access.time.precision</name>
<value>0</value>
<description>The access time for HDFS file is precise upto this
value.The default value is 1 hour. Setting a value of 0
disables access times for HDFS.
</description>
</property>
<property>
<name>dfs.cluster.administrators</name>
<value>root</value>
<description>ACL for who all can view the default servlets in the HDFS</description>
</property>
<property>
<name>ipc.server.read.threadpool.size</name>
<value>5</value>
<description></description>
</property>
<property>
<name>dfs.namenode.kerberos.internal.spnego.principal</name>
<value>${dfs.web.authentication.kerberos.principal}</value>
</property>
<property>
<name>dfs.secondary.namenode.kerberos.internal.spnego.principal</name>
<value>${dfs.web.authentication.kerberos.principal}</value>
</property>
<property>
<name>dfs.data.transfer.protection</name>
<value>authentication</value>
</property>
<property>
<name>dfs.encrypt.data.transfer</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.data.dir.perm</name>
<value>700</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
</property>
<property>
<name>dfs.datanode.https.address</name>
<value>0.0.0.0:50075</value>
</property>
<property>
<name>dfs.namenode.https-address</name>
<value>HOSTNAME:50470</value>
</property>
<property>
<name>dfs.http.policy</name>
<value>HTTPS_ONLY</value>
</property>
<property>
<name>dfs.client.https.need-auth</name>
<value>false</value>
</property>
</configuration>
ssl-client.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>ssl.client.truststore.location</name>
<value>/usr/local/hadoop/lib/keystore.jks</value>
<description>Truststore to be used by clients like distcp. Must be
specified.
</description>
</property>
<property>
<name>ssl.client.truststore.password</name>
<value>bigdata</value>
<description>Optional. Default value is "".
</description>
</property>
<property>
<name>ssl.client.truststore.type</name>
<value>jks</value>
<description>Optional. The keystore file format, default value is "jks".
</description>
</property>
<property>
<name>ssl.client.truststore.reload.interval</name>
<value>10000</value>
<description>Truststore reload check interval, in milliseconds.
Default value is 10000 (10 seconds).
</description>
</property>
<property>
<name>ssl.client.keystore.location</name>
<value>/usr/local/hadoop/lib/keystore.jks</value>
<description>Keystore to be used by clients like distcp. Must be
specified.
</description>
</property>
<property>
<name>ssl.client.keystore.password</name>
<value>bigdata</value>
<description>Optional. Default value is "".
</description>
</property>
<property>
<name>ssl.client.keystore.keypassword</name>
<value>bigdata</value>
<description>Optional. Default value is "".
</description>
</property>
<property>
<name>ssl.client.keystore.type</name>
<value>jks</value>
<description>Optional. The keystore file format, default value is "jks".
</description>
</property>
</configuration>
ssl-server.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<configuration>
<property>
<name>ssl.server.truststore.location</name>
<value>/usr/local/hadoop/lib/keystore.jks</value>
<description>Truststore to be used by NN and DN. Must be specified.
</description>
</property>
<property>
<name>ssl.server.truststore.password</name>
<value>bigdata</value>
<description>Optional. Default value is "".
</description>
</property>
<property>
<name>ssl.server.truststore.type</name>
<value>jks</value>
<description>Optional. The keystore file format, default value is "jks".
</description>
</property>
<property>
<name>ssl.server.truststore.reload.interval</name>
<value>10000</value>
<description>Truststore reload check interval, in milliseconds.
Default value is 10000 (10 seconds).
</description>
</property>
<property>
<name>ssl.server.keystore.location</name>
<value>/usr/local/hadoop/lib/keystore.jks</value>
<description>Keystore to be used by NN and DN. Must be specified.
</description>
</property>
<property>
<name>ssl.server.keystore.password</name>
<value>bigdata</value>
<description>Must be specified</description>
</property>
<property>
<name>ssl.server.keystore.keypassword</name>
<value>bigdata</value>
<description>Must be specified.
</description>
</property>
<property>
<name>ssl.server.keystore.type</name>
<value>jks</value>
<description>Optional. The keystore file format, default value is "jks".
</description>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.keytab</name>
<value>/etc/security/keytabs/jhs.service.keytab</value>
</property>
<property>
<name>mapreduce.jobhistory.principal</name>
<value>jhs/_HOST@EXAMPLE.COM</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>HOSTNAME:19888</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.https.address</name>
<value>HOSTNAME:19889</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.spnego-keytab-file</name>
<value>/etc/security/keytabs/spnego.service.keytab</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.spnego-principal</name>
<value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.application.classpath</name>
<value>/usr/local/hadoop/etc/hadoop, /usr/local/hadoop/share/hadoop/common/*, /usr/local/hadoop/share/hadoop/common/lib/*, /usr/local/hadoop/share/hadoop/hdfs/*, /usr/local/hadoop/share/hadoop/hdfs/lib/*, /usr/local/hadoop/share/hadoop/mapreduce/*, /usr/local/hadoop/share/hadoop/mapreduce/lib/*, /usr/local/hadoop/share/hadoop/yarn/*, /usr/local/hadoop/share/hadoop/yarn/lib/*</value>
</property>
<property>
<description>
Number of seconds after an application finishes before the nodemanager's
DeletionService will delete the application's localized file directory
and log directory.
To diagnose Yarn application problems, set this property's value large
enough (for example, to 600 = 10 minutes) to permit examination of these
directories. After changing the property's value, you must restart the
nodemanager in order for it to have an effect.
The roots of Yarn applications' work directories is configurable with
the yarn.nodemanager.local-dirs property (see below), and the roots
of the Yarn applications' log directories is configurable with the
yarn.nodemanager.log-dirs property (see also below).
</description>
<name>yarn.nodemanager.delete.debug-delay-sec</name>
<value>600</value>
</property>
<property>
<name>yarn.resourcemanager.principal</name>
<value>rm/HOSTNAME@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.resourcemanager.keytab</name>
<value>/etc/security/keytabs/rm.service.keytab</value>
</property>
<property>
<name>yarn.nodemanager.principal</name>
<value>nm/HOSTNAME@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.nodemanager.keytab</name>
<value>/etc/security/keytabs/nm.service.keytab</value>
</property>
<property>
<name>yarn.nodemanager.container-executor.class</name>
<value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.path</name>
<value>/usr/local/hadoop/bin/container-executor</value>
</property>
<property>
<name>yarn.nodemanager.linux-container-executor.group</name>
<value>root</value>
</property>
<property>
<name>yarn.timeline-service.principal</name>
<value>yarn/HOSTNAME@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.timeline-service.keytab</name>
<value>/etc/security/keytabs/yarn.service.keytab</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled</name>
<value>true</value>
</property>
<property>
<name>yarn.timeline-service.http-authentication.type</name>
<value>kerberos</value>
</property>
<property>
<name>yarn.timeline-service.http-authentication.kerberos.principal</name>
<value>HTTP/HOSTNAME@EXAMPLE.COM</value>
</property>
<property>
<name>yarn.timeline-service.http-authentication.kerberos.keytab</name>
<value>/etc/security/keytabs/yarn.service.keytab</value>
</property>
</configuration>
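With all of the configuration files in place, a rough bring-up and smoke test looks like the following. This is only a sketch: it assumes the keytabs and principals created in 4.1 and a freshly formatted HDFS, and a secure cluster usually needs additional steps (for example configuring container-executor) that are omitted here.
# Format HDFS and start the daemons
hdfs namenode -format
start-dfs.sh
start-yarn.sh
# Without a Kerberos ticket, access should fail with an authentication error
hdfs dfs -ls /
# After logging in with the NameNode keytab, the same command should succeed
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
hdfs dfs -ls /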
5. Data synchronization when only one cluster has Kerberos enabled
First, a few things need to be settled:
Whose resources drive the transfer, i.e. which cluster the map tasks run on;
Make sure the Hadoop cluster that runs the map tasks has YARN installed;
The NameNode and DataNode ports must be reachable between the clusters (a quick check is shown below).
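A quick way to check reachability from the cluster that will run the map tasks; nc is just one option, and the port numbers below are the ones used in this article's configuration (9000 for the NameNode RPC port, 50010 for the DataNode transfer port):
# NameNode RPC port of the remote cluster
nc -zv 192.168.1.102 9000
# DataNode data-transfer port of the remote cluster
nc -zv 192.168.1.102 50010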
The recommended approach:
Run the distcp command from the Kerberos-enabled cluster; as long as the ports are reachable, nothing needs to change on either side (a sample invocation is sketched below);
Create a Linux account on the Kerberos-enabled cluster so that the unauthenticated cluster can drive the transfer remotely.
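A sketch of what the push might look like when run from the Kerberos-enabled cluster; the principal, keytab path, and target address are placeholders, and depending on the Hadoop version the secure client may also need ipc.client.fallback-to-simple-auth-allowed=true when talking to the unauthenticated cluster:
# Authenticate on the secure cluster first
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
# Push data to the cluster without Kerberos, allowing fallback to simple auth for the remote side
hadoop distcp \
-D ipc.client.fallback-to-simple-auth-allowed=true \
-m 10 \
-bandwidth 100 \
-update \
/input hdfs://192.168.1.102:9000/input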
6. Data synchronization when both clusters have Kerberos enabled
Researching mutual Kerberos authentication ran into an obstacle: the source and target clusters both require authentication, but the test node could only hold the configuration of one of them, which blocked the work until I found this article: https://cm.bigdata.pens.ac.id/static/help/topics/cdh_admin_distcp_data_cluster_migrate.html.
Option 1: mutual Kerberos trust
Implementation steps:
Append the other side's realms, domain_realm, and capaths entries to /etc/krb5.conf on both KDCs (see the illustrative snippet after this list);
Distribute each updated /etc/krb5.conf to the Kerberos clients of the corresponding cluster;
Restart both KDCs: systemctl restart kadmin and systemctl restart krb5kdc;
Add cross-realm principals: kadmin.local -q "addprinc dc/hw.com@ICE.COM" and kadmin.local -q "addprinc dc/ice.com@HW.COM";
Add the user mapping rules to core-site.xml;
Restart both Hadoop clusters;
Make a copy of core-site.xml that merges the NameNode and DataNode addresses of both clusters, and use it as the context that distcp depends on.
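For illustration only, the additions on the ICE.COM side of /etc/krb5.conf might look roughly like the following; HW.COM stands for the other realm, the hostnames are placeholders, and "." in [capaths] marks a direct trust path:
[realms]
HW.COM = {
kdc = hw.com:88
admin_server = hw.com:749
}
[domain_realm]
.hw.com = HW.COM
hw.com = HW.COM
[capaths]
ICE.COM = {
HW.COM = .
}
HW.COM = {
ICE.COM = .
}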
Advantages of this option:
High efficiency: distcp submits a MapReduce job and transfers data in the map phase; streaming each block with a single read and a single write greatly speeds up the copy;
Stability: distcp supports HA configurations and provides options to limit the number of maps, the number of threads, and the network transfer rate.
Disadvantages of this option:
Highly intrusive: it changes a whole series of Kerberos and Hadoop configurations;
Hurts stability: every configuration change means restarting the cluster to verify it; building from scratch took at least 20 restarts, which would be a showstopper in production;
Tight coupling, hard to extend: the two sides now know about each other and are genuinely coupled; what happens when yet another Kerberos cluster needs to be synchronized?
Harder troubleshooting: both sides require authentication, so diagnosing an authentication failure means digging into two clusters at once.
Option 2: a neutral, trusted HDFS cluster
This option relies on a neutral HDFS cluster that both sides trust.
Requirements for this HDFS cluster:
Hardware: CPU and memory do not need to be high-end, and 100 TB of disk is enough;
Network: both clusters can reach the neutral HDFS cluster, but they do not need to reach each other;
Software: the same HDFS version; only NameNode and DataNode are needed, with no YARN or Kerberos;
A standardized directory layout for transfers.
Steps:
After authenticating locally with Kerberos, the source cluster pushes data with hadoop distcp;
The neutral HDFS cluster watches the push jobs; when a push finishes, some mechanism should notify the target cluster;
When the target cluster receives the push-complete event, it pulls the data from the neutral HDFS cluster and then runs its own downstream processing, as sketched below.
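A minimal sketch of the two hops, assuming neutral-nn is the NameNode address of the neutral cluster, /transfer/orders is an agreed-upon staging directory, and the keytab and principal are the ones from section 4; the notification mechanism between the hops is out of scope here, and ipc.client.fallback-to-simple-auth-allowed may be needed because the neutral cluster has no Kerberos:
# Hop 1, on the source cluster: authenticate locally, then push to the neutral cluster
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
hadoop distcp \
-D ipc.client.fallback-to-simple-auth-allowed=true \
-update \
/warehouse/orders hdfs://neutral-nn:9000/transfer/orders
# Hop 2, on the target cluster: authenticate locally, then pull from the neutral cluster
kinit -kt /etc/security/keytabs/nn.service.keytab nn/$(hostname -f)@EXAMPLE.COM
hadoop distcp \
-D ipc.client.fallback-to-simple-auth-allowed=true \
-update \
hdfs://neutral-nn:9000/transfer/orders /warehouse/orders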
Advantages of this option:
Low intrusion, low risk, easy to extend: neither side's Kerberos service or Hadoop cluster needs any changes, which genuinely lowers the risk;
Better stability: because a neutral cluster sits in the middle, the two clusters' resources do not affect each other during synchronization, unlike option 1;
Simple to implement: just set up the HDFS cluster and whitelist the IPs allowed to access it, and data can flow.
Disadvantages of this option:
Longer synchronization time: the data is copied one extra time;
Automating the synchronization later requires publishing and subscribing to job-completion events;
The neutral HDFS cluster adds hardware cost.
Copyright notice: this is an original article by InfoQ author 冰心的小屋.
Original link: http://xie.infoq.cn/article/2f61aae8426d9dd039ccb4b23. Please contact the author before republishing.