[Hive] A Summary of HiveServer2 Out-of-Memory Issues
- 2024-08-02 Beijing
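1. Introduction
Users ran offline SQL jobs against HiveServer2 (version 3.1.2) through Beeline. After about a week of continuous running, HiveServer2 would hit OOM, which seriously affected data queries and report generation. It took several rounds of fixes before the problem was finally resolved. The author has collected the issues that were fixed here, so that others who run into the same problem have somewhere to start.
2. Cases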
2.1 HIVE-16455
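HiveServer2 leaks file handles when the ADD JAR statement is used. In the lsof output, the jar copies under each session's resource directory are marked as deleted but are still held open by the HiveServer2 process: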
[root@host-10-17-80-111 ~]# lsof -p 29588 | grep "(deleted)"
java 29588 hive 391u REG 252,3 125987 2099944 /tmp/57d98f5b-1e53-44e2-876b-6b4323ac24db_resources/hive-contrib.jar (deleted)
java 29588 hive 392u REG 252,3 125987 2099946 /tmp/eb3184ad-7f15-4a77-a10d-87717ae634d1_resources/hive-contrib.jar (deleted)
java 29588 hive 393r REG 252,3 125987 2099825 /tmp/e29dccfc-5708-4254-addb-7a8988fc0500_resources/hive-contrib.jar (deleted)
java 29588 hive 394r REG 252,3 125987 2099833 /tmp/5153dd4a-a606-4f53-b02c-d606e7e56985_resources/hive-contrib.jar (deleted)
java 29588 hive 395r REG 252,3 125987 2099827 /tmp/ff3cdb05-917f-43c0-830a-b293bf397a23_resources/hive-contrib.jar (deleted)
java 29588 hive 396r REG 252,3 125987 2099822 /tmp/60531b66-5985-421e-8eb5-eeac31fdf964_resources/hive-contrib.jar (deleted)
java 29588 hive 397r REG 252,3 125987 2099831 /tmp/78878921-455c-438c-9735-447566ed8381_resources/hive-contrib.jar (deleted)
java 29588 hive 399r REG 252,3 125987 2099835 /tmp/0e5d7990-30cc-4248-9058-587f7f1ff211_resources/hive-contrib.jar (deleted)
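To tell whether the leak is still growing after a patch, it helps to track how many deleted-but-open handles exist per jar over time. The following is a minimal sketch of such a counter, not part of Hive; the class name `DeletedHandleCounter` is hypothetical, and it assumes the lsof line layout shown above with paths that contain no spaces.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DeletedHandleCounter {
    /** Counts lsof lines ending in "(deleted)", grouped by the file's base name. */
    static Map<String, Integer> countDeleted(List<String> lsofLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lsofLines) {
            String trimmed = line.trim();
            if (!trimmed.endsWith("(deleted)")) {
                continue; // file is still linked on disk, not a leak candidate
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length < 2) {
                continue;
            }
            // Assumption: the path has no spaces, so it is the token before "(deleted)".
            String path = fields[fields.length - 2];
            String baseName = path.substring(path.lastIndexOf('/') + 1);
            counts.merge(baseName, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two lines copied from the lsof output above.
        List<String> sample = Arrays.asList(
            "java 29588 hive 391u REG 252,3 125987 2099944 /tmp/57d98f5b-1e53-44e2-876b-6b4323ac24db_resources/hive-contrib.jar (deleted)",
            "java 29588 hive 392u REG 252,3 125987 2099946 /tmp/eb3184ad-7f15-4a77-a10d-87717ae634d1_resources/hive-contrib.jar (deleted)");
        System.out.println(countDeleted(sample));
    }
}
```

Feeding it the captured lsof output at intervals gives a per-jar count; a count that keeps rising with each new session suggests the handles are still not being released.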
2.2 HIVE-24236
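This one is hard to reproduce; a connection leak may occur only under certain specific conditions. The symptom is that the heartbeat can no longer obtain a connection from the pool: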
2020-09-29T18:44:26,563 INFO [Heartbeater-0]: txn.TxnHandler (TxnHandler.java:checkRetryable(3733)) - Non-retryable error in heartbeat(HeartbeatRequest(lockid:0, txnid:11908)) : Cannot get a connection, general error (SQLState=null, ErrorCode=0)
2020-09-29T18:44:26,564 ERROR [Heartbeater-0]: metastore.RetryingHMSHandler (RetryingHMSHandler.java:invokeInternal(201)) - MetaException(message:Unable to select from transaction database org.apache.commons.dbcp.SQLNestedException: Cannot get a connection, general error
at org.apache.commons.dbcp.PoolingDataSource.getConnection(PoolingDataSource.java:118)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:3605)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:3598)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.heartbeat(TxnHandler.java:2739)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.heartbeat(HiveMetaStore.java:8452)
at sun.reflect.GeneratedMethodAccessor415.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
at com.sun.proxy.$Proxy63.heartbeat(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.heartbeat(HiveMetaStoreClient.java:3247)
at sun.reflect.GeneratedMethodAccessor414.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:213)
at com.sun.proxy.$Proxy64.heartbeat(Unknown Source)
at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.heartbeat(DbTxnManager.java:671)
at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.lambda$run$0(DbTxnManager.java:1102)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager$Heartbeater.run(DbTxnManager.java:1101)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
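"Cannot get a connection" from a pooling data source is what an exhausted pool typically looks like. One common way a pool gets exhausted is a code path that borrows a connection and then skips the release when an exception is thrown. The sketch below illustrates that general pattern only; it is not the TxnHandler code, and `TinyPool`, `Lease`, `leakyHeartbeat` and `safeHeartbeat` are hypothetical names.

```java
import java.util.concurrent.Semaphore;

class PoolLeakDemo {
    /** Hypothetical stand-in for a bounded connection pool. */
    static class TinyPool {
        private final Semaphore permits;

        TinyPool(int size) {
            permits = new Semaphore(size);
        }

        Lease borrow() {
            if (!permits.tryAcquire()) {
                throw new IllegalStateException("Cannot get a connection: pool exhausted");
            }
            return new Lease(this);
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /** A borrowed connection; close() hands it back to the pool exactly once. */
    static class Lease implements AutoCloseable {
        private final TinyPool pool;
        private boolean closed;

        Lease(TinyPool pool) {
            this.pool = pool;
        }

        @Override
        public void close() {
            if (!closed) {
                closed = true;
                pool.permits.release();
            }
        }
    }

    /** Leaky variant: an exception skips the close() call, so the permit is lost. */
    static void leakyHeartbeat(TinyPool pool, boolean fail) {
        Lease lease = pool.borrow();
        if (fail) {
            throw new RuntimeException("query failed");
        }
        lease.close();
    }

    /** Safe variant: try-with-resources returns the connection on every path. */
    static void safeHeartbeat(TinyPool pool, boolean fail) {
        try (Lease lease = pool.borrow()) {
            if (fail) {
                throw new RuntimeException("query failed");
            }
        }
    }

    public static void main(String[] args) {
        TinyPool pool = new TinyPool(2);
        try {
            leakyHeartbeat(pool, true);
        } catch (RuntimeException expected) {
            // the failure itself is expected; the lost permit is the problem
        }
        System.out.println("available after leaky failure: " + pool.available());
    }
}
```

Because the leak only happens on the failing path, it matches the "hard to reproduce" description: the pool shrinks by one each time the rare error occurs, and the service fails much later.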
2.3 HIVE-24552
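When loadDynamicPartitions (Hive.java) is called, it spawns multiple threads to handle the file moves. These threads may open HiveMetaStore connections, and if those connections are not closed promptly, large numbers of them pile up. In the log below, the connections are only being closed by the Finalizer thread, and the count is still above 43,000: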
2020-12-15T17:05:38.485Z hiveserver2-0 hiveserver2 1 a3671b96-74fb-4ee9-b186-aeff0de0bbec [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="Finalizer"] Closed a connection to metastore, current connections: 43901
2020-12-15T17:05:38.485Z hiveserver2-0 hiveserver2 1 a3671b96-74fb-4ee9-b186-aeff0de0bbec [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="Finalizer"] Closed a connection to metastore, current connections: 43900
2020-12-15T17:05:38.485Z hiveserver2-0 hiveserver2 1 a3671b96-74fb-4ee9-b186-aeff0de0bbec [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="Finalizer"] Closed a connection to metastore, current connections: 43899
2020-12-15T17:05:38.485Z hiveserver2-0 hiveserver2 1 a3671b96-74fb-4ee9-b186-aeff0de0bbec [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="Finalizer"] Closed a connection to metastore, current connections: 43898
2020-12-15T17:05:38.485Z hiveserver2-0 hiveserver2 1 a3671b96-74fb-4ee9-b186-aeff0de0bbec [mdc@18060 class="metastore.HiveMetaStoreClient" level="INFO" thread="Finalizer"] Closed a connection to metastore, current connections: 43897
2020-12-15T17:05:38.485Z hiveserver2-0 hiveserver2 1 a3671b96-74fb-4ee9-b186-aeff0de0bbec [mdc@18060 class="transport.TIOStreamTransport" level="WARN" thread="Finalizer"] Error closing output stream.
java.net.SocketException: Socket closed
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.FilterOutputStream.close(FilterOutputStream.java:158)
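The pattern behind this kind of pile-up can be shown with a thread-local client that worker threads create on first use. If the task does not close it, the connection stays open until something else (here, the finalizer) gets around to it. This is a self-contained sketch under that assumption; `FakeClient`, `client()` and `closeCurrent()` are hypothetical and only count open connections.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class ThreadLocalClientDemo {
    static final AtomicInteger OPEN = new AtomicInteger();

    /** Hypothetical stand-in for a metastore client; it only counts open connections. */
    static class FakeClient implements AutoCloseable {
        FakeClient() {
            OPEN.incrementAndGet();
        }

        @Override
        public void close() {
            OPEN.decrementAndGet();
        }
    }

    static final ThreadLocal<FakeClient> CLIENT = new ThreadLocal<>();

    /** Returns the calling thread's client, creating it on first use. */
    static FakeClient client() {
        FakeClient c = CLIENT.get();
        if (c == null) {
            c = new FakeClient();
            CLIENT.set(c);
        }
        return c;
    }

    /** Closes and forgets the calling thread's client, if it has one. */
    static void closeCurrent() {
        FakeClient c = CLIENT.get();
        if (c != null) {
            c.close();
            CLIENT.remove();
        }
    }

    /** Runs the tasks on a fixed pool and returns how many clients are still open. */
    static int run(int threads, int tasks, boolean cleanUp) {
        OPEN.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            pool.submit(() -> {
                try {
                    client(); // simulate a file-move task that touches the metastore
                } finally {
                    if (cleanUp) {
                        closeCurrent();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return OPEN.get();
    }

    public static void main(String[] args) {
        System.out.println("open without cleanup: " + run(4, 8, false));
        System.out.println("open with cleanup:    " + run(4, 8, true));
    }
}
```

Without the cleanup, every worker thread leaves one connection behind per pool; with a new pool created for each dynamic-partition load, that adds up quickly.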
2.4 HIVE-24858
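If a UDF JAR is registered in a session and a temporary function is created from it, the UDFClassLoader is not reclaimed by GC when the session closes. The heap dump shows what still references it: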
Class Name | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
contextClassLoader org.apache.hive.service.server.ThreadWithGarbageCleanup @ 0x7164deb50 HiveServer2-Handler-Pool: Thread-72 Thread| 128 | 79,072
referent java.util.WeakHashMap$Entry @ 0x7164e67d0 | 40 | 824
'- [6] java.util.WeakHashMap$Entry[16] @ 0x71581aac0 | 80 | 5,056
'- table java.util.WeakHashMap @ 0x71580f510 | 48 | 6,920
'- CACHE_CLASSES class org.apache.hadoop.conf.Configuration @ 0x71580f3d8 | 64 | 74,528
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
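In the dump above, one of the references comes from the `contextClassLoader` field of a handler-pool thread. Pooled threads outlive sessions, so a loader left installed on one stays strongly reachable. A general defensive pattern, sketched here and not taken from the actual Hive patch, is to restore the thread's previous context class loader and close the session loader when the session ends. `SessionLoaderDemo` and `runInSession` are hypothetical names, and a plain `URLClassLoader` stands in for the UDF loader.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;

class SessionLoaderDemo {
    /** Runs body with a session-scoped loader installed, then restores and closes it. */
    static void runInSession(Runnable body) {
        Thread thread = Thread.currentThread();
        ClassLoader original = thread.getContextClassLoader();
        // An empty URL list keeps the sketch self-contained;
        // a real session would list its UDF jars here.
        URLClassLoader sessionLoader = new URLClassLoader(new URL[0], original);
        thread.setContextClassLoader(sessionLoader);
        try {
            body.run();
        } finally {
            // Drop the pooled thread's strong reference to the session loader.
            thread.setContextClassLoader(original);
            try {
                sessionLoader.close(); // release any jar handles held by the loader
            } catch (IOException e) {
                // best effort: the reference has already been dropped
            }
        }
    }

    public static void main(String[] args) {
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        runInSession(() -> System.out.println(
            "inside session: " + Thread.currentThread().getContextClassLoader()));
        System.out.println("restored: "
            + (Thread.currentThread().getContextClassLoader() == before));
    }
}
```

Restoring the context class loader only removes one reference path. Any cache that holds the loader or classes loaded by it strongly would need to be cleared as well before GC can reclaim it.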
2.5 HIVE-26404
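HiveMetaStore becomes unresponsive because of long JVM garbage-collection pauses: org.apache.hadoop.conf.Configuration objects occupy too much heap, creating an OOM risk. In the heap dump below, they are retained through the entries of FileSystem$Cache: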
Class Name | Shallow Heap | Retained Heap
----------------------------------------------------------------------------------------------------------------------
org.apache.hadoop.fs.FileSystem$Cache @ 0x45403fe70 | 32 | 108,671,824
|- <class> class org.apache.hadoop.fs.FileSystem$Cache @ 0x45410c3e0 | 8 | 544
'- map java.util.HashMap @ 0x453ffb598 | 48 | 92,777,232
|- <class> class java.util.HashMap @ 0x4520382c8 System Class | 40 | 168
|- entrySet java.util.HashMap$EntrySet @ 0x454077848 | 16 | 16
'- table java.util.HashMap$Node[32768] @ 0x463585b68 | 131,088 | 92,777,168
|- class java.util.HashMap$Node[] @ 0x4520b7790 | 0 | 0
'- [1786] java.util.HashMap$Node @ 0x451998ce0 | 32 | 9,968
|- <class> class java.util.HashMap$Node @ 0x4520b7728 System Class | 8 | 32
'- value org.apache.hadoop.hdfs.DistributedFileSystem @ 0x452990178 | 56 | 4,976
|- <class> class org.apache.hadoop.hdfs.DistributedFileSystem @ 0x45402e290| 8 | 4,664
|- uri java.net.URI @ 0x451a05cd0 hdfs://nameservice1 | 80 | 432
|- dfs org.apache.hadoop.hdfs.DFSClient @ 0x451f5d9b8 | 128 | 3,824
'- conf org.apache.hadoop.hive.conf.HiveConf @ 0x453a34b38 | 80 | 250,160
----------------------------------------------------------------------------------------------------------------------
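The dump shows a `FileSystem$Cache` map with a 32768-slot table, each entry holding a file system object and its own `HiveConf`. To my understanding, Hadoop keys this cache on scheme, authority and the calling user (UGI), with users compared by identity, so a fresh user object per request adds a fresh entry each time. The model below illustrates that growth without depending on Hadoop; `FsCacheModel`, `Key` and `closeAllFor` are hypothetical names, and `owner` stands in for the UGI.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class FsCacheModel {
    /** Cache key: scheme + authority + owner, where owner is compared by identity. */
    static final class Key {
        final String scheme;
        final String authority;
        final Object owner;

        Key(String scheme, String authority, Object owner) {
            this.scheme = scheme;
            this.authority = authority;
            this.owner = owner;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key)) {
                return false;
            }
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority) && owner == k.owner;
        }

        @Override
        public int hashCode() {
            return Objects.hash(scheme, authority) * 31 + System.identityHashCode(owner);
        }
    }

    private final Map<Key, Object> cache = new HashMap<>();

    /** Returns the cached instance for this key, creating one on a miss. */
    Object get(String scheme, String authority, Object owner) {
        return cache.computeIfAbsent(new Key(scheme, authority, owner), k -> new Object());
    }

    /** Evicts every entry that belongs to the given owner. */
    void closeAllFor(Object owner) {
        cache.keySet().removeIf(k -> k.owner == owner);
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        FsCacheModel model = new FsCacheModel();
        for (int i = 0; i < 1000; i++) {
            // A new owner object per request never hits the cache.
            model.get("hdfs", "nameservice1", new Object());
        }
        System.out.println("entries after 1000 requests: " + model.size());
    }
}
```

In Hadoop itself, `FileSystem.closeAllForUGI(ugi)` plays the role of `closeAllFor` here. Whether a missing call of that kind is the exact cause in HIVE-26404 should be checked against the JIRA.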
2.6 HIVE-22275
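When a single Hive session executes multiple SQL statements, OperationManager.queryIdOperation is not cleaned up properly, creating an OOM risk. In the log below, several operations are added, but the later removals all use the same queryId: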
2019-09-13T08:37:36,785 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=dfed4c18-a284-4640-9f4a-1a20527105f9]
2019-09-13T08:37:38,432 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Removed queryId: hive_20190913083736_c49cf3cc-cfe8-48a1-bd22-8b924dfb0396 corresponding to operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=dfed4c18-a284-4640-9f4a-1a20527105f9] with tag: null
2019-09-13T08:37:38,469 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=24d0030c-0e49-45fb-a918-2276f0941cfb]
2019-09-13T08:37:52,662 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=b983802c-1dec-4fa0-8680-d05ab555321b]
2019-09-13T08:37:56,239 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=75dbc531-2964-47b2-84d7-85b59f88999c]
2019-09-13T08:38:30,791 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=b697c801-7da0-4544-bcfa-442eb1d3bd77]
2019-09-13T08:39:10,187 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Adding operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=bda93c8f-0822-4592-a61c-4701720a1a5c]
2019-09-13T08:39:15,471 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Removed queryId: hive_20190913083910_c4809ca8-d8db-423c-8b6d-fbe3eee89971 corresponding to operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=24d0030c-0e49-45fb-a918-2276f0941cfb] with tag: null
2019-09-13T08:39:15,507 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Removed queryId: hive_20190913083910_c4809ca8-d8db-423c-8b6d-fbe3eee89971 corresponding to operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=b983802c-1dec-4fa0-8680-d05ab555321b] with tag: null
2019-09-13T08:39:15,538 INFO [8eaa1601-f045-4ad5-9c2e-1e5944b75f6a HiveServer2-Handler-Pool: Thread-202]: operation.OperationManager (:()) - Removed queryId: hive_20190913083910_c4809ca8-d8db-423c-8b6d-fbe3eee89971 corresponding to operation: OperationHandle [opType=EXECUTE_STATEMENT, getHandleIdentifier()=75dbc531-2964-47b2-84d7-85b59f88999c] with tag: null
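The log is consistent with the removal looking up the queryId from shared session state that each new statement overwrites, so only the latest id is ever removed. The sketch below models that behaviour; it is an illustration under that assumption, not the OperationManager source, and `QueryIdMapDemo` with its methods is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class QueryIdMapDemo {
    private final Map<String, String> queryIdToOperation = new HashMap<>();
    private String sessionQueryId; // overwritten by every new statement in the session

    void add(String queryId, String operationId) {
        sessionQueryId = queryId;
        queryIdToOperation.put(queryId, operationId);
    }

    /** Buggy: reads the id from session state, which may belong to a later statement. */
    void removeUsingSessionState() {
        queryIdToOperation.remove(sessionQueryId);
    }

    /** Safer: the caller passes the id recorded when the operation was added. */
    void removeByOwnId(String queryId) {
        queryIdToOperation.remove(queryId);
    }

    int size() {
        return queryIdToOperation.size();
    }

    public static void main(String[] args) {
        QueryIdMapDemo buggy = new QueryIdMapDemo();
        buggy.add("q1", "op1");
        buggy.add("q2", "op2");
        buggy.add("q3", "op3");
        // Three operations finish, but each removal targets the latest id, "q3".
        buggy.removeUsingSessionState();
        buggy.removeUsingSessionState();
        buggy.removeUsingSessionState();
        System.out.println("entries left behind: " + buggy.size());
    }
}
```

Each long-lived session that runs statements this way leaves entries behind, which is why the map grows steadily over a week of batch jobs.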
2.7 HIVE-24590
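Log output files are not properly closed or deleted, so the RandomAccessFileManager instances in Log4j occupy too much heap space, creating an OOM risk.
3. Summary
The author runs HiveServer2 3.1.2. This version has quite a few memory-leak issues; readers can apply the fixes for the cases above and rebuild. For other bugs or performance problems, it is worth checking with the community.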
Copyright notice: This is an original article by InfoQ author 【扬_帆_起_航】.
Original link: 【http://xie.infoq.cn/article/647b7e717b7edc87a7d692e36】. Please contact the author before reposting.