Hadoop 2.6 bug: CapacityScheduler thread deadlock

Posted on 2016-6-21 10:44:50
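I recently ran into a problem: while a long-running job was executing and queue information was being fetched and computed continuously, the ResourceManager ran out of memory and threw an OOM exception. The cause was that once the CapacityScheduler's handle function deadlocked, subsequent events from the cluster nodes could no longer be processed, so the ResourceManager's internal event queue kept growing until the process finally died.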
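To make the failure chain concrete, here is a minimal Java sketch of how a single stuck consumer lets an unbounded event queue grow. It is only an illustration under stated assumptions: the class `StalledDispatcherDemo`, the lock object `schedulerLock`, and the `NODE_UPDATE-n` strings are hypothetical stand-ins, not Hadoop code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; not part of Hadoop.
class StalledDispatcherDemo {

    // Enqueues `events` events while the only consumer thread is stuck on a
    // lock, and returns how many events are still waiting in the queue.
    static int backlogAfter(int events) {
        BlockingQueue<String> eventQueue = new LinkedBlockingQueue<>(); // unbounded
        Object schedulerLock = new Object(); // stands in for the scheduler monitor

        Thread processor = new Thread(() -> {
            try {
                while (true) {
                    String event = eventQueue.take();
                    synchronized (schedulerLock) {
                        // a real handler would process `event` here
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "demo-event-processor");
        processor.setDaemon(true);

        // Hold the lock while producing, to imitate a monitor that is never released.
        synchronized (schedulerLock) {
            processor.start();
            for (int i = 0; i < events; i++) {
                eventQueue.add("NODE_UPDATE-" + i);
            }
            // Bounded wait until the processor has taken its first event.
            // It then blocks on schedulerLock and cannot take another one.
            for (int i = 0; i < 500 && eventQueue.size() == events; i++) {
                try {
                    Thread.sleep(10);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return eventQueue.size();
        }
    }

    public static void main(String[] args) {
        System.out.println("events still queued: " + backlogAfter(1000));
    }
}
```

The consumer removes one event and then waits for the lock, so every later event stays in the queue. In a real ResourceManager the producers never stop, which is why the heap eventually fills up.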

It took several days of investigation to pin this down, which was quite painful. An official patch for the issue is available at: https://issues.apache.org/jira/browse/YARN-3251

The stack trace from my case is as follows:

Found one Java-level deadlock:
=============================
"IPC Server handler 36 on 8032":
  waiting to lock monitor 0x00002ad1dbaaa6b8 (object 0x00000000c1c5cd98, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue),
  which is held by "ResourceManager Event Processor"
"ResourceManager Event Processor":
  waiting to lock monitor 0x00002ad1d952c148 (object 0x00000000c19e1fc8, a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue),
  which is held by "IPC Server handler 36 on 8032"

Java stack information for the threads listed above:
===================================================
"IPC Server handler 36 on 8032":
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.getQueueInfo(LeafQueue.java:445)
        - waiting to lock <0x00000000c1c5cd98> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.getQueueInfo(ParentQueue.java:221)
        - locked <0x00000000c19e1fc8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.getQueueInfo(CapacityScheduler.java:1105)
        at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getQueueInfo(ClientRMService.java:852)
        at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getQueueInfo(ApplicationClientProtocolPBServiceImpl.java:293)
        at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:495)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:628)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
"ResourceManager Event Processor":
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.AbstractCSQueue.getParent(AbstractCSQueue.java:204)
        - waiting to lock <0x00000000c19e1fc8> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:177)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CSQueueUtils.getAbsoluteMaxAvailCapacity(CSQueueUtils.java:183)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.computeUserLimitAndSetHeadroom(LeafQueue.java:1060)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.checkLimitsToReserve(LeafQueue.java:1376)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainerByWorkflowResource(LeafQueue.java:1740)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainer(LeafQueue.java:1557)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignOffSwitchContainers(LeafQueue.java:1435)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainersOnNode(LeafQueue.java:1319)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignReservedContainer(LeafQueue.java:925)
        - locked <0x00000000c1c5cd98> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:770)
        - locked <0x00000000c292bc90> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp)
        - locked <0x00000000c1c5cd98> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1190)
        - locked <0x00000000c189a360> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1255)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:132)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:739)
        at java.lang.Thread.run(Thread.java:745)

Found 1 deadlock.
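
The dump above shows a lock-order inversion: the IPC handler holds the ParentQueue monitor and waits for the LeafQueue, while the event processor holds the LeafQueue monitor and waits for the ParentQueue. The following Java sketch reproduces the same pattern with two plain lock objects and detects it with the standard `ThreadMXBean.findDeadlockedThreads()` call. The class `LockOrderInversionDemo`, the thread names, and the two lock objects are hypothetical stand-ins for illustration; this is not the Hadoop code and not the YARN-3251 patch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo class; the two lock objects only stand in for the queues.
class LockOrderInversionDemo {
    static final Object parentQueue = new Object();
    static final Object leafQueue = new Object();

    // Builds a daemon thread that locks `first`, waits until the other thread
    // also holds its first lock, and then requests `second`.
    static Thread lockInOrder(String name, Object first, Object second,
                              CountDownLatch bothReady) {
        Thread t = new Thread(() -> {
            synchronized (first) {
                bothReady.countDown();
                try {
                    bothReady.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                synchronized (second) {
                    // never reached once both threads hold their first lock
                }
            }
        }, name);
        t.setDaemon(true); // lets the JVM exit although the threads stay stuck
        return t;
    }

    // Starts two threads that take the locks in opposite order and returns
    // the number of threads the JVM reports as deadlocked (0 on timeout).
    static int deadlockedThreadCount() {
        CountDownLatch bothReady = new CountDownLatch(2);
        lockInOrder("demo-ipc-handler", parentQueue, leafQueue, bothReady).start();
        lockInOrder("demo-event-processor", leafQueue, parentQueue, bothReady).start();

        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 200; i++) {
            long[] ids = threads.findDeadlockedThreads(); // null when no deadlock
            if (ids != null) {
                return ids.length;
            }
            try {
                Thread.sleep(25);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

A common general remedy for this pattern is to acquire the locks in one fixed order everywhere, or to avoid calling into the parent while the leaf lock is held. Whether the official patch takes either approach should be checked in the JIRA issue itself.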
