Spark "Driver Disassociated! Shutting down." / Lost Executor Problem


Posted on 2016-02-16 10:16:55
Today, while running a business-processing job on the cluster, I hit the following error:
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 3759.0 in stage 0.0 (TID 3898). 4211 bytes result sent to driver
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4059
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4044.0 in stage 0.0 (TID 4059)
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2171080782096+536864684
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 4044.0 in stage 0.0 (TID 4059). 4211 bytes result sent to driver
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4220
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4274.0 in stage 0.0 (TID 4220)
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2294559659416+536864684
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 4274.0 in stage 0.0 (TID 4220). 4211 bytes result sent to driver
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4294
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4305.0 in stage 0.0 (TID 4294)
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2311202464620+536864684
16/02/16 09:48:34 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 4305.0 in stage 0.0 (TID 4294). 4211 bytes result sent to driver
16/02/16 09:48:35 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4398
16/02/16 09:48:35 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4352.0 in stage 0.0 (TID 4398)
16/02/16 09:48:35 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2336435104768+536864684
16/02/16 09:48:39 ERROR org.apache.spark.Logging$class.logError(Logging.scala:75): Driver 192.168.146.35:45087 disassociated! Shutting down.
16/02/16 09:48:39 WARN akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$2.apply$mcV$sp(Slf4jLogger.scala:71): Association with remote system [akka.tcp://sparkDriver@192.168.146.35:45087] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/02/16 09:48:39 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Shutdown hook called
16/02/16 09:48:40 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Shutdown hook called
16/02/16 09:48:40 ERROR org.apache.spark.Logging$class.logError(Logging.scala:96): Exception in task 4352.0 in stage 0.0 (TID 4398)
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:786)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at java.io.DataInputStream.readFully(DataInputStream.java:169)
        at sinopec.pi.pisparkapp.hadoop.ioformat.SeisDataInputFormat$SeisDataRecordReader.nextKeyValue(SeisDataInputFormat.java:199)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)


I then logged into the failing node and found the following entry in the NodeManager log:
2016-02-16 09:48:45,215 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger.logSuccess(NMAuditLogger.java:89): USER=pai        OPERATION=Container Finished - Killed   TARGET=ContainerImpl    RESULT=SUCCESS  APPID=application_1453342938261_0112    CONTAINERID=container_1453342938261_0112_01_000001


After resubmitting the job, I found that each executor was using a large amount of memory.

Problem analysis
1. Each node has 64 GB of memory. Once multiple containers start, memory becomes severely tight, which can cause the driver's memory allocation to fail.
2. The problem only shows up when processing large datasets (over 1 TB).
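To make point 1 concrete: on YARN, Spark sizes each executor container as the executor heap plus a memory overhead (in recent Spark 1.x releases, the larger of 384 MB and roughly 10% of the heap). A quick sketch, using hypothetical settings rather than values from this job, shows how a 64 GB node fills up:

```python
# Rough per-container memory as YARN reserves it (Spark 1.x on YARN):
#   container = executor heap + overhead, overhead = max(384 MB, ~10% of heap).
# The 8 GB heap below is a hypothetical example, not a value from the post.

def container_mb(heap_mb, overhead_factor=0.10, min_overhead_mb=384):
    """Memory YARN reserves for one executor container, in MB."""
    return heap_mb + max(min_overhead_mb, int(heap_mb * overhead_factor))

node_mb = 64 * 1024            # 64 GB node, as in the post
heap_mb = 8 * 1024             # hypothetical --executor-memory 8g
per_container = container_mb(heap_mb)

print(per_container)           # 9011 MB per container
print(node_mb // per_container)  # 7 such containers fit on the node
```

With seven such executors on one node, only a couple of GB remain for the OS, the NodeManager, and (in cluster mode) the ApplicationMaster/driver container, which is consistent with the allocation failure suspected above.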

Summary
The likely cause is a failed memory allocation. After increasing the driver's memory overhead (spark.yarn.driver.memoryOverhead), the problem was resolved.
Every executor tries to connect to the driver, so when the driver crashes unexpectedly or the network lags, executors fail to reach it. Increase the timeouts or fix the network to avoid warnings like the following:
16/02/16 09:48:39 WARN akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$2.apply$mcV$sp(Slf4jLogger.scala:71): Association with remote system [akka.tcp://sparkDriver@192.168.146.35:45087] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
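Both fixes can be passed at submit time. A minimal sketch of the flags involved; the memory and timeout values below are illustrative placeholders, not the poster's actual settings:

```shell
# Illustrative spark-submit flags for Spark 1.x on YARN (tune values for your cluster).
# spark.yarn.driver.memoryOverhead: extra off-heap memory (MB) YARN reserves for the driver container.
# spark.network.timeout: raises the default 120s timeout covering network and RPC interactions.
spark-submit \
  --master yarn-cluster \
  --driver-memory 8g \
  --conf spark.yarn.driver.memoryOverhead=2048 \
  --conf spark.network.timeout=300s \
  your-app.jar
```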
