While running a job on the cluster today, I hit the following error:
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 3759.0 in stage 0.0 (TID 3898). 4211 bytes result sent to driver
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4059
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4044.0 in stage 0.0 (TID 4059)
16/02/16 09:47:59 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2171080782096+536864684
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 4044.0 in stage 0.0 (TID 4059). 4211 bytes result sent to driver
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4220
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4274.0 in stage 0.0 (TID 4220)
16/02/16 09:48:16 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2294559659416+536864684
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 4274.0 in stage 0.0 (TID 4220). 4211 bytes result sent to driver
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4294
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4305.0 in stage 0.0 (TID 4294)
16/02/16 09:48:23 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2311202464620+536864684
16/02/16 09:48:34 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Finished task 4305.0 in stage 0.0 (TID 4294). 4211 bytes result sent to driver
16/02/16 09:48:35 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Got assigned task 4398
16/02/16 09:48:35 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Running task 4352.0 in stage 0.0 (TID 4398)
16/02/16 09:48:35 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Input split: /Test_yang/zxy_20151215/WF1-4_sort_ico_12:2336435104768+536864684
16/02/16 09:48:39 ERROR org.apache.spark.Logging$class.logError(Logging.scala:75): Driver 192.168.146.35:45087 disassociated! Shutting down.
16/02/16 09:48:39 WARN akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$2.apply$mcV$sp(Slf4jLogger.scala:71): Association with remote system [akka.tcp://sparkDriver@192.168.146.35:45087] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
16/02/16 09:48:39 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Shutdown hook called
16/02/16 09:48:40 INFO org.apache.spark.Logging$class.logInfo(Logging.scala:59): Shutdown hook called
16/02/16 09:48:40 ERROR org.apache.spark.Logging$class.logError(Logging.scala:96): Exception in task 4352.0 in stage 0.0 (TID 4398)
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:786)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:847)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at sinopec.pi.pisparkapp.hadoop.ioformat.SeisDataInputFormat$SeisDataRecordReader.nextKeyValue(SeisDataInputFormat.java:199)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
I then logged in to the problematic node and found the following entry in the NodeManager log:
2016-02-16 09:48:45,215 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger.logSuccess(NMAuditLogger.java:89): USER=pai OPERATION=Container Finished - Killed TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1453342938261_0112 CONTAINERID=container_1453342938261_0112_01_000001
When I resubmitted the job, I noticed that each executor was using a large amount of memory.
Problem analysis
1. Each node has 64 GB of memory. With multiple containers running, memory becomes severely constrained, which may cause the driver's memory allocation to fail.
2. The problem only occurs when processing large datasets (over 1 TB).
Summary
The likely cause is a failed memory allocation. After increasing the driver's memory overhead (spark.yarn.driver.memoryOverhead), the problem was resolved.
Every executor tries to connect to the driver. If the driver crashes unexpectedly or the network is slow, executors will fail to reach it. Increasing the timeouts or improving the network can help avoid errors like the following:
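As a sketch, the overhead can be raised at submission time on Spark 1.x with YARN; the values and the jar name below are illustrative, not tuned for this cluster:

```shell
# Raise the off-heap memory overhead reserved for the driver and executor
# containers on YARN (values in MB; illustrative, tune for your workload).
# "your-app.jar" is a placeholder for the actual application jar.
spark-submit \
  --master yarn-cluster \
  --driver-memory 8g \
  --conf spark.yarn.driver.memoryOverhead=2048 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  your-app.jar
```

The overhead covers JVM and off-heap usage beyond the configured heap; when a container exceeds heap + overhead, YARN kills it, which matches the "Container Finished - Killed" entry in the NodeManager log above.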
16/02/16 09:48:39 WARN akka.event.slf4j.Slf4jLogger$$anonfun$receive$1$$anonfun$applyOrElse$2.apply$mcV$sp(Slf4jLogger.scala:71): Association with remote system [akka.tcp://sparkDriver@192.168.146.35:45087] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
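If the disassociation is caused by transient network latency rather than a driver crash, raising Spark's timeouts is a common mitigation. On the Spark 1.x Akka-based stack shown in the log, a sketch might look like this (values and jar name are illustrative assumptions):

```shell
# Illustrative timeout settings for Spark 1.x (Akka-based RPC).
# spark.akka.timeout is in seconds; spark.network.timeout accepts a unit suffix.
# "your-app.jar" is a placeholder for the actual application jar.
spark-submit \
  --master yarn-cluster \
  --conf spark.akka.timeout=300 \
  --conf spark.network.timeout=300s \
  your-app.jar
```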