Monday, December 1, 2014

Hadoop Troubleshooting: GC overhead limit exceeded

GC overhead limit exceeded


 craigtrim@CVB:/usr/lib/apache/hadoop/2.5.2/bin$ hadoop jar sandbox-1.0-SNAPSHOT.jar dev.hadoop.sandbox.counter.WordCountRunner /nyt/1987/ /out2  
 Input Directory = /nyt/1987/  
 Output Directory = /out2  
 2014-12-01 18:37:07,381 INFO [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(1019)) - session.id is deprecated. Instead, use dfs.metrics.session-id  
 2014-12-01 18:37:07,385 INFO [main] jvm.JvmMetrics (JvmMetrics.java:init(76)) - Initializing JVM Metrics with processName=JobTracker, sessionId=  
 2014-12-01 18:37:07,743 WARN [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(150)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.  
 2014-12-01 18:37:14,857 INFO [main] input.FileInputFormat (FileInputFormat.java:listStatus(281)) - Total input paths to process : 106105  
 2014-12-01 18:37:17,228 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(396)) - number of splits:106105  
 2014-12-01 18:37:17,329 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(479)) - Submitting tokens for job: job_local1925425636_0001  
 2014-12-01 18:37:17,359 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-craigtrim/mapred/staging/craigtrim1925425636/.staging/job_local1925425636_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.  
 2014-12-01 18:37:17,362 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/tmp/hadoop-craigtrim/mapred/staging/craigtrim1925425636/.staging/job_local1925425636_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.  
 2014-12-01 18:37:17,465 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/home/craigtrim/HADOOP_DATA_DIR/local/localRunner/craigtrim/job_local1925425636_0001/job_local1925425636_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.  
 2014-12-01 18:37:17,468 WARN [main] conf.Configuration (Configuration.java:loadProperty(2368)) - file:/home/craigtrim/HADOOP_DATA_DIR/local/localRunner/craigtrim/job_local1925425636_0001/job_local1925425636_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.  
 2014-12-01 18:37:17,476 INFO [main] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://localhost:8080/  
 2014-12-01 18:37:17,479 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1334)) - Running job: job_local1925425636_0001  
 2014-12-01 18:37:17,482 INFO [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(471)) - OutputCommitter set in config null  
 2014-12-01 18:37:17,489 INFO [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:createOutputCommitter(489)) - OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter  
 2014-12-01 18:37:18,484 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Job job_local1925425636_0001 running in uber mode : false  
 2014-12-01 18:37:18,486 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1362)) - map 0% reduce 0%  
 2014-12-01 18:37:37,400 WARN [Thread-5] mapred.LocalJobRunner (LocalJobRunner.java:run(560)) - job_local1925425636_0001  
 java.lang.OutOfMemoryError: GC overhead limit exceeded  
      at java.util.HashMap.newNode(HashMap.java:1734)  
      at java.util.HashMap.putVal(HashMap.java:630)  
      at java.util.HashMap.putMapEntries(HashMap.java:514)  
      at java.util.HashMap.<init>(HashMap.java:489)  
      at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:673)  
      at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:440)  
      at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.<init>(LocalJobRunner.java:217)  
      at org.apache.hadoop.mapred.LocalJobRunner$Job.getMapTaskRunnables(LocalJobRunner.java:272)  
      at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:517)  
 2014-12-01 18:37:38,354 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1375)) - Job job_local1925425636_0001 failed with state FAILED due to: NA  
 2014-12-01 18:37:38,432 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1380)) - Counters: 0  
 craigtrim@CVB:/usr/lib/apache/hadoop/2.5.2/bin$   


Potential Causes

  1. The NameNode was formatted, and the DataNodes already had data in the HADOOP_DATA_DIR
    1. Triggered when I attempted to load data into HDFS


Solution

  1. Re-create the HADOOP_DATA_DIR (clear out or move aside any existing contents first)
     mkdir -p $HADOOP_DATA_DIR/data  
     mkdir -p $HADOOP_DATA_DIR/name  
     mkdir -p $HADOOP_DATA_DIR/local  
     sudo chmod 755 $HADOOP_DATA_DIR  
    
  2. Format your NameNode
     cd $HADOOP_HOME/bin  
     ./hdfs namenode -format  
    
  3. Re-start Services (see the sketch below)
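
A minimal restart sketch, assuming a standard Hadoop 2.x layout where the start/stop scripts live in $HADOOP_HOME/sbin (script names and locations vary by distribution):

     cd $HADOOP_HOME/sbin
     ./stop-yarn.sh && ./stop-dfs.sh    # stop any daemons that are still running
     ./start-dfs.sh                     # NameNode, DataNodes, SecondaryNameNode
     ./start-yarn.sh                    # ResourceManager, NodeManagers
     jps                                # confirm the expected daemons are up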


Shortcomings

  1. If your HADOOP_DATA_DIR has data that you don't want to lose, this isn't a great solution


References

  1. http://stackoverflow.com/questions/16020334/hadoop-datanode-process-killed
    1. "Problem could be that NN was formatted after cluster was set up and DN were not, so slaves are still referring to old NN."
  2. http://stackoverflow.com/questions/22316187/datanode-not-starts-correctly
    1. A more fine-grained approach that corrects the individual clusterIDs, rather than brute-force removal of the entire HADOOP_DATA_DIR

Hadoop Troubleshooting: Incompatible Cluster IDs

Incompatible clusterIDs



2015-03-23 21:09:06,824 WARN  [IPC Server handler 5 on 9000] namenode.NameNode (NameNodeRpcServer.java:verifyRequest(1177)) - Registration IDs mismatched: the DatanodeRegistration ID is NS-1432439258-CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5-0 but the expected ID is NS-691282619-CID-efa3290f-7776-4cfc-8e92-a438d11abdd8-0
2015-03-23 21:09:06,825 INFO  [IPC Server handler 5 on 9000] ipc.Server (Server.java:run(2060)) - IPC Server handler 5 on 9000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 192.168.1.201:37895 Call#13731 Retry#0
org.apache.hadoop.hdfs.protocol.UnregisteredNodeException: Unregistered server: DatanodeRegistration(192.168.1.201, datanodeUuid=5c2239a4-d14c-4549-8b87-f956ece5d946, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5;nsid=1432439258;c=0)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.verifyRequest(NameNodeRpcServer.java:1180)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1074)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26380)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)


Potential Causes

  1. The NameNode was formatted, and the DataNodes already had data in the HADOOP_DATA_DIR
    1. Triggered when I attempted to load data into HDFS



Solution 1


The ClusterID from the NameNode needs to be copied onto the DataNode.

The first line of the NameNode log should contain the error message showing the incompatible cluster IDs:
the DatanodeRegistration ID is NS-1432439258-CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5-0 but the expected ID is NS-691282619-CID-efa3290f-7776-4cfc-8e92-a438d11abdd8-0
Copy the NameNode ClusterID to the VERSION file in the $HADOOP_DATA_DIR/data/current directory.

craig@dn02:~/HADOOP_DATA_DIR/data/current$ cat VERSION
#Mon Mar 23 09:37:50 PDT 2015
storageID=DS-821e9b0e-c3c8-478d-9829-9d1f721b84ed
clusterID=CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5
cTime=0
datanodeUuid=c9363e06-81bf-467f-958c-042292edd3bf
storageType=DATA_NODE
layoutVersion=-56

The clusterID value shown in the VERSION file above needs to be changed to match the CID portion of the "expected ID" found in the NameNode log file.
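
A minimal sketch of the edit, assuming the NameNode keeps its metadata under $HADOOP_DATA_DIR/name/current (the layout used elsewhere in these notes) and GNU sed on the DataNode; the CID value is the one reported as "expected" in the log above:

     # On the NameNode: read the authoritative clusterID
     grep clusterID $HADOOP_DATA_DIR/name/current/VERSION

     # On the DataNode: rewrite only the clusterID line, leaving
     # datanodeUuid and storageID untouched
     sed -i 's/^clusterID=.*/clusterID=CID-efa3290f-7776-4cfc-8e92-a438d11abdd8/' \
         $HADOOP_DATA_DIR/data/current/VERSION

Rewriting only the clusterID line keeps the node-specific datanodeUuid and storageID values intact, which matters for the warning below.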

IMPORTANT:  If you are operating a multi-node cluster, do not copy this file onto every data node in the cluster (replacing each existing VERSION file). Doing so would overwrite the datanodeUuid and storageID values, which must remain unique to each node.

Doing this will trigger an UnregisteredNodeException error:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnregisteredNodeException): Data node DatanodeRegistration(192.168.1.204, datanodeUuid=5c2239a4-d14c-4549-8b87-f956ece5d946, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-efa3290f-7776-4cfc-8e92-a438d11abdd8;nsid=691282619;c=0) is attempting to report storage ID 5c2239a4-d14c-4549-8b87-f956ece5d946. Node 192.168.1.203:50010 is expected to serve this storage.
 at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getDatanode(DatanodeManager.java:477)
 at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1780)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1097)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
 at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

 at org.apache.hadoop.ipc.Client.call(Client.java:1468)
 at org.apache.hadoop.ipc.Client.call(Client.java:1399)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
 at com.sun.proxy.$Proxy12.blockReport(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:175)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:492)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:715)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
 at java.lang.Thread.run(Thread.java:745)
2015-03-23 21:48:49,812 WARN  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] datanode.DataNode (BPServiceActor.java:run(861)) - Ending block pool service for: Block pool BP-494263941-10.0.4.15-1426797442562 (Datanode Uuid 5c2239a4-d14c-4549-8b87-f956ece5d946) service to master/192.168.1.70:9000
2015-03-23 21:48:49,913 INFO  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool BP-494263941-10.0.4.15-1426797442562 (Datanode Uuid 5c2239a4-d14c-4549-8b87-f956ece5d946)
2015-03-23 21:48:49,914 INFO  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] datanode.DataBlockScanner (DataBlockScanner.java:removeBlockPool(273)) - Removed bpid=BP-494263941-10.0.4.15-1426797442562 from blockPoolScannerMap
2015-03-23 21:48:49,914 INFO  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] impl.FsDatasetImpl (FsDatasetImpl.java:shutdownBlockPool(2217)) - Removing block pool BP-494263941-10.0.4.15-1426797442562
2015-03-23 21:48:51,915 WARN  [main] datanode.DataNode (DataNode.java:secureMain(2392)) - Exiting Datanode
2015-03-23 21:48:51,918 INFO  [main] util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0


and cause the DataNode to shut down.



Solution 2 (Last Resort)

  1. Re-create the HADOOP_DATA_DIR (clear out or move aside any existing contents first)
    mkdir -p $HADOOP_DATA_DIR/data  
    mkdir -p $HADOOP_DATA_DIR/name  
    mkdir -p $HADOOP_DATA_DIR/local  
    chmod 755 $HADOOP_DATA_DIR  
    

  2. Format your NameNode
    ./hdfs namenode -format
    

  3. Re-start Services (see the sketch below)
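
A minimal restart-and-verify sketch, assuming the standard scripts in $HADOOP_HOME/sbin and the hdfs client in $HADOOP_HOME/bin:

     cd $HADOOP_HOME/sbin
     ./start-dfs.sh && ./start-yarn.sh
     # confirm the DataNodes have registered against the freshly formatted NameNode
     ../bin/hdfs dfsadmin -report | grep -E 'Live datanodes|^Name:'

If a DataNode is missing from the report, its log usually shows which clusterID it is still holding on to.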

If your HADOOP_DATA_DIR has data that you don't want to lose, this isn't a great solution


References

  1. http://stackoverflow.com/questions/16020334/hadoop-datanode-process-killed
    1. "Problem could be that NN was formatted after cluster was set up and DN were not, so slaves are still referring to old NN."
  2. http://stackoverflow.com/questions/22316187/datanode-not-starts-correctly
    1. A more fine-grained approach that corrects the individual clusterIDs, rather than brute-force removal of the entire HADOOP_DATA_DIR