Monday, December 1, 2014

Hadoop Troubleshooting: Incompatible Cluster IDs

The NameNode log reports the following registration ID mismatch:

2015-03-23 21:09:06,824 WARN  [IPC Server handler 5 on 9000] namenode.NameNode (NameNodeRpcServer.java:verifyRequest(1177)) - Registration IDs mismatched: the DatanodeRegistration ID is NS-1432439258-CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5-0 but the expected ID is NS-691282619-CID-efa3290f-7776-4cfc-8e92-a438d11abdd8-0
2015-03-23 21:09:06,825 INFO  [IPC Server handler 5 on 9000] ipc.Server (Server.java:run(2060)) - IPC Server handler 5 on 9000, call org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 192.168.1.201:37895 Call#13731 Retry#0
org.apache.hadoop.hdfs.protocol.UnregisteredNodeException: Unregistered server: DatanodeRegistration(192.168.1.201, datanodeUuid=5c2239a4-d14c-4549-8b87-f956ece5d946, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5;nsid=1432439258;c=0)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.verifyRequest(NameNodeRpcServer.java:1180)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1074)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:107)
 at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26380)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)


Potential Causes

  1. The NameNode was formatted while the DataNodes still had data in the HADOOP_DATA_DIR
    1. In my case, the error was triggered when I attempted to load data into HDFS
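To confirm this cause, compare the clusterID recorded on the NameNode with the one on a DataNode. As a sketch (the paths assume this post's $HADOOP_DATA_DIR layout; the demo fakes the two VERSION files so the comparison can run anywhere):

```shell
# Fake the two VERSION files so the comparison can be demonstrated.
# On a real cluster, read $HADOOP_DATA_DIR/name/current/VERSION (NameNode)
# and $HADOOP_DATA_DIR/data/current/VERSION (DataNode) instead.
WORK=$(mktemp -d)
echo 'clusterID=CID-efa3290f-7776-4cfc-8e92-a438d11abdd8' > "$WORK/nn_VERSION"
echo 'clusterID=CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5' > "$WORK/dn_VERSION"

# Extract the clusterID line from each side.
NN_CID=$(grep '^clusterID=' "$WORK/nn_VERSION")
DN_CID=$(grep '^clusterID=' "$WORK/dn_VERSION")

# A mismatch here is exactly what produces the error above.
if [ "$NN_CID" != "$DN_CID" ]; then
  echo "Mismatch: NameNode $NN_CID vs DataNode $DN_CID"
fi
```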



Solution 1


The ClusterID from the NameNode needs to be copied onto the DataNode.

The NameNode log should contain an error message showing the incompatible cluster IDs:
the DatanodeRegistration ID is NS-1432439258-CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5-0 but the expected ID is NS-691282619-CID-efa3290f-7776-4cfc-8e92-a438d11abdd8-0
Copy the NameNode's clusterID into the VERSION file in the $HADOOP_DATA_DIR/data/current directory on the DataNode.

craig@dn02:~/HADOOP_DATA_DIR/data/current$ cat VERSION
#Mon Mar 23 09:37:50 PDT 2015
storageID=DS-821e9b0e-c3c8-478d-9829-9d1f721b84ed
clusterID=CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5
cTime=0
datanodeUuid=c9363e06-81bf-467f-958c-042292edd3bf
storageType=DATA_NODE
layoutVersion=-56

The clusterID value above needs to be changed to the "expected ID" CID value found in the NameNode log file (CID-efa3290f-7776-4cfc-8e92-a438d11abdd8 in this example).
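As a sketch, the edit can be scripted with sed so that only the clusterID line changes. The demo below works on a throwaway copy of the VERSION file shown above; on a real DataNode, point VERSION_FILE at $HADOOP_DATA_DIR/data/current/VERSION instead:

```shell
# Demo on a throwaway copy; on a real DataNode the file lives at
# $HADOOP_DATA_DIR/data/current/VERSION.
WORK=$(mktemp -d)
VERSION_FILE="$WORK/VERSION"

# Sample VERSION contents, taken from the DataNode above.
cat > "$VERSION_FILE" <<'EOF'
storageID=DS-821e9b0e-c3c8-478d-9829-9d1f721b84ed
clusterID=CID-e5f1aae5-2c67-487a-aa0e-5710e3b679e5
cTime=0
datanodeUuid=c9363e06-81bf-467f-958c-042292edd3bf
storageType=DATA_NODE
layoutVersion=-56
EOF

# The clusterID the NameNode expects (the "expected ID" in its log).
EXPECTED_CID="CID-efa3290f-7776-4cfc-8e92-a438d11abdd8"

# Back up the file, then rewrite only the clusterID line,
# leaving datanodeUuid and storageID untouched.
cp "$VERSION_FILE" "$VERSION_FILE.bak"
sed -i "s/^clusterID=.*/clusterID=$EXPECTED_CID/" "$VERSION_FILE"

grep '^clusterID=' "$VERSION_FILE"
```

Because only the clusterID line is rewritten, this is safe to run per-node, which avoids the multi-node pitfall described below.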

IMPORTANT:  If you are operating a multi-node cluster, do not copy this corrected VERSION file over the existing VERSION files on the other DataNodes.  Doing so would overwrite each node's unique datanodeUuid and storageID values.

Doing this will trigger an UnregisteredNodeException error:
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnregisteredNodeException): Data node DatanodeRegistration(192.168.1.204, datanodeUuid=5c2239a4-d14c-4549-8b87-f956ece5d946, infoPort=50075, ipcPort=50020, storageInfo=lv=-56;cid=CID-efa3290f-7776-4cfc-8e92-a438d11abdd8;nsid=691282619;c=0) is attempting to report storage ID 5c2239a4-d14c-4549-8b87-f956ece5d946. Node 192.168.1.203:50010 is expected to serve this storage.
 at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getDatanode(DatanodeManager.java:477)
 at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1780)
 at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1097)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:152)
 at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26382)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)

 at org.apache.hadoop.ipc.Client.call(Client.java:1468)
 at org.apache.hadoop.ipc.Client.call(Client.java:1399)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
 at com.sun.proxy.$Proxy12.blockReport(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:175)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:492)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:715)
 at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
 at java.lang.Thread.run(Thread.java:745)
2015-03-23 21:48:49,812 WARN  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] datanode.DataNode (BPServiceActor.java:run(861)) - Ending block pool service for: Block pool BP-494263941-10.0.4.15-1426797442562 (Datanode Uuid 5c2239a4-d14c-4549-8b87-f956ece5d946) service to master/192.168.1.70:9000
2015-03-23 21:48:49,913 INFO  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] datanode.DataNode (BlockPoolManager.java:remove(103)) - Removed Block pool BP-494263941-10.0.4.15-1426797442562 (Datanode Uuid 5c2239a4-d14c-4549-8b87-f956ece5d946)
2015-03-23 21:48:49,914 INFO  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] datanode.DataBlockScanner (DataBlockScanner.java:removeBlockPool(273)) - Removed bpid=BP-494263941-10.0.4.15-1426797442562 from blockPoolScannerMap
2015-03-23 21:48:49,914 INFO  [DataNode: [[[DISK]file:/home/craig/HADOOP_DATA_DIR/data/]]  heartbeating to master/192.168.1.70:9000] impl.FsDatasetImpl (FsDatasetImpl.java:shutdownBlockPool(2217)) - Removing block pool BP-494263941-10.0.4.15-1426797442562
2015-03-23 21:48:51,915 WARN  [main] datanode.DataNode (DataNode.java:secureMain(2392)) - Exiting Datanode
2015-03-23 21:48:51,918 INFO  [main] util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 0


and will cause the DataNode to shut down, as the end of the log above shows.



Solution 2 (Last Resort)

  1. Remove and re-create the HADOOP_DATA_DIR
    rm -rf $HADOOP_DATA_DIR  
    mkdir -p $HADOOP_DATA_DIR/data  
    mkdir -p $HADOOP_DATA_DIR/name  
    mkdir -p $HADOOP_DATA_DIR/local  
    chmod 755 $HADOOP_DATA_DIR  
    

  2. Format your NameNode
    ./hdfs namenode -format
    

  3. Re-start Services

If your HADOOP_DATA_DIR holds data that you can't afford to lose, this isn't a great solution: wiping the directory destroys every block stored on that node.
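The directory re-creation in step 1 can be scripted. This sketch runs against a throwaway root so it can be tried safely; on a real cluster, point HADOOP_DATA_DIR at the actual location, and follow up with the format and restart steps (which the demo cannot perform):

```shell
# Demo under a temp root; on an actual cluster, set HADOOP_DATA_DIR
# to the real location. WARNING: on a real cluster this destroys
# all HDFS data on the node.
HADOOP_DATA_DIR=$(mktemp -d)/hadoop_data

rm -rf "$HADOOP_DATA_DIR"          # drop stale storage and its VERSION files
mkdir -p "$HADOOP_DATA_DIR/data" \
         "$HADOOP_DATA_DIR/name" \
         "$HADOOP_DATA_DIR/local"
chmod 755 "$HADOOP_DATA_DIR"

ls "$HADOOP_DATA_DIR"

# On the real cluster, follow with (not run here):
#   hdfs namenode -format     # assigns a fresh clusterID
#   restart HDFS services     # DataNodes adopt the new clusterID on first start
```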


References

  1. http://stackoverflow.com/questions/16020334/hadoop-datanode-process-killed
    1. "Problem could be that NN was formatted after cluster was set up and DN were not, so slaves are still referring to old NN."
  2. http://stackoverflow.com/questions/22316187/datanode-not-starts-correctly
    1. A more fine-grained approach that fixes individual clusterIDs, rather than brute-force removal of the entire HADOOP_DATA_DIR
