Thursday, November 20, 2014

Configuring the HDFS Base Node

Perform a base configuration of a single Linux VM with Hadoop. Once the base configuration is complete, we’ll clone this VM into the “name node” and the “slave nodes”.

In some ways, I feel that this section represents the trickiest part of initially configuring the HDFS cluster.  It's easy to get a property name or value wrong, or to leave something important out.  Likewise, the names of properties often change.  I really don't understand why Hadoop can't maintain backward compatibility, or what the rationale is for seemingly minor renames like "fs.default.name" to "fs.defaultFS".


Contents of this Article

  1. Disabling IPv6
  2. Directory Configuration
    1. Data (HADOOP_DATA_DIR)
    2. Configuration Files (HADOOP_CONF_DIR)
    3. Logs (HADOOP_LOG_DIR)
  3. Script Configuration
    1. hadoop-env.sh
    2. core-site.xml
    3. hdfs-site.xml
    4. mapred-site.xml
    5. capacity-scheduler.xml
  4. Other Files
    1. masters
  5. Start-up Scripts

Disabling IPv6


Apache Hadoop is not currently supported on IPv6 networks; it has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Some Linux releases default to being IPv6 only, which means that unless the systems are configured to re-enable IPv4, some machines will break. As of Jan 2010, this was causing problems in Debian.

Commands

 sudo gedit /etc/sysctl.conf  

Now add these lines at the end of the file:
 # IPv6  
 net.ipv6.conf.all.disable_ipv6 = 1  
 net.ipv6.conf.default.disable_ipv6 = 1  
 net.ipv6.conf.lo.disable_ipv6 = 1  
Save and close the file and run this command:
 sudo sysctl -p  

The purpose of this command is:
 sysctl - configure kernel parameters at runtime  
 -p  
 Load in sysctl settings from the file specified or /etc/sysctl.conf if none given. Specifying - as filename means reading data from standard input.  
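To confirm the change took effect, a quick check (this just reads the kernel parameter back; a value of 1 means IPv6 is disabled):
 cat /proc/sys/net/ipv6/conf/all/disable_ipv6  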



HADOOP_DATA_DIR


A directory needs to exist on every machine for storing HDFS data. This determines where HDFS will sit, which is really important because HDFS should be sitting on a filesystem that is local to each cluster node.

Commands

Edit the /etc/environment file
 export HADOOP_DATA_DIR=/home/craigtrim/HADOOP_DATA_DIR  
In the home directory, run the following commands
 mkdir -p $HADOOP_DATA_DIR/data  
 mkdir -p $HADOOP_DATA_DIR/name  
 mkdir -p $HADOOP_DATA_DIR/local  
 sudo chmod 755 $HADOOP_DATA_DIR  
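A quick sanity check that the directories landed under the absolute path the scripts expect (in a fresh shell, source the file first so the variable is visible; nothing Hadoop-specific here):
 source /etc/environment  
 ls -ld $HADOOP_DATA_DIR/data $HADOOP_DATA_DIR/name $HADOOP_DATA_DIR/local  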

Troubleshooting

It’s important to use the absolute path in the /etc/environment file. The “~” shortcut is only expanded by an interactive shell; the scripts that read this variable treat it literally, so the path must be spelled out in full.


HADOOP_CONF_DIR


A configuration directory needs to exist to hold Hadoop's configuration files.

Commands

 mkdir $HADOOP_HOME/conf  
 sudo chmod 755 $HADOOP_HOME/conf  
and then edit the /etc/environment file
 export HADOOP_CONF_DIR=$HADOOP_HOME/conf  
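Once the remaining sections below are complete, this directory should hold at least the files listed in the table of contents; listing it back makes a convenient checklist (the comment shows what I would expect to see, not captured output):
 ls $HADOOP_CONF_DIR  
 # expected: capacity-scheduler.xml core-site.xml hadoop-env.sh hdfs-site.xml mapred-site.xml masters  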



HADOOP_LOG_DIR

Determines where Hadoop's logs (including all job errors) should go. This is extremely important, as debugging job failures is impossible without looking at the logs kept here.

Commands

 mkdir $HADOOP_HOME/logs  
 sudo chmod 777 $HADOOP_HOME/logs  
and then edit the /etc/environment file
 export HADOOP_LOG_DIR=$HADOOP_HOME/logs  

Troubleshooting

I know it’s generally not a desirable setting, but I find life somewhat easier if I use
 chmod 777  
rather than the more conservative
 chmod 755  
The latter would give me errors like the following:
 ./hadoop-daemon.sh: line 178: /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out: Permission denied  
These appear to be mitigated with the use of the former command.
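If you'd rather keep 755, one alternative (assuming the daemons run as the same user that owns the installation, craigtrim in my case) is to hand ownership of the logs directory to that user:
 sudo chown -R craigtrim:craigtrim $HADOOP_HOME/logs  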


hadoop-env.sh


This file contains some environment variable settings used by Hadoop. The only variable you should need in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.

Content

The file didn't exist in my newly created conf directory, so I added it along with the expected variable:
 sudo gedit $HADOOP_HOME/conf/hadoop-env.sh  
 export JAVA_HOME=${JAVA_HOME}  
When this script is called, it will simply pick up the JAVA_HOME value from the session environment.
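For reference, a minimal sketch of the whole file; the fallback path is only an example of where a JDK might live and should be replaced with your actual Java installation:
 # $HADOOP_HOME/conf/hadoop-env.sh  
 # Use the session JAVA_HOME if set; otherwise fall back to an example JDK path.  
 export JAVA_HOME=${JAVA_HOME:-/usr/lib/jvm/java-7-openjdk-amd64}  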


core-site.xml

The purpose of the core-site.xml file is to
  1. Define the correct master node (this also occurs in mapred-site.xml).
  2. Define the location of HDFS using $HADOOP_DATA_DIR

Content

 <configuration>  
   <property>  
    <name>fs.defaultFS</name>  
    <!-- URL of MasterNode/NameNode -->  
    <value>hdfs://master:9000</value>  
   </property>  
 </configuration>  

You can choose to use an IP address rather than a host name, but as I've mentioned in this article, I prefer not to use the IP address.  Note that older Hadoop releases use the property name "fs.default.name"; in 2.5.x that key is deprecated in favor of "fs.defaultFS", so use the new name here.
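For the "master" hostname to resolve, every node needs a matching entry in /etc/hosts (or DNS). A sketch, with placeholder addresses and example slave names that should be replaced with your own:
 # /etc/hosts  
 192.168.x.y   master  
 192.168.x.z   slave1  
 192.168.x.w   slave2  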

hdfs-site.xml

This file contains configurations for both the NameNode and DataNodes.

Content

 <configuration>  
   <property>  
    <name>dfs.name.dir</name>  
    <!-- Path to store namespace and transaction logs -->  
    <value>file:///home/craigtrim/HADOOP_DATA_DIR/name</value>  
   </property>  
   <property>  
    <name>dfs.data.dir</name>  
    <!-- Path to store data blocks in datanode -->  
    <value>file:///home/craigtrim/HADOOP_DATA_DIR/data</value>  
   </property>  
 </configuration>  
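A quick way to double-check what Hadoop actually reads from these files (assuming $HADOOP_HOME/bin is on the PATH) is the getconf tool:
 hdfs getconf -confKey fs.defaultFS  
 hdfs getconf -confKey dfs.name.dir  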



mapred-site.xml


The MapReduce local directory is the location Hadoop uses to store temporary files. Add the JobTracker location to HADOOP_HOME/conf/mapred-site.xml; Hadoop will use it for jobs. The final property sets the maximum number of map tasks per node.

Content

 <configuration>  
  <property>  
   <name>mapred.job.tracker</name>  
   <value>192.168.x.y:9001</value>  
  </property>  
  <property>  
   <name>mapred.local.dir</name>  
   <value>/home/craigtrim/HADOOP_DATA_DIR/local</value>  
  </property>  
  <property>  
   <name>mapred.tasktracker.map.tasks.maximum</name>  
   <value>8</value>  
  </property>  
 </configuration>  
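As an aside, 2.x clusters that run MapReduce on YARN (the resourcemanager mentioned in the next section) normally also select the framework in this file; the snippet below is only a sketch of that property, not something the configuration above requires:
  <property>  
   <name>mapreduce.framework.name</name>  
   <value>yarn</value>  
  </property>  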



capacity-scheduler.xml


From the official documentation: this configures a pluggable scheduler for Hadoop which allows multiple tenants to securely share a large cluster, such that their applications are allocated resources in a timely manner under constraints of allocated capacities.

I'm going to add that using the default configuration has worked well for me so far, and that's all I know about this component at this time.

Content

I copied this default configuration into my directory:
default capacity-scheduler.xml (revision=1495684)
and this enabled my resourcemanager node to start successfully.
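If you unpacked the stock 2.5.1 tarball, the same default file ships inside the distribution, so copying it across is an alternative to downloading it (the etc/hadoop path below assumes the standard tarball layout):
 cp $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml $HADOOP_CONF_DIR/  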


masters


The conf/masters file defines the machines on which Hadoop will start secondary NameNodes in the cluster.  Either list the IP addresses (one per line), or update the /etc/hosts file to map a hostname to each IP address and list the hostnames instead.
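A sketch of the file, assuming the master host will also run the secondary NameNode:
 sudo gedit $HADOOP_CONF_DIR/masters  
 master  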


Troubleshooting



Relative vs Absolute Path

For some reason the use of the relative path modifier “~” to indicate the home directory fails here; the value is concatenated literally rather than expanded. So it’s necessary to use absolute paths.


Beware Silent Failures

Also, make sure you get the values absolutely right in these config files. You can get some really funky errors, or startup scripts can simply die without errors if there are any problems. For example, my first time through I fat-fingered “hdsfs://” and my namenode start script didn’t work. But there were no errors!
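When a start script dies silently, the first place I look is the daemon log under $HADOOP_LOG_DIR; the file names follow the hadoop-<user>-<daemon>-<host>.log pattern seen earlier:
 tail -n 50 $HADOOP_LOG_DIR/hadoop-*-namenode-*.log  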


Use of URIs

If you don’t use the file:/// prefix you’ll get this error:
 2014-11-19 16:49:35,270 WARN [main] common.Util (Util.java:stringAsURI(56)) - Path /home/craigtrim/HADOOP_DATA_DIR/name should be specified as a URI in configuration files. Please update hdfs configuration.  



References:
  1. http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
    1. An excellent (all-inclusive) overview of running an HDFS cluster on Ubuntu
    2. http://mysolvedproblem.blogspot.com/2012/05/installing-hadoop-on-ubuntu-linux-on.html
      1. A reply to Michael Noll's original post with an example of errors you might get
  2. Setup
    1. [Apache Hadoop] org.apache.hadoop.conf.Configuration
      1. How Hadoop deals with Configuration files
    2. [Noobs Lab] Disabling IPv6
      1. Information on disabling IPv6 - quoted in this post.
      2. [Apache Hadoop] Hadoop and IPv6
        1. Why IPv6 and Hadoop and Ubuntu don't get along.
    3. The Hadoop Distributed File System
      1. Architectural overview of HDFS
  3. Running
    1. [Apache Hadoop] Starting the Hadoop Cluster
      1. Use of single scripts vs. the start-dfs and start-yarn aggregate scripts.
      2. [HortonWorks] Manually starting the Cluster
        1. Manually starting the cluster can be a great way of smoke-testing your installation.  The automated scripts are useful, but I recommend starting components manually at the outset of a new installation.  This makes failure at any point in the cluster startup easier to pinpoint.
  4. Troubleshooting
    1. [StackOverflow] No Datanodes are Running
      1. What happens when I start my cluster and no datanodes are running?


Next: Cloning for Clusters.
