In some ways, I feel that this section represents the trickiest part of initially configuring an HDFS cluster. It's easy to get a property name or value wrong, or to leave something important out. The names of properties also change between releases; I really don't understand why Hadoop can't maintain backward compatibility, or what the rationale is for seemingly minor renames like "fs.default.name" to "fs.defaultFS".
Contents of this Article
- Disabling IPv6
- Directory Configuration
- Data (HADOOP_DATA_DIR)
- Configuration Files (HADOOP_CONF_DIR)
- Logs (HADOOP_LOG_DIR)
- Script Configuration
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- capacity-scheduler.xml
- Other Files
- masters
- Start-up Scripts
Disabling IPv6
Apache Hadoop is not currently supported on IPv6 networks; it has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Some Linux releases default to IPv6 only, which means that unless the systems are configured to re-enable IPv4, some machines will break. As of January 2010, this was causing problems in Debian.
Commands
sudo gedit /etc/sysctl.conf
Now add these lines at the end of file:
# IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and close the file and run this command: sudo sysctl -p
The purpose of this command, per the sysctl man page:
sysctl - configure kernel parameters at runtime
-p : Load in sysctl settings from the file specified, or /etc/sysctl.conf if none given. Specifying - as filename means reading data from standard input.
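A quick way to confirm the change took effect is to read the kernel flag back; it should print 1 once IPv6 is disabled:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6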
HADOOP_DATA_DIR
A directory needs to exist on every machine for storing HDFS data; this variable determines where HDFS will sit. This is really important because HDFS should sit on a filesystem that is local to each cluster node.
Commands
Edit the /etc/environment file and add:
export HADOOP_DATA_DIR=/home/craigtrim/HADOOP_DATA_DIR
In the home directory, run the following commands
mkdir -p $HADOOP_DATA_DIR/data
mkdir -p $HADOOP_DATA_DIR/name
mkdir -p $HADOOP_DATA_DIR/local
sudo chmod 755 $HADOOP_DATA_DIR
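As a quick sanity check (assuming the export form used above, sourcing /etc/environment in the current shell picks the value up without logging out), verify the variable and the three subdirectories:
source /etc/environment
echo $HADOOP_DATA_DIR
ls -ld $HADOOP_DATA_DIR/data $HADOOP_DATA_DIR/name $HADOOP_DATA_DIR/local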
Troubleshooting
It’s important to use the absolute path in the /etc/environment file. For some reason, some of the other scripts that use this variable don’t do well with the “~” shortcut.
HADOOP_CONF_DIR
A configuration folder needs to exist to hold the Hadoop configuration files.
Commands
mkdir $HADOOP_HOME/conf
sudo chmod 755 $HADOOP_HOME/conf
and then edit the /etc/environment file
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
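The new conf directory starts out empty. One way to seed it (an assumption on my part: that the binary distribution's stock templates under $HADOOP_HOME/etc/hadoop are a reasonable starting point) is to copy them across before editing:
cp $HADOOP_HOME/etc/hadoop/*.xml $HADOOP_CONF_DIR/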
HADOOP_LOG_DIR
Determines where Hadoop's logs (including all job errors) should go. This is extremely important, as debugging job failures is impossible without looking at the logs kept here.
Commands
mkdir $HADOOP_HOME/logs
sudo chmod 777 $HADOOP_HOME/logs
and then edit the /etc/environment file
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
Troubleshooting
I know it’s generally not a desirable setting, but I find life somewhat easier if I use chmod 777 rather than the more conservative chmod 755. The latter would give me errors like the following, which disappear with the more permissive setting:
./hadoop-daemon.sh: line 178: /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out: Permission denied
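If 777 feels too permissive, an alternative sketch (assuming the Hadoop daemons all run as the craigtrim account) is to hand ownership of the log directory to that account and keep the tighter mode:
sudo chown -R craigtrim:craigtrim $HADOOP_HOME/logs
sudo chmod 755 $HADOOP_HOME/logs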
hadoop-env.sh
This file contains some environment variable settings used by Hadoop. The only variable you should need in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
Content
The file didn't exist in 2.5.1, so I added it along with the expected variable:
sudo gedit $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=${JAVA_HOME}
When this script is called, it will simply reference the session's JAVA_HOME variable.
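Note that if the daemons are started over ssh (as the start-dfs and start-yarn scripts do), the session JAVA_HOME may not be set on the remote shell, so it can be safer to hard-code the path here. The path below is only an example and will differ per machine:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64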
core-site.xml
The purpose of the core-site.xml file is to:
- Define the correct master node (this also occurs in mapred-site.xml).
- Define the location of HDFS using $HADOOP_DATA_DIR.
Content
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- URL of MasterNode/NameNode -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
You can choose to use an IP address rather than a host name, but as I've mentioned in this article, I prefer not to use the IP address. Note that older implementations of Hadoop use the property name "fs.default.name", but this doesn't work in 2.5.x.
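Since the value above uses the host name master rather than an IP, every node needs to resolve that name. A minimal /etc/hosts entry would look like this (the address is a placeholder, matching the 192.168.x.y convention used later in this post):
192.168.x.y    master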
hdfs-site.xml
This file contains configurations for both the NameNode and DataNodes.
Content
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Path to store namespace and transaction logs -->
    <value>file:///home/craigtrim/HADOOP_DATA_DIR/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- Path to store data blocks in datanode -->
    <value>file:///home/craigtrim/HADOOP_DATA_DIR/data</value>
  </property>
</configuration>
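Once dfs.name.dir points at the intended location, the NameNode needs a one-time format before its first start (re-running this later would wipe the HDFS metadata, so it is strictly a first-start step):
$HADOOP_HOME/bin/hdfs namenode -format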
mapred-site.xml
The map reduce local directory is the location Hadoop uses to store temporary files during jobs. Add the JobTracker location to HADOOP_HOME/conf/mapred-site.xml; Hadoop will use this for the jobs. The final property sets the maximum number of map tasks per node.
Content
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.x.y:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/craigtrim/HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
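A quick way to confirm that Hadoop is actually picking these files up from $HADOOP_CONF_DIR (rather than silently falling back to defaults) is to query an effective property value; this should echo the hdfs://master:9000 value set in core-site.xml:
$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS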
capacity-scheduler.xml
From the official documentation, this configures a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities.
I'm going to add that using the default configuration has worked well for me so far, and that's all I know about this component at this time.
Content
I copied this default configuration into my directory: default capacity-scheduler.xml (revision=1495684). This enabled my resourcemanager node to start successfully.
masters
The conf/masters file defines on which machines Hadoop will start secondary NameNodes in the cluster. Either list the IP address (one per line), or update the /etc/hosts file to include a hostname with each IP address.
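As a minimal example (assuming the secondary NameNode should run on the master node itself, which is one common arrangement for a small cluster), the file would contain a single host name:
echo "master" > $HADOOP_CONF_DIR/masters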
Troubleshooting
Relative vs Absolute Path
For some reason the use of the relative path modifier “~” to indicate the home directory fails here. There is a concatenation issue if that is used. So it’s necessary to use absolute paths.
Beware Silent Failures
Also, make sure you get the values absolutely right in these config files. You can get some really funky errors, or have startup scripts die without any error output at all, if there are any problems.
For example, my first time through I fat-fingered “hdsfs://” and my namenode start script didn’t work. But there were no errors!
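One cheap defence against this kind of typo is to grep the configuration directory for URI schemes before starting anything; a misspelling like "hdsfs://" stands out immediately in the output:
grep -rn "://" $HADOOP_CONF_DIR/*.xml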
Use of URIs
If you don’t use the file:/// prefix you’ll get this error:
2014-11-19 16:49:35,270 WARN [main] common.Util (Util.java:stringAsURI(56)) - Path /home/craigtrim/HADOOP_DATA_DIR/name should be specified as a URI in configuration files. Please update hdfs configuration.
References:
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
- An excellent (all-inclusive) overview of running an HDFS cluster on Ubuntu
- http://mysolvedproblem.blogspot.com/2012/05/installing-hadoop-on-ubuntu-linux-on.html
- A reply to Michael Noll's original post with an example of errors you might get
- Setup
- [Apache Hadoop] org.apache.hadoop.conf.Configuration
- How Hadoop deals with Configuration files
- [Noobs Lab] Disabling IPv6
- Information on disabling IPv6 - quoted in this post.
- [Apache Hadoop] Hadoop and IPv6
- Why IPv6 and Hadoop and Ubuntu don't get along.
- The Hadoop Distributed File System
- Architectural overview of HDFS
- Running
- [Apache Hadoop] Starting the Hadoop Cluster
- Use of single scripts vs. the start-dfs and start-yarn aggregate scripts.
- [HortonWorks] Manually starting the Cluster
- Manually starting the cluster can be a great way of smoke-testing your installation. The automated scripts are useful, but I recommend starting components manually at the outset of a new installation. This makes failure at any point in the cluster startup easier to pinpoint.
- Troubleshooting
- [StackOverflow] No Datanodes are Running
- What happens when I start my cluster and no datanodes are running?
Next: Cloning for Clusters.