In some ways, I feel that this section represents the trickiest part of initially configuring an HDFS cluster. It's easy to get a property name or value wrong, or to leave something important out. The names of properties also change between releases; I really don't understand why Hadoop can't maintain backward compatibility, or what the rationale is for seemingly minor renames like "fs.default.name" to "fs.defaultFS".
Contents of this Article
- Disabling IPv6
- Directory Configuration
- Data (HADOOP_DATA_DIR)
- Configuration Files (HADOOP_CONF_DIR)
- Logs (HADOOP_LOG_DIR)
- Script Configuration
- hadoop-env.sh
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
- capacity-scheduler.xml
- Other Files
- masters
- Start-up Scripts
Disabling IPv6
Apache Hadoop is not currently supported on IPv6 networks; it has only been tested and developed on IPv4 stacks. Hadoop needs IPv4 to work, and only IPv4 clients can talk to the cluster. If your organisation moves to IPv6 only, you will encounter problems. Some Linux releases default to IPv6 only, which means that unless the systems are configured to re-enable IPv4, some machines will break. As of January 2010, this was causing problems in Debian.
Commands
sudo gedit /etc/sysctl.conf
Now add these lines at the end of file:
# IPv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Save and close the file and run this command: sudo sysctl -p
The purpose of this command, per the sysctl man page:
sysctl - configure kernel parameters at runtime
-p : Load in sysctl settings from the file specified, or /etc/sysctl.conf if none given. Specifying - as filename means reading data from standard input.
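A quick way to confirm the change took effect is to read the kernel flag back; it should print 1 once IPv6 is disabled:
cat /proc/sys/net/ipv6/conf/all/disable_ipv6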
HADOOP_DATA_DIR
A directory needs to exist on every machine for storing HDFS data; this variable determines where HDFS will sit. This is really important because HDFS should sit on a filesystem that is local to each cluster node.
Commands
Edit the /etc/environment file and add:
export HADOOP_DATA_DIR=/home/craigtrim/HADOOP_DATA_DIR
In the home directory, run the following commands
mkdir -p $HADOOP_DATA_DIR/data
mkdir -p $HADOOP_DATA_DIR/name
mkdir -p $HADOOP_DATA_DIR/local
sudo chmod 755 $HADOOP_DATA_DIR
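As a quick sanity check (assuming the export form used above, sourcing /etc/environment in the current shell picks the value up without logging out), verify the variable and the three subdirectories:
source /etc/environment
echo $HADOOP_DATA_DIR
ls -ld $HADOOP_DATA_DIR/data $HADOOP_DATA_DIR/name $HADOOP_DATA_DIR/local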
Troubleshooting
It’s important to use the absolute path in the /etc/environment file. For some reason, some of the other scripts that use this variable don’t do well with the “~” shortcut.
HADOOP_CONF_DIR
A configuration folder needs to exist to hold the Hadoop configuration files.
Commands
mkdir $HADOOP_HOME/conf
sudo chmod 755 $HADOOP_HOME/conf
and then edit the /etc/environment file
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
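The new conf directory starts out empty. One way to seed it (an assumption on my part: that the binary distribution's stock templates under $HADOOP_HOME/etc/hadoop are a reasonable starting point) is to copy them across before editing:
cp $HADOOP_HOME/etc/hadoop/*.xml $HADOOP_CONF_DIR/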
HADOOP_LOG_DIR
Determines where Hadoop's logs (including all job errors) should go. This is extremely important, as debugging job failures is impossible without looking at the logs kept here.
Commands
mkdir $HADOOP_HOME/logs
sudo chmod 777 $HADOOP_HOME/logs
and then edit the /etc/environment file
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
Troubleshooting
I know it’s generally not a desirable setting, but I find life somewhat easier if I use chmod 777 rather than the more conservative chmod 755. The latter would give me errors like the following, which disappear with the more permissive setting:
./hadoop-daemon.sh: line 178: /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out: Permission denied
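If 777 feels too permissive, an alternative sketch (assuming the Hadoop daemons all run as the craigtrim account) is to hand ownership of the log directory to that account and keep the tighter mode:
sudo chown -R craigtrim:craigtrim $HADOOP_HOME/logs
sudo chmod 755 $HADOOP_HOME/logs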
hadoop-env.sh
This file contains some environment variable settings used by Hadoop. The only variable you should need in this file is JAVA_HOME, which specifies the path to the Java installation used by Hadoop.
Content
The file didn't exist in 2.5.1, so I added it along with the expected variable:
sudo gedit $HADOOP_HOME/conf/hadoop-env.sh
export JAVA_HOME=${JAVA_HOME}
When this script is called, it will simply reference the session's JAVA_HOME variable.
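Note that if the daemons are started over ssh (as the start-dfs and start-yarn scripts do), the session JAVA_HOME may not be set on the remote shell, so it can be safer to hard-code the path here. The path below is only an example and will differ per machine:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64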
core-site.xml
The purpose of the core-site.xml file is to:
- Define the correct master node (this also occurs in mapred-site.xml).
- Define the location of HDFS using $HADOOP_DATA_DIR.
Content
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- URL of MasterNode/NameNode -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
You can choose to use an IP address rather than a host name, but as I've mentioned in this article, I prefer not to use the IP address. Note that older implementations of Hadoop use the property name "fs.default.name", but this doesn't work in 2.5.x.
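Since the value above uses the host name master rather than an IP, every node needs to resolve that name. A minimal /etc/hosts entry would look like this (the address is a placeholder, matching the 192.168.x.y convention used later in this post):
192.168.x.y    master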
hdfs-site.xml
This file contains configurations for both the NameNode and DataNodes.
Content
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Path to store namespace and transaction logs -->
    <value>file:///home/craigtrim/HADOOP_DATA_DIR/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- Path to store data blocks in datanode -->
    <value>file:///home/craigtrim/HADOOP_DATA_DIR/data</value>
  </property>
</configuration>
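Once dfs.name.dir points at the intended location, the NameNode needs a one-time format before its first start (re-running this later would wipe the HDFS metadata, so it is strictly a first-start step):
$HADOOP_HOME/bin/hdfs namenode -format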
mapred-site.xml
The map reduce local directory is the location Hadoop uses to store temporary files during jobs. Add the JobTracker location to HADOOP_HOME/conf/mapred-site.xml; Hadoop will use this for the jobs. The final property sets the maximum number of map tasks per node.
Content
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.x.y:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/craigtrim/HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>
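A quick way to confirm that Hadoop is actually picking these files up from $HADOOP_CONF_DIR (rather than silently falling back to defaults) is to query an effective property value; this should echo the hdfs://master:9000 value set in core-site.xml:
$HADOOP_HOME/bin/hdfs getconf -confKey fs.defaultFS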
capacity-scheduler.xml
From the official documentation, this configures a pluggable scheduler for Hadoop which allows for multiple-tenants to securely share a large cluster such that their applications are allocated resources in a timely manner under constraints of allocated capacities.
I'm going to add that using the default configuration has worked well for me so far, and that's all I know about this component at this time.
Content
I copied this default configuration into my directory: default capacity-scheduler.xml (revision=1495684). This enabled my resourcemanager node to start successfully.
masters
The conf/masters file defines on which machines Hadoop will start secondary NameNodes in the cluster. Either list the IP address (one per line), or update the /etc/hosts file to include a hostname with each IP address.
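As a minimal example (assuming the secondary NameNode should run on the master node itself, which is one common arrangement for a small cluster), the file would contain a single host name:
echo "master" > $HADOOP_CONF_DIR/masters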
Troubleshooting
Relative vs Absolute Path
For some reason the use of the relative path modifier “~” to indicate the home directory fails here. There is a concatenation issue if that is used. So it’s necessary to use absolute paths.
Beware Silent Failures
Also, make sure you get the values absolutely right in these config files. You can get some really funky errors, or have startup scripts die without any error output at all, if there are any problems.
For example, my first time through I fat-fingered “hdsfs://” and my namenode start script didn’t work. But there were no errors!
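One cheap defence against this kind of typo is to grep the configuration directory for URI schemes before starting anything; a misspelling like "hdsfs://" stands out immediately in the output:
grep -rn "://" $HADOOP_CONF_DIR/*.xml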
Use of URIs
If you don’t use the file:/// prefix you’ll get this error:
2014-11-19 16:49:35,270 WARN [main] common.Util (Util.java:stringAsURI(56)) - Path /home/craigtrim/HADOOP_DATA_DIR/name should be specified as a URI in configuration files. Please update hdfs configuration.
References:
- http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
- An excellent (all-inclusive) overview of running an HDFS cluster on Ubuntu
- http://mysolvedproblem.blogspot.com/2012/05/installing-hadoop-on-ubuntu-linux-on.html
- A reply to Michael Noll's original post with an example of errors you might get
- Setup
- [Apache Hadoop] org.apache.hadoop.conf.Configuration
- How Hadoop deals with Configuration files
- [Noobs Lab] Disabling IPv6
- Information on disabling IPv6 - quoted in this post.
- [Apache Hadoop] Hadoop and IPv6
- Why IPv6 and Hadoop and Ubuntu don't get along.
- The Hadoop Distributed File System
- Architectural overview of HDFS
- Running
- [Apache Hadoop] Starting the Hadoop Cluster
- Use of single scripts vs. the start-dfs and start-yarn aggregate scripts.
- [HortonWorks] Manually starting the Cluster
- Manually starting the cluster can be a great way of smoke-testing your installation. The automated scripts are useful, but I recommend starting components manually at the outset of a new installation. This makes failure at any point in the cluster startup easier to pinpoint.
- Troubleshooting
- [StackOverflow] No Datanodes are Running
- What happens when I start my cluster and no datanodes are running?
Next: Cloning for Clusters.