Thursday, November 20, 2014

Installing Hadoop on Ubuntu

This article is part of the Hadoop Masterpage.


This is not an article about configuring Hadoop on Ubuntu as a development environment. This article will get you started with configuring Hadoop on Linux as a deployment environment (either single-node or clustered). To use Hadoop in a development environment, I recommend using Maven with Eclipse (described in this article).


Where do I download Hadoop?


I start at the Apache Hadoop site here:
http://hadoop.apache.org/releases.html#Download
At the time of this writing, I'm using Apache Hadoop 2.5.1.
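
If you prefer to download from the terminal, something along these lines should work. The mirror URL below is an assumption based on the Apache archive layout, so verify it against the link the releases page gives you:
 cd ~/Downloads/   
    keeps the download in your Downloads folder, to match the installation steps below   
 wget https://archive.apache.org/dist/hadoop/common/hadoop-2.5.1/hadoop-2.5.1.tar.gz   
    fetches the 2.5.1 binary tarball (the filename and URL are assumptions; adjust for your release)   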


Where should I install Hadoop?


If you are not familiar with the Linux directory layout, this article is recommended:
http://www.nixtutor.com/linux/understanding-the-linux-directory-layout/
The binaries should be installed under the /usr directory. This directory houses the binaries, documentation, libraries, and header files for user applications. Most user binaries are installed here, making it one of the largest directories on a Linux system.

My preferred path is:
 /usr/lib/apache/hadoop/2.5.1/  
This leaves the opportunity to distinguish between various Apache products and Hadoop versions.
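
For example, with Maven installed alongside (the Maven path here simply mirrors the one that appears in the PATH settings later in this article), the layout would look something like this:
 /usr/lib/apache/  
     hadoop/  
         2.5.1/  
     maven/  
         3.2.3/  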


How do I install Hadoop?


Typically, the archive will be downloaded to the Downloads folder in your home directory. Using a terminal, follow these steps (the whole sequence is consolidated in a sketch after the list):
sudo mkdir -p /usr/lib/apache/hadoop/2.5.1/   
   creates the directory path for the new installation   
   the -p flag creates the full directory hierarchy   
   you'll need superuser privileges, which is what sudo provides   
cd ~/Downloads/   
   navigates to the downloads directory   
sudo mv hadoop* /usr/lib/apache/hadoop/2.5.1   
   move the downloaded file to the installation path   
cd /usr/lib/apache/hadoop/2.5.1/   
   go there yourself   
sudo tar -zxvf hadoop*   
   unpack the installation file   
sudo rm hadoop*.tar.gz   
   remove the downloaded archive (optional); being specific matters here, since the extracted directory also matches hadoop*   
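
For convenience, here is the same sequence as a single sketch. The archive name hadoop-2.5.1.tar.gz is an assumption, so substitute whatever filename you actually downloaded. Note that the tarball unpacks into a hadoop-2.5.1/ subdirectory; GNU tar's --strip-components=1 flag is shown here so that the bin directory ends up directly under /usr/lib/apache/hadoop/2.5.1/, which is what the PATH settings below expect.

 # create the installation directory and move the downloaded archive into it
 sudo mkdir -p /usr/lib/apache/hadoop/2.5.1/
 sudo mv ~/Downloads/hadoop-2.5.1.tar.gz /usr/lib/apache/hadoop/2.5.1/
 cd /usr/lib/apache/hadoop/2.5.1/
 # unpack; --strip-components=1 drops the leading hadoop-2.5.1/ folder
 sudo tar --strip-components=1 -zxvf hadoop-2.5.1.tar.gz
 # optionally remove the archive (the extracted files remain)
 sudo rm hadoop-2.5.1.tar.gz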



Setting the Path


I recommend setting the path in the /etc/environment file. 

This file is specifically meant for system-wide environment variable settings. It is not a script file, but rather consists of assignment expressions, one per line.

Open the file by typing:
 sudo gedit /etc/environment  
and once completed, you should have something like this:
 PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/jvm/jdk/jdk1.8.0_25/bin/:/usr/lib/apache/maven/3.2.3/bin:/usr/lib/apache/hadoop/2.5.1/bin"  
 export PATH  
 export JAVA_HOME=/usr/lib/jvm/jdk/jdk1.8.0_25  
 export MAVEN_HOME=/usr/lib/apache/maven/3.2.3  
 export HADOOP_HOME=/usr/lib/apache/hadoop/2.5.1  

The default PATH entries (everything up through /usr/local/games) already exist in the file; append the JDK, Maven, and Hadoop bin directories to the end of that line and add the remaining lines below it.

To activate these path variables for the current terminal session, type
 source /etc/environment  

If your path variables were set successfully, you should be able to type
 echo $HADOOP_HOME  
in the terminal window and get back the path to your installation.
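
With the installation path used in this article, that check looks like this:
 echo $HADOOP_HOME  
    should print /usr/lib/apache/hadoop/2.5.1  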


Permanently Setting the Path


To permanently modify the path, modify the .bashrc file.

.bashrc is a shell script that Bash runs whenever it is started interactively. You can put any command in that file that you could type at the command prompt. You put commands here to set up the shell for use in your particular environment, or to customize things to your preferences.

Follow this command sequence:
 gedit ~/.bashrc  

and when the text editor comes up, add this line (anywhere):
 source /etc/environment  
“sudo” privileges are not necessary, since this file is modified on a per-user basis.

To test your changes, close your current terminal, open a new one, and type
 echo $HADOOP_HOME  
as you did in the prior example.
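
As a further check that the Hadoop bin directory really is on your PATH, you can ask the launcher script for its version. This assumes the bin directory sits directly under $HADOOP_HOME, as laid out in the installation step, and that JAVA_HOME points at a working JDK:
 hadoop version  
    prints the Hadoop release (2.5.1 here) along with build details  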


Update /etc/hosts


The hosts file is a computer file used by an operating system to map hostnames to IP addresses. The hosts file is a plain text file, and is conventionally named hosts.

It will be helpful, for the purpose of subsequent configuration, if we modify the hosts file on this VM instance to include this information:
 192.168.x.y     master  


In later tutorials, as we're setting up the HDFS cluster, we're going to be modifying multiple configuration files.

And in these files, there's going to be a need to refer to different nodes in our cluster by their IP addresses. The master (or NameNode) is one of these nodes, and it's going to be a lot easier if we can refer to it by the name "master" rather than by its IP address. IP addresses can change, so let's constrain that change to a single file. Otherwise, we'd have to re-modify our configuration files, and that can be a real pain, particularly when you're new to HDFS clustering, because you can forget to modify a file that ends up being very important.
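
To make the edit, open the file with sudo gedit /etc/hosts. As a sketch of where this is going, a small cluster's hosts file might end up with entries like these; the IP addresses and the slave names are placeholders for your own machines:
 192.168.1.10    master  
 192.168.1.11    slave1  
 192.168.1.12    slave2  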



Troubleshooting


If you make an error when editing the .bashrc file, you can send Ubuntu into an infinite logon loop the next time you log in (for example, after a reboot). See the second reference link for more information.
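
If that happens, one way out (an assumption on my part; the reference describes the problem in more detail) is to switch to a text console with Ctrl+Alt+F1, log in there, and repair the file:
 nano ~/.bashrc  
    remove or fix the offending line, save, then reboot or return to the graphical session with Ctrl+Alt+F7  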


Reference

  1. New to Linux:
    1. PATH Definition
      1. PATH is an environmental variable in Linux and other Unix-like operating systems that tells the shell which directories to search for executable files in response to commands issued by a user.  Environmental variables tell the shell how to behave as the user works at the command line.  A user's PATH consists of a series of colon-separated absolute paths.  The use of the PATH variable to find executable files eliminates the need for users to remember which directories they are in and to type their absolute path name.
    2. /etc/environment in Ubuntu
      1. A suitable file for environment variable settings that affect the system as a whole. This file is specifically meant for system-wide environment variable settings. It is not a script file, but rather consists of assignment expressions, one per line.
    3. Purpose of .bashrc
      1. .bashrc is a shell script that Bash runs whenever it is started interactively. You can put any command in that file that you could type at the command prompt.
  2. Ubuntu Troubleshooting
    1. Logon Loop
      1. If .bashrc is modified incorrectly, Ubuntu can go into a logon loop.  

Next: Enabling SSH.
