Thursday, November 20, 2014

Configuring a Hadoop Development Environment

This article is part of the Hadoop Masterpage.


When starting with Hadoop, it’s helpful to think of installation and configuration as having two facets:
  1. for deployment
    1. single-node
    2. clustered
  2. for development
If you plan to use a cloud-based solution (e.g. IBM's SoftLayer) to deploy MapReduce algorithms, only the development installation and configuration is necessary.

Installing Hadoop for deployment is becoming increasingly uncommon as cloud-based solutions grow in popularity. Given the myriad Hadoop and Linux distributions, and the number and variety of configuration scenarios, it's far easier to configure Hadoop on a laptop and use a tool like Maven to generate a JAR file, which can then be uploaded to a remote site and deployed in a cloud environment.

It is this second facet – installing and configuring a local development environment with Maven – that this tutorial covers.


Apache Maven


This tutorial assumes familiarity with Apache Maven for building a project. My article Installing Maven on Ubuntu covers installation of Maven (version 3.2.3) on Linux (Ubuntu 14.04).


Initializing the Java Project


Under my home directory I've created a workspace directory, with a hadoop sub-directory beneath it:
 mkdir -p ~/workspace/hadoop/  
 cd ~/workspace/hadoop/  

Within this directory, I'm going to create my project to work with Hadoop:
 mvn archetype:generate -DgroupId=dev.hadoop.sandbox -DartifactId=sandbox -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false  
This will create a project (artifactId = sandbox) with an initial package structure (groupId = dev.hadoop.sandbox).
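To make the result concrete, the following sketch recreates, with plain mkdir/touch, the layout that the quickstart archetype generates (purely for illustration – archetype:generate does all of this for you):

```shell
# Recreate (for illustration only) the project layout produced by
# maven-archetype-quickstart; archetype:generate creates this for you.
mkdir -p sandbox/src/main/java/dev/hadoop/sandbox
mkdir -p sandbox/src/test/java/dev/hadoop/sandbox
touch sandbox/pom.xml                                        # the build descriptor
touch sandbox/src/main/java/dev/hadoop/sandbox/App.java      # "Hello World" main class
touch sandbox/src/test/java/dev/hadoop/sandbox/AppTest.java  # JUnit test stub

# list the files that were created
find sandbox -type f | sort
```

The src/main/java and src/test/java split is the standard Maven convention: production code and test code live in parallel trees under the same package.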


Operational Output

When this command is executed for the first time, various files necessary to Maven will be downloaded.  Depending on your internet connection, this may take a while.

Ultimately, you should end up with something like this:
 craigtrim@CVB:~/workspace/hadoop$ mvn archetype:generate -DgroupId=dev.hadoop.sandbox -DartifactId=sandbox -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false  
 [INFO] Scanning for projects...  
 [INFO]                                       
 [INFO] ------------------------------------------------------------------------  
 [INFO] Building Maven Stub Project (No POM) 1  
 [INFO] ------------------------------------------------------------------------  
 [INFO]   
 [INFO] >>> maven-archetype-plugin:2.2:generate (default-cli) > generate-sources @ standalone-pom >>>  
 [INFO]   
 [INFO] <<< maven-archetype-plugin:2.2:generate (default-cli) < generate-sources @ standalone-pom <<<  
 [INFO]   
 [INFO] --- maven-archetype-plugin:2.2:generate (default-cli) @ standalone-pom ---  
 [INFO] Generating project in Batch mode  
 [INFO] ----------------------------------------------------------------------------  
 [INFO] Using following parameters for creating project from Old (1.x) Archetype: maven-archetype-quickstart:1.0  
 [INFO] ----------------------------------------------------------------------------  
 [INFO] Parameter: basedir, Value: /home/craigtrim/workspace/hadoop  
 [INFO] Parameter: package, Value: dev.hadoop.sandbox  
 [INFO] Parameter: groupId, Value: dev.hadoop.sandbox  
 [INFO] Parameter: artifactId, Value: sandbox  
 [INFO] Parameter: packageName, Value: dev.hadoop.sandbox  
 [INFO] Parameter: version, Value: 1.0-SNAPSHOT  
 [INFO] project created from Old (1.x) Archetype in dir: /home/craigtrim/workspace/hadoop/sandbox  
 [INFO] ------------------------------------------------------------------------  
 [INFO] BUILD SUCCESS  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Total time: 1.993 s  
 [INFO] Finished at: 2014-11-24T15:57:47-08:00  
 [INFO] Final Memory: 13M/241M  
 [INFO] ------------------------------------------------------------------------  
 craigtrim@CVB:~/workspace/hadoop$   

Look for the line in your terminal output that reads "BUILD SUCCESS".


Updating the POM


Next, navigate into the project
 cd sandbox  

and edit the pom.xml file with the text editor of your choice (no sudo is needed for a file in your home directory):
 gedit pom.xml  

You will want to add the Maven Dependencies for Hadoop.

My final POM looks like this:
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"  
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">  
      <modelVersion>4.0.0</modelVersion>  
      <groupId>dev.hadoop.sandbox</groupId>  
      <artifactId>sandbox</artifactId>  
      <packaging>jar</packaging>  
      <version>1.0-SNAPSHOT</version>  
      <name>sandbox</name>  
      <url>http://maven.apache.org</url>  
      <dependencies>  
           <dependency>  
                <groupId>junit</groupId>  
                <artifactId>junit</artifactId>  
                <version>3.8.1</version>  
                <scope>test</scope>  
           </dependency>  
           <dependency>  
                <groupId>org.apache.hadoop</groupId>  
                <artifactId>hadoop-core</artifactId>  
                <version>1.2.1</version>  
           </dependency>  
           <dependency>  
                <groupId>org.apache.hadoop</groupId>  
                <artifactId>hadoop-client</artifactId>  
                <version>2.5.2</version>  
           </dependency>  
           <dependency>  
                <groupId>org.apache.hadoop</groupId>  
                <artifactId>hadoop-hdfs</artifactId>  
                <version>2.5.2</version>  
           </dependency>  
           <dependency>  
                <groupId>org.apache.hadoop</groupId>  
                <artifactId>hadoop-common</artifactId>  
                <version>2.5.2</version>  
           </dependency>  
      </dependencies>  
 </project>  
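One optional refinement, if you will eventually deploy to a cluster that already ships the Hadoop libraries: mark the Hadoop dependencies with provided scope, so they are on the compile classpath but are not bundled into your artifact. A sketch for one of the dependencies (the others follow the same pattern):

```xml
<dependency>
     <groupId>org.apache.hadoop</groupId>
     <artifactId>hadoop-client</artifactId>
     <version>2.5.2</version>
     <!-- provided: available at compile time; the cluster supplies it at runtime -->
     <scope>provided</scope>
</dependency>
```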



Package the Project


Now we're going to execute what's known as a build phase.  A phase is a step in the build lifecycle.

Navigate into the project
 cd /home/craigtrim/workspace/hadoop/sandbox  

and run this command:
 mvn package  

This is the "package command" and will instruct Maven to take the compiled code and package it in its distributable format, such as a JAR.

You can verify the JAR by running the generated App class from it:
 java -cp target/sandbox-1.0-SNAPSHOT.jar dev.hadoop.sandbox.App  
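Note that the plain JAR produced by mvn package contains only your own classes, not the Hadoop dependencies. If you intend to upload a single runnable JAR to a remote environment, as described at the start of this article, one common approach is the maven-shade-plugin, which builds an "uber" JAR during the package phase. A sketch for the build section of the POM (the plugin version is an assumption based on what was current at the time):

```xml
<build>
     <plugins>
          <plugin>
               <groupId>org.apache.maven.plugins</groupId>
               <artifactId>maven-shade-plugin</artifactId>
               <version>2.3</version>
               <executions>
                    <execution>
                         <!-- run the shade goal whenever mvn package executes -->
                         <phase>package</phase>
                         <goals>
                              <goal>shade</goal>
                         </goals>
                    </execution>
               </executions>
          </plugin>
     </plugins>
</build>
```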


Operational Output

Successful output on my workstation looks like this:
 craigtrim@CVB:~$ cd ~/workspace/hadoop/  
 craigtrim@CVB:~/workspace/hadoop$ mvn package  
 [INFO] Scanning for projects...  
 [INFO] ------------------------------------------------------------------------  
 [INFO] BUILD FAILURE  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Total time: 0.130 s  
 [INFO] Finished at: 2014-11-24T16:02:56-08:00  
 [INFO] Final Memory: 5M/241M  
 [INFO] ------------------------------------------------------------------------  
 [ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/home/craigtrim/workspace/hadoop). Please verify you invoked Maven from the correct directory. -> [Help 1]  
 [ERROR]   
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.  
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.  
 [ERROR]   
 [ERROR] For more information about the errors and possible solutions, please read the following articles:  
 [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException  
 craigtrim@CVB:~/workspace/hadoop$ clear  
 craigtrim@CVB:~/workspace/hadoop$ cd sandbox/  
 craigtrim@CVB:~/workspace/hadoop/sandbox$ mvn package  
 [INFO] Scanning for projects...  
 [INFO]                                       
 [INFO] ------------------------------------------------------------------------  
 [INFO] Building sandbox 1.0-SNAPSHOT  
 [INFO] ------------------------------------------------------------------------  
 [INFO]   
 [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ sandbox ---  
 [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!  
 [INFO] skip non existing resourceDirectory /home/craigtrim/workspace/hadoop/sandbox/src/main/resources  
 [INFO]   
 [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ sandbox ---  
 [INFO] Changes detected - recompiling the module!  
 [WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!  
 [INFO] Compiling 1 source file to /home/craigtrim/workspace/hadoop/sandbox/target/classes  
 [INFO]   
 [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ sandbox ---  
 [WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!  
 [INFO] skip non existing resourceDirectory /home/craigtrim/workspace/hadoop/sandbox/src/test/resources  
 [INFO]   
 [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ sandbox ---  
 [INFO] Changes detected - recompiling the module!  
 [WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!  
 [INFO] Compiling 1 source file to /home/craigtrim/workspace/hadoop/sandbox/target/test-classes  
 [INFO]   
 [INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ sandbox ---  
 [INFO] Surefire report directory: /home/craigtrim/workspace/hadoop/sandbox/target/surefire-reports  
 -------------------------------------------------------  
  T E S T S  
 -------------------------------------------------------  
 Running dev.hadoop.sandbox.AppTest  
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec  
 Results :  
 Tests run: 1, Failures: 0, Errors: 0, Skipped: 0  
 [INFO]   
 [INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ sandbox ---  
 [INFO] Building jar: /home/craigtrim/workspace/hadoop/sandbox/target/sandbox-1.0-SNAPSHOT.jar  
 [INFO] ------------------------------------------------------------------------  
 [INFO] BUILD SUCCESS  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Total time: 2.993 s  
 [INFO] Finished at: 2014-11-24T16:03:13-08:00  
 [INFO] Final Memory: 25M/319M  
 [INFO] ------------------------------------------------------------------------  
 craigtrim@CVB:~/workspace/hadoop/sandbox$   

Keep in mind that Maven has already downloaded the Hadoop JARs on my workstation.

The output I have shared above is what you see on subsequent runs. The first time a new dependency is added to a POM file, Maven will seek out all of its direct and indirect dependencies and download them to your local repository.  See the Repository Navigation and Dependency Visualization sections toward the end of this article.


Preparing for Eclipse


Once the packaging phase has completed, we need to prepare the project for use within Eclipse.

All the standard Eclipse metadata (the .project and .classpath files) required to support this project within the IDE can be generated using this command:
 mvn eclipse:eclipse  

from within the project directory (the same location where you ran mvn package).


Operational Output

Successful output on my workstation looks like this:
 craigtrim@CVB:~/workspace/hadoop/sandbox$ mvn eclipse:eclipse  
 [INFO] Scanning for projects...  
 [INFO]                                       
 [INFO] ------------------------------------------------------------------------  
 [INFO] Building sandbox 1.0-SNAPSHOT  
 [INFO] ------------------------------------------------------------------------  
 [INFO]   
 [INFO] >>> maven-eclipse-plugin:2.9:eclipse (default-cli) > generate-resources @ sandbox >>>  
 [INFO]   
 [INFO] <<< maven-eclipse-plugin:2.9:eclipse (default-cli) < generate-resources @ sandbox <<<  
 [INFO]   
 [INFO] --- maven-eclipse-plugin:2.9:eclipse (default-cli) @ sandbox ---  
 [INFO] Using Eclipse Workspace: /home/craigtrim/workspace/hadoop  
 [INFO] Adding default classpath container: org.eclipse.jdt.launching.JRE_CONTAINER  
 [INFO] Not writing settings - defaults suffice  
 [INFO] Wrote Eclipse project for "sandbox" to /home/craigtrim/workspace/hadoop/sandbox.  
 [INFO]   
 [INFO] ------------------------------------------------------------------------  
 [INFO] BUILD SUCCESS  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Total time: 2.819 s  
 [INFO] Finished at: 2014-11-24T16:32:20-08:00  
 [INFO] Final Memory: 14M/431M  
 [INFO] ------------------------------------------------------------------------  
 craigtrim@CVB:~/workspace/hadoop/sandbox$  



Running Eclipse


Open Eclipse with the workspace root (~/workspace/hadoop) as the workspace.


Once inside, you'll need to import the "sandbox" project using
 File > Import > Existing Projects into Workspace  


Build Phases


While this article is not strictly a Maven tutorial, it's helpful to understand what happens when "mvn package" is executed on the command line.

In the default lifecycle, the package phase runs after every phase that precedes it. The phases of the default lifecycle are:
 validate - validate the project is correct and all necessary information is available  
 compile - compile the source code of the project  
 test - test the compiled source code using a suitable unit testing framework. These tests should not require the code be packaged or deployed  
 package - take the compiled code and package it in its distributable format, such as a JAR.  
 integration-test - process and deploy the package if necessary into an environment where integration tests can be run  
 verify - run any checks to verify the package is valid and meets quality criteria  
 install - install the package into the local repository, for use as a dependency in other projects locally  
 deploy - done in an integration or release environment, copies the final package to the remote repository for sharing with other developers and projects.  
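The key point is that invoking any phase runs every phase before it in order, then stops. This tiny shell sketch (an illustration of the ordering rule only, not Maven itself) mimics that behavior:

```shell
# Mimic Maven's default lifecycle ordering: running a target phase
# executes every phase up to and including it, then stops.
phases="validate compile test package integration-test verify install deploy"
target="package"
ran=""
for p in $phases; do
  ran="$ran $p"
  echo "running: $p"
  [ "$p" = "$target" ] && break
done
# phases after the target (integration-test, verify, install, deploy)
# are never reached when the target is "package"
```

So "mvn package" implies validate, compile, and test have all already run, which is why the Surefire test report appears in the packaging output above.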



Repository Navigation


In Ubuntu, I can view the local Maven repository by opening a file browser and navigating to my Home directory. Once there, I use the CTRL+H keyboard shortcut to show hidden files and folders. The .m2 folder (the local repository lives under ~/.m2/repository) should appear and look something like this:


This folder contains the direct and transitive dependencies pulled in by the four Hadoop dependencies I've added to the POM file.


Dependency Visualization

To configure Eclipse for Hadoop development, we had to specify four dependencies in the Maven POM file.

It turns out there are exactly 84 JAR files (at the time of this article) required to support this configuration. Maven saves us from having to track down each of these 84 dependencies, most of them indirect (to the nth degree!), by downloading them for us and resolving any version conflicts automatically.

To see what's really going on, there's a handy tree visualization command that we can use.

Run this in the same directory as your project:
 mvn dependency:tree  


Operational Output

Successful output on my workstation looks like this:
 dev.hadoop.sandbox:sandbox:jar:1.0-SNAPSHOT   
  +- junit:junit:jar:3.8.1:test   
  +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile   
  | +- commons-cli:commons-cli:jar:1.2:compile   
  | +- xmlenc:xmlenc:jar:0.52:compile   
  | +- com.sun.jersey:jersey-core:jar:1.8:compile   
  | +- com.sun.jersey:jersey-json:jar:1.8:compile   
  | | +- org.codehaus.jettison:jettison:jar:1.1:compile   
  | | | \- stax:stax-api:jar:1.0.1:compile   
  | | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile   
  | | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile   
  | | |  +- javax.xml.stream:stax-api:jar:1.0-2:compile   
  | | |  \- javax.activation:activation:jar:1.1:compile   
  | | +- org.codehaus.jackson:jackson-jaxrs:jar:1.7.1:compile   
  | | \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile   
  | +- com.sun.jersey:jersey-server:jar:1.8:compile   
  | | \- asm:asm:jar:3.1:compile   
  | +- commons-io:commons-io:jar:2.1:compile   
  | +- commons-httpclient:commons-httpclient:jar:3.0.1:compile   
  | +- commons-codec:commons-codec:jar:1.4:compile   
  | +- org.apache.commons:commons-math:jar:2.1:compile   
  | +- commons-configuration:commons-configuration:jar:1.6:compile   
  | | +- commons-digester:commons-digester:jar:1.8:compile   
  | | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile   
  | | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile   
  | +- commons-net:commons-net:jar:1.4.1:compile   
  | +- org.mortbay.jetty:jetty:jar:6.1.26:compile   
  | | \- org.mortbay.jetty:servlet-api:jar:2.5-20081211:compile   
  | +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile   
  | +- tomcat:jasper-runtime:jar:5.5.12:compile   
  | +- tomcat:jasper-compiler:jar:5.5.12:compile   
  | +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile   
  | | \- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile   
  | +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile   
  | | \- ant:ant:jar:1.6.5:compile   
  | +- commons-el:commons-el:jar:1.0:compile   
  | +- net.java.dev.jets3t:jets3t:jar:0.6.1:compile   
  | +- hsqldb:hsqldb:jar:1.8.0.10:compile   
  | +- oro:oro:jar:2.0.8:compile   
  | +- org.eclipse.jdt:core:jar:3.1.1:compile   
  | \- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile   
  +- org.apache.hadoop:hadoop-client:jar:2.5.2:compile   
  | +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.5.2:compile   
  | | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.5.2:compile   
  | | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.5.2:compile   
  | | | | \- com.sun.jersey:jersey-client:jar:1.9:compile   
  | | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.5.2:compile   
  | | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.5.2:compile   
  | |  \- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile   
  | +- org.apache.hadoop:hadoop-yarn-api:jar:2.5.2:compile   
  | +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.5.2:compile   
  | | \- org.apache.hadoop:hadoop-yarn-common:jar:2.5.2:compile   
  | +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.5.2:compile   
  | \- org.apache.hadoop:hadoop-annotations:jar:2.5.2:compile   
  +- org.apache.hadoop:hadoop-hdfs:jar:2.5.2:compile   
  | +- com.google.guava:guava:jar:11.0.2:compile   
  | +- commons-lang:commons-lang:jar:2.6:compile   
  | +- commons-logging:commons-logging:jar:1.1.3:compile   
  | +- commons-daemon:commons-daemon:jar:1.0.13:compile   
  | +- javax.servlet.jsp:jsp-api:jar:2.1:compile   
  | +- log4j:log4j:jar:1.2.17:compile   
  | +- com.google.protobuf:protobuf-java:jar:2.5.0:compile   
  | +- javax.servlet:servlet-api:jar:2.5:compile   
  | +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile   
  | \- io.netty:netty:jar:3.6.2.Final:compile   
  \- org.apache.hadoop:hadoop-common:jar:2.5.2:compile   
   +- org.apache.commons:commons-math3:jar:3.1.1:compile   
   +- commons-collections:commons-collections:jar:3.2.1:compile   
   +- org.slf4j:slf4j-api:jar:1.7.5:compile   
   +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile   
   +- org.apache.avro:avro:jar:1.7.4:compile   
   | +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile   
   | \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile   
   +- org.apache.hadoop:hadoop-auth:jar:2.5.2:compile   
   | +- org.apache.httpcomponents:httpclient:jar:4.2.5:compile   
   | | \- org.apache.httpcomponents:httpcore:jar:4.2.4:compile   
   | \- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:compile   
   |  +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:compile   
   |  +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:compile   
   |  \- org.apache.directory.api:api-util:jar:1.0.0-M20:compile   
   +- com.jcraft:jsch:jar:0.1.42:compile   
   +- com.google.code.findbugs:jsr305:jar:1.3.9:compile   
   +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile   
   \- org.apache.commons:commons-compress:jar:1.4.1:compile   
   \- org.tukaani:xz:jar:1.0:compile   

The effort to establish all of those dependencies manually would be immense.  Manual dependency resolution is possible, but extremely difficult and error-prone.
