When starting with Hadoop, it’s helpful to think of installation and configuration as having two facets:
- for deployment
- single-node
- clustered
- for development
Working with Hadoop deployments directly is becoming increasingly uncommon as cloud-based solutions grow in popularity. Given the myriad Hadoop and Linux distributions, and the number and variety of configuration scenarios, it's far easier to configure a version of Hadoop on a laptop and use a tool like Maven to generate a JAR file, which can then be uploaded to a remote site and deployed in a cloud environment.
It is this second option – installing and configuring Maven for local development – that this tutorial will cover.
Apache Maven
This tutorial assumes familiarity with Apache Maven for building a project. My article Installing Maven on Ubuntu covers installation of Maven (version 3.2.3) on Linux (Ubuntu 14.04).
Initializing the Java Project
Under my home directory I've created a workspace directory, and a hadoop sub-directory under that:
mkdir -p ~/workspace/hadoop/
cd ~/workspace/hadoop/
Within this directory, I'm going to create my project to work with Hadoop:
mvn archetype:generate -DgroupId=dev.hadoop.sandbox -DartifactId=sandbox -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
This will create a project (artifactId = sandbox) with an initial package structure (groupId = dev.hadoop.sandbox).
Operational Output
When this command is executed for the first time, various files necessary to Maven will be downloaded. Depending on your internet connection, this may take a while. Ultimately, you should end up with something like this:
craigtrim@CVB:~/workspace/hadoop$ mvn archetype:generate -DgroupId=dev.hadoop.sandbox -DartifactId=sandbox -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Maven Stub Project (No POM) 1
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> maven-archetype-plugin:2.2:generate (default-cli) > generate-sources @ standalone-pom >>>
[INFO]
[INFO] <<< maven-archetype-plugin:2.2:generate (default-cli) < generate-sources @ standalone-pom <<<
[INFO]
[INFO] --- maven-archetype-plugin:2.2:generate (default-cli) @ standalone-pom ---
[INFO] Generating project in Batch mode
[INFO] ----------------------------------------------------------------------------
[INFO] Using following parameters for creating project from Old (1.x) Archetype: maven-archetype-quickstart:1.0
[INFO] ----------------------------------------------------------------------------
[INFO] Parameter: basedir, Value: /home/craigtrim/workspace/hadoop
[INFO] Parameter: package, Value: dev.hadoop.sandbox
[INFO] Parameter: groupId, Value: dev.hadoop.sandbox
[INFO] Parameter: artifactId, Value: sandbox
[INFO] Parameter: packageName, Value: dev.hadoop.sandbox
[INFO] Parameter: version, Value: 1.0-SNAPSHOT
[INFO] project created from Old (1.x) Archetype in dir: /home/craigtrim/workspace/hadoop/sandbox
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.993 s
[INFO] Finished at: 2014-11-24T15:57:47-08:00
[INFO] Final Memory: 13M/241M
[INFO] ------------------------------------------------------------------------
craigtrim@CVB:~/workspace/hadoop$
Look for the line in your terminal output that reads "BUILD SUCCESS".
Updating the POM
Next, navigate into the project
cd sandbox
and edit the pom.xml file with the text editor of your choice.
gedit pom.xml
You will want to add the Maven Dependencies for Hadoop.
My final POM looks like this:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>dev.hadoop.sandbox</groupId>
  <artifactId>sandbox</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>sandbox</name>
  <url>http://maven.apache.org</url>
  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>1.2.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.5.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.5.2</version>
    </dependency>
  </dependencies>
</project>
Package the Project
Now we're going to execute what's known as a project phase. A phase is a step in the Maven build lifecycle.
Navigate into the project
cd /home/craigtrim/workspace/hadoop/sandbox
and run this command:
mvn package
This invokes the package phase, which instructs Maven to take the compiled code and package it in its distributable format, such as a JAR.
You can then smoke-test the resulting JAR file by running its main class:
java -cp target/sandbox-1.0-SNAPSHOT.jar dev.hadoop.sandbox.App
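For reference, the maven-archetype-quickstart archetype generates a minimal App class along these lines (a sketch: in the generated project the file begins with "package dev.hadoop.sandbox;", omitted here so the snippet is self-contained, and comments may differ):

```java
/**
 * Sketch of the placeholder entry point generated by maven-archetype-quickstart.
 * Running the packaged JAR against this main class prints a greeting.
 */
public class App {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```

If the java -cp command above prints Hello World!, the JAR was packaged correctly.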
Operational Output
Output on my workstation looks like this. Note that the first attempt fails because mvn was invoked from the parent directory, where there is no POM; the second attempt, from inside the project directory, succeeds:
craigtrim@CVB:~$ cd ~/workspace/hadoop/
craigtrim@CVB:~/workspace/hadoop$ mvn package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.130 s
[INFO] Finished at: 2014-11-24T16:02:56-08:00
[INFO] Final Memory: 5M/241M
[INFO] ------------------------------------------------------------------------
[ERROR] The goal you specified requires a project to execute but there is no POM in this directory (/home/craigtrim/workspace/hadoop). Please verify you invoked Maven from the correct directory. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException
craigtrim@CVB:~/workspace/hadoop$ clear
craigtrim@CVB:~/workspace/hadoop$ cd sandbox/
craigtrim@CVB:~/workspace/hadoop/sandbox$ mvn package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building sandbox 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ sandbox ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/craigtrim/workspace/hadoop/sandbox/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ sandbox ---
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 1 source file to /home/craigtrim/workspace/hadoop/sandbox/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ sandbox ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /home/craigtrim/workspace/hadoop/sandbox/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ sandbox ---
[INFO] Changes detected - recompiling the module!
[WARNING] File encoding has not been set, using platform encoding UTF-8, i.e. build is platform dependent!
[INFO] Compiling 1 source file to /home/craigtrim/workspace/hadoop/sandbox/target/test-classes
[INFO]
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ sandbox ---
[INFO] Surefire report directory: /home/craigtrim/workspace/hadoop/sandbox/target/surefire-reports
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Running dev.hadoop.sandbox.AppTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.011 sec
Results :
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ sandbox ---
[INFO] Building jar: /home/craigtrim/workspace/hadoop/sandbox/target/sandbox-1.0-SNAPSHOT.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.993 s
[INFO] Finished at: 2014-11-24T16:03:13-08:00
[INFO] Final Memory: 25M/319M
[INFO] ------------------------------------------------------------------------
craigtrim@CVB:~/workspace/hadoop/sandbox$
Keep in mind, I've instructed Maven to download the Hadoop JARs in the past; the output shared above is from a subsequent run. The first time a new dependency is added to a POM file, Maven will seek out all of its direct and transitive dependencies and download them to your local repository. See the end of this article for repository navigation and dependency visualization.
Preparing for Eclipse
Once the packaging phase has completed, we need to prepare the project for use within Eclipse.
All the Eclipse standard meta-data required to support this project within the IDE can be generated using this command:
mvn eclipse:eclipse
from within the project directory (the same location where mvn package was run).
Operational Output
Successful output on my workstation looks like this:
craigtrim@CVB:~/workspace/hadoop/sandbox$ mvn eclipse:eclipse
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building sandbox 1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] >>> maven-eclipse-plugin:2.9:eclipse (default-cli) > generate-resources @ sandbox >>>
[INFO]
[INFO] <<< maven-eclipse-plugin:2.9:eclipse (default-cli) < generate-resources @ sandbox <<<
[INFO]
[INFO] --- maven-eclipse-plugin:2.9:eclipse (default-cli) @ sandbox ---
[INFO] Using Eclipse Workspace: /home/craigtrim/workspace/hadoop
[INFO] Adding default classpath container: org.eclipse.jdt.launching.JRE_CONTAINER
[INFO] Not writing settings - defaults suffice
[INFO] Wrote Eclipse project for "sandbox" to /home/craigtrim/workspace/hadoop/sandbox.
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 2.819 s
[INFO] Finished at: 2014-11-24T16:32:20-08:00
[INFO] Final Memory: 14M/431M
[INFO] ------------------------------------------------------------------------
craigtrim@CVB:~/workspace/hadoop/sandbox$
Running Eclipse
Open Eclipse at the workspace root (/home/craigtrim/workspace/hadoop).
Once inside, you'll need to import the "sandbox" project using
File > Import > Existing Projects into Workspace
Build Phases
While this article is not strictly a Maven tutorial, it's helpful to understand what happens when "mvn package" is executed on the command line.
Invoking the package phase executes, in order, every phase that precedes it in the default lifecycle, then package itself:
- validate - validate the project is correct and all necessary information is available
- compile - compile the source code of the project
- test - test the compiled source code using a suitable unit testing framework; these tests should not require the code to be packaged or deployed
- package - take the compiled code and package it in its distributable format, such as a JAR
The remaining phases in the default lifecycle run only when explicitly invoked (each one also runs everything before it):
- integration-test - process and deploy the package, if necessary, into an environment where integration tests can be run
- verify - run any checks to verify the package is valid and meets quality criteria
- install - install the package into the local repository, for use as a dependency in other projects locally
- deploy - done in an integration or release environment; copies the final package to the remote repository for sharing with other developers and projects
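The ordering above can be sketched as a simple loop: invoking any phase runs that phase and everything before it, and stops there. This is an illustration of the ordering only, not Maven itself, and it lists just the coarse-grained phases (the real lifecycle also contains finer-grained phases such as generate-sources and process-resources):

```shell
# the main default-lifecycle phases, in order
phases="validate compile test package integration-test verify install deploy"

# 'mvn package' conceptually walks the list and stops after 'package'
for p in $phases; do
  echo "running phase: $p"
  [ "$p" = "package" ] && break
done
```

Running this prints the four phases up to and including package, mirroring what mvn package does.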
Repository Navigation
In Ubuntu, I can view the local Maven repository by opening a file browser and navigating to my Home directory. Once there, I press CTRL+H to show hidden files and folders; the .m2 folder should then appear.
This folder contains many of the direct and transitive dependencies pulled in by the four dependencies I've added to the POM file.
Dependency Visualization
To configure Eclipse for Hadoop development, we had to specify four dependencies in the Maven POM file. It turns out there are exactly 84 JAR files (at the time of this article) required to support this configuration. Maven saves us from having to track down each of these 84 dependencies, most of them indirect (to the nth degree!), by downloading them for us and resolving any conflicts automatically.
To see what's really going on, there's a handy tree visualization command that we can use.
Run this in the same directory as your project:
mvn dependency:tree
Operational Output
Successful output on my workstation looks like this:
org.test.hadoop:hadoop:jar:1.0-SNAPSHOT
+- junit:junit:jar:3.8.1:test
+- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
| +- commons-cli:commons-cli:jar:1.2:compile
| +- xmlenc:xmlenc:jar:0.52:compile
| +- com.sun.jersey:jersey-core:jar:1.8:compile
| +- com.sun.jersey:jersey-json:jar:1.8:compile
| | +- org.codehaus.jettison:jettison:jar:1.1:compile
| | | \- stax:stax-api:jar:1.0.1:compile
| | +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
| | | \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
| | | +- javax.xml.stream:stax-api:jar:1.0-2:compile
| | | \- javax.activation:activation:jar:1.1:compile
| | +- org.codehaus.jackson:jackson-jaxrs:jar:1.7.1:compile
| | \- org.codehaus.jackson:jackson-xc:jar:1.7.1:compile
| +- com.sun.jersey:jersey-server:jar:1.8:compile
| | \- asm:asm:jar:3.1:compile
| +- commons-io:commons-io:jar:2.1:compile
| +- commons-httpclient:commons-httpclient:jar:3.0.1:compile
| +- commons-codec:commons-codec:jar:1.4:compile
| +- org.apache.commons:commons-math:jar:2.1:compile
| +- commons-configuration:commons-configuration:jar:1.6:compile
| | +- commons-digester:commons-digester:jar:1.8:compile
| | | \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
| | \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
| +- commons-net:commons-net:jar:1.4.1:compile
| +- org.mortbay.jetty:jetty:jar:6.1.26:compile
| | \- org.mortbay.jetty:servlet-api:jar:2.5-20081211:compile
| +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
| +- tomcat:jasper-runtime:jar:5.5.12:compile
| +- tomcat:jasper-compiler:jar:5.5.12:compile
| +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
| | \- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
| +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
| | \- ant:ant:jar:1.6.5:compile
| +- commons-el:commons-el:jar:1.0:compile
| +- net.java.dev.jets3t:jets3t:jar:0.6.1:compile
| +- hsqldb:hsqldb:jar:1.8.0.10:compile
| +- oro:oro:jar:2.0.8:compile
| +- org.eclipse.jdt:core:jar:3.1.1:compile
| \- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
+- org.apache.hadoop:hadoop-client:jar:2.5.2:compile
| +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.5.2:compile
| | +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.5.2:compile
| | | +- org.apache.hadoop:hadoop-yarn-client:jar:2.5.2:compile
| | | | \- com.sun.jersey:jersey-client:jar:1.9:compile
| | | \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.5.2:compile
| | \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.5.2:compile
| | \- org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile
| +- org.apache.hadoop:hadoop-yarn-api:jar:2.5.2:compile
| +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.5.2:compile
| | \- org.apache.hadoop:hadoop-yarn-common:jar:2.5.2:compile
| +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.5.2:compile
| \- org.apache.hadoop:hadoop-annotations:jar:2.5.2:compile
+- org.apache.hadoop:hadoop-hdfs:jar:2.5.2:compile
| +- com.google.guava:guava:jar:11.0.2:compile
| +- commons-lang:commons-lang:jar:2.6:compile
| +- commons-logging:commons-logging:jar:1.1.3:compile
| +- commons-daemon:commons-daemon:jar:1.0.13:compile
| +- javax.servlet.jsp:jsp-api:jar:2.1:compile
| +- log4j:log4j:jar:1.2.17:compile
| +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
| +- javax.servlet:servlet-api:jar:2.5:compile
| +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
| \- io.netty:netty:jar:3.6.2.Final:compile
\- org.apache.hadoop:hadoop-common:jar:2.5.2:compile
+- org.apache.commons:commons-math3:jar:3.1.1:compile
+- commons-collections:commons-collections:jar:3.2.1:compile
+- org.slf4j:slf4j-api:jar:1.7.5:compile
+- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
+- org.apache.avro:avro:jar:1.7.4:compile
| +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
| \- org.xerial.snappy:snappy-java:jar:1.0.4.1:compile
+- org.apache.hadoop:hadoop-auth:jar:2.5.2:compile
| +- org.apache.httpcomponents:httpclient:jar:4.2.5:compile
| | \- org.apache.httpcomponents:httpcore:jar:4.2.4:compile
| \- org.apache.directory.server:apacheds-kerberos-codec:jar:2.0.0-M15:compile
| +- org.apache.directory.server:apacheds-i18n:jar:2.0.0-M15:compile
| +- org.apache.directory.api:api-asn1-api:jar:1.0.0-M20:compile
| \- org.apache.directory.api:api-util:jar:1.0.0-M20:compile
+- com.jcraft:jsch:jar:0.1.42:compile
+- com.google.code.findbugs:jsr305:jar:1.3.9:compile
+- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
\- org.apache.commons:commons-compress:jar:1.4.1:compile
\- org.tukaani:xz:jar:1.0:compile
Establishing all of those dependencies manually would be an immense effort; it is possible, but extremely difficult and error prone.
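Occasionally Maven's automatic resolution picks a transitive version you don't want. In that case, a transitive dependency can be excluded explicitly in the POM. The fragment below is a hypothetical illustration only, not a change this tutorial requires; the servlet-api exclusion is an example of the syntax:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.5.2</version>
  <exclusions>
    <exclusion>
      <groupId>javax.servlet</groupId>
      <artifactId>servlet-api</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After adding an exclusion, re-running mvn dependency:tree confirms the excluded artifact no longer appears under that branch.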