Hadoop: Loading data into Hadoop

The HDFS File System

HDFS is not fully POSIX-compliant, because the requirements of a POSIX filesystem differ from the design goals of a Hadoop application.

HDFS is a distributed filesystem that stores large files across multiple machines. Just like a Unix filesystem, HDFS allows users to manipulate the filesystem using shell commands. Most HDFS commands have a one-to-one correspondence with Unix commands.
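To make the correspondence concrete, here is a small sketch that pairs a few common Unix commands with their HDFS counterparts. The helper name hdfs_equivalent is hypothetical and exists only for illustration; the subcommands it prints (-ls, -mkdir, -cat, -cp, -mv, -rm, -du) are standard options of the HDFS shell.

```shell
# Hypothetical helper (not part of Hadoop): print the HDFS shell
# counterpart of a common Unix file command.
hdfs_equivalent() {
  case "$1" in
    ls)    echo "hdfs dfs -ls" ;;
    mkdir) echo "hdfs dfs -mkdir" ;;
    cat)   echo "hdfs dfs -cat" ;;
    cp)    echo "hdfs dfs -cp" ;;
    mv)    echo "hdfs dfs -mv" ;;
    rm)    echo "hdfs dfs -rm" ;;
    du)    echo "hdfs dfs -du" ;;
    *)     echo "no direct equivalent known for: $1" >&2; return 1 ;;
  esac
}

# Example: print the HDFS form of "ls"
hdfs_equivalent ls
```

The mapping is deliberately short. Commands that depend on in-place edits (an editor, for instance) have no counterpart, which is one practical consequence of HDFS not being fully POSIX-compliant.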

Assumptions

This section assumes that:

You are already logged onto a Linux NameNode and are transferring files from that NameNode's local filesystem onto the HDFS filesystem. For instructions on how to copy files onto the NameNode itself (perhaps from a Windows machine), please read this article.

Hadoop has been started.
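One way to sanity-check the last assumption is to try listing the HDFS root. This is only a sketch: the helper name hdfs_is_up is hypothetical, and it assumes that fs.defaultFS in your configuration points at the HDFS NameNode (otherwise the listing would succeed against the local filesystem and prove nothing).

```shell
# Hypothetical helper: succeeds only if the filesystem root can be listed.
# Assumes fs.defaultFS points at HDFS, so success implies the NameNode is reachable.
hdfs_is_up() {
  hdfs dfs -ls / >/dev/null 2>&1
}

# Only probe when the hdfs client is actually installed on this machine.
if command -v hdfs >/dev/null 2>&1; then
  if hdfs_is_up; then
    echo "HDFS is reachable"
  else
    echo "HDFS is not reachable; has Hadoop been started?" >&2
  fi
fi
```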

Copying Files into HDFS

In this example, I have some news article data in my home directory, under ~/nyt.

I'm going to copy this data into my HDFS filesystem:

hdfs dfs -mkdir /nyt
hdfs dfs -put ~/nyt /nyt
hdfs dfs -ls /nyt
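The three steps can be wrapped into a small reusable function. This is a sketch, not a definitive script: the name load_into_hdfs is hypothetical, and I have added a local existence check, the -p flag on mkdir, and a recursive listing that the commands above do not use. Note that when the source is a directory and the destination already exists, put copies the directory into the destination, so ~/nyt would end up under /nyt/nyt.

```shell
# Hypothetical helper: copy a local directory into an HDFS directory.
# Usage: load_into_hdfs LOCAL_DIR HDFS_DIR
load_into_hdfs() {
  local src="$1" dest="$2"
  # Fail early if the local source directory is missing
  if [ ! -d "$src" ]; then
    echo "local directory not found: $src" >&2
    return 1
  fi
  # -p creates missing parents and does not fail if dest already exists
  hdfs dfs -mkdir -p "$dest" || return 1
  # dest exists at this point, so src is copied INTO it (dest/<name of src>)
  hdfs dfs -put "$src" "$dest" || return 1
  # -R lists recursively, so the nested directory's files are shown too
  hdfs dfs -ls -R "$dest"
}

# Example call, mirroring the commands above:
# load_into_hdfs ~/nyt /nyt
```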

Removing Files from HDFS

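The following command recursively deletes the directory I created in the previous section, along with everything in it. The log line in the output reports a trash deletion interval of 0 minutes, which means the trash feature is disabled and the delete takes effect immediately: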

craigtrim@CVB:/usr/lib/apache/hadoop/2.5.2/bin$ hdfs dfs -rm -r /nyt
2014-11-24 14:21:03,517 INFO [main] fs.TrashPolicyDefault (TrashPolicyDefault.java:initialize(92)) - Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.  
Deleted /nyt
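Because a recursive delete with trash disabled cannot be undone, a guarded wrapper can be useful. The following is a minimal sketch: the function name remove_from_hdfs and the refusal to delete the root are my own additions, and the existence check relies on the standard -test -d option of the HDFS shell.

```shell
# Hypothetical helper: recursively delete an HDFS directory, but only if it exists.
# Usage: remove_from_hdfs HDFS_DIR
remove_from_hdfs() {
  local target="$1"
  # Refuse an empty argument and the filesystem root
  case "$target" in
    ""|"/") echo "refusing to delete '$target'" >&2; return 1 ;;
  esac
  # -test -d exits 0 when the path exists and is a directory
  if hdfs dfs -test -d "$target"; then
    hdfs dfs -rm -r "$target"
  else
    echo "nothing to delete at $target" >&2
    return 1
  fi
}

# Example call:
# remove_from_hdfs /nyt
```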

The Web Interface

It is possible to browse the HDFS filesystem using the NameNode Web Interface.

The NameNode Web Interface is served on port 50070 of the NameNode (substitute your own IP address):

http://192.168.x.y:50070

Click Utilities > Browse the File System in the menu header to visually browse the filesystem in read-only mode.
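The same HTTP port can also be queried from the command line through the WebHDFS REST API, provided WebHDFS is enabled on the cluster (an assumption on my part; check dfs.webhdfs.enabled). The sketch below only builds the listing URL. The function name webhdfs_list_url and the host namenode.example.com are placeholders, not values from my setup.

```shell
# Hypothetical helper: build a WebHDFS LISTSTATUS URL for an HDFS path.
# Usage: webhdfs_list_url HOST HDFS_PATH [PORT]   (PORT defaults to 50070)
webhdfs_list_url() {
  local host="$1" path="$2" port="${3:-50070}"
  echo "http://${host}:${port}/webhdfs/v1${path}?op=LISTSTATUS"
}

# Example (placeholder host): fetch the directory listing as JSON with curl
# curl -s "$(webhdfs_list_url namenode.example.com /nyt)"
```

Like the browser view, a LISTSTATUS request only reads metadata, so it is a safe way to inspect the filesystem from a script.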

Hadoop

Monday, November 24, 2014

Working with the Hadoop Distributed File System
