Friday, November 21, 2014

Cloning for Clusters

After completing SSH and Base Node configuration, I recommend cloning your virtual machine.

One of the clones will be designated the Master (NameNode) and the rest of the clones will be designated as Slaves (DataNodes). I recommend retaining the original VM as a rollback environment. The advantage to cloning at this point is that each clone will be able to talk to any other clone without further setup.

We have a single VM instance with a public key, private key, and authorized_keys file.  The VM is able to SSH back into itself without requiring a password.

If we clone the VM at this point, we'll get a situation like this:


The clone will automatically be able to SSH into the original VM, and vice versa.  Both the clone and the original have identical public and private keys.

Likewise, if we create an additional clone, we have this:


and so on.
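A quick way to convince yourself of this (a minimal check; the IP address below is just an example, so substitute your clone's actual address) is to SSH from the original VM into one of the clones:

ssh 192.168.1.45
# if the keys were cloned correctly, no password prompt appears; type exit to come back
exit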



Update /etc/hosts (again)


If you've been following along in this series, you already updated this file with the master host.

The hosts file is a computer file used by an operating system to map hostnames to IP addresses. The hosts file is a plain text file, and is conventionally named hosts.

It will be helpful, for the purpose of subsequent configuration, if we modify the hosts file on this VM instance to include the name of each slave and associated IP address.

I currently have two slaves in my cluster, and my hosts file now looks like this:
127.0.0.1     localhost
127.0.1.1     CVB
192.168.1.43     master
192.168.1.45     slave1
192.168.1.46     slave2
# The following lines are desirable for IPv6 capable hosts
::1   ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

The slave1 and slave2 lines above were added just now.  The line above them (for the master node) was added when I configured the base VM node.
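If you're not sure which IP address each clone received, a quick check (on a typical Ubuntu VM; run it on each node and record the result) is:

hostname -I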

I prefer to create this file on one node with all the relevant information, and then copy it to each clone that will form part of the cluster.  While not strictly necessary, I find it convenient when each node in the cluster has equivalent information.

Here's a convenient way of copying your hosts file to Slave nodes:
sudo scp /etc/hosts craigtrim@slave1:~
ssh slave1
sudo mv ~/hosts /etc/hosts

You'll need to do this for each designated Slave in the cluster.
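If you have more than a couple of slaves, a small loop saves some typing (a sketch, assuming the craigtrim user and the slave1/slave2 host names from my hosts file above):

for node in slave1 slave2; do
  scp /etc/hosts craigtrim@$node:~
  ssh -t craigtrim@$node "sudo mv ~/hosts /etc/hosts"
done

The -t flag forces a TTY so the remote sudo command can prompt for a password.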


The Slaves

The conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons (DataNodes, and TaskTrackers or NodeManagers, depending on your Hadoop version) will run. Just like the masters conf file, list one IP address (or host name) per line.

sudo gedit $HADOOP_HOME/conf/slaves

My file is this simple:
slave1
slave2
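If you add another clone to the cluster later (say a hypothetical slave3), appending it to this file on the master is all that's needed here:

echo "slave3" | sudo tee -a $HADOOP_HOME/conf/slaves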

How do I know if the slaves file was used correctly?
When you run the start-dfs.sh script, you'll see output like this

 master: STARTUP_MSG:  build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 2e18d179e4a8065b6a9f29cf2de9451891265cce; compiled by 'jenkins' on 2014-09-05T23:11Z  
 master: STARTUP_MSG:  java = 1.8.0_25  
 master: ************************************************************/  
 192.168.1.35: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.39: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.38: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.41: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.37: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.40: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.36: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
 192.168.1.34: starting datanode, logging to /usr/lib/apache/hadoop/2.5.1/logs/hadoop-craigtrim-datanode-CVB.out  
Note each line starting with an IP address.  Seeing this in the log output confirms that the conf/slaves file is being read correctly and at the right time.
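For reference, the output above comes from starting HDFS on the master. A quick way to double-check that the daemons actually came up (a sketch, assuming a stock Hadoop 2.x layout where the scripts live under $HADOOP_HOME/sbin):

$HADOOP_HOME/sbin/start-dfs.sh
# on the master, jps should list NameNode (and SecondaryNameNode)
jps
# on each slave, jps should list DataNode
ssh slave1 jps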


Best Practices


When initially creating the cluster, I recommend creating a single master and slave node. Once you configure these successfully, and are able to start the cluster, then you can create multiple clones of the slave node. Create as many clones as you like. The only change you'll need to make to add them to the cluster is to update the slaves file in the master machine's configuration.


Copying the Keys (Optional)


If you've already cloned your VMs as described above, you can skip this step.

If you're not into cloning, you'll need to move the public key from the Master (NameNode) to each Slave (DataNode). If you're not sure whether you need this step or not, try to SSH into each designated Slave (DataNode) from your designated Master (NameNode).

If you are able to SSH without being prompted for a password, you can safely skip this step.

We'll want to copy the public key at this location:
~/.ssh/id_dsa.pub
to each slave node in the cluster.

We can do this using the scp command like this:
scp ~/.ssh/id_dsa.pub craigtrim@slave1:~/.ssh
On the slave node, you can go to the .ssh directory and list the contents; make sure this public key exists.
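Note that for passwordless login the key must also end up in the slave's ~/.ssh/authorized_keys file; copying it into ~/.ssh by itself isn't enough. A minimal sketch, run on the slave against the key you just copied over:

cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Alternatively, do the copy and the append in one step from the master with:

ssh-copy-id -i ~/.ssh/id_dsa.pub craigtrim@slave1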

If everything has been set up correctly, you should be able to create a secure channel of communication from your master node to your slave node, like this:
ssh slave1
If the configuration has been successful, you’ll be able to set up an SSH channel without having to enter a password.

This is because the public key that belongs to the master has been copied to the slave. Note that it will not be possible to SSH from the slave to the master without a password, since we haven’t performed this operation in reverse.
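If you do want passwordless SSH in that direction as well, the same approach works in reverse (a sketch, run from the slave, assuming the slave has its own key pair; generate one with ssh-keygen -t dsa if it doesn't):

ssh-copy-id -i ~/.ssh/id_dsa.pub craigtrim@master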
