Updated 1/30/18
Problem scenario
You want to deploy open-source Hadoop as a multi-node cluster to some Linux servers (e.g., two or more servers as opposed to a single-server deployment). You want to use AWS, Azure and/or Google Cloud Platform Linux servers, possibly in a cross-cloud configuration (using two or three public cloud services). What do you do to install and configure Hadoop to leverage two or more computers in a potentially hybrid cloud environment?
Solution
These directions work whether the servers are all in AWS, Azure, or Google Cloud Platform, or spread across a mix of them! (A hybrid cloud deployment should not require much extra configuration. If you are deploying a hybrid cloud Hadoop multi-node cluster, we recommend using nmap and methodically checking ports very early in the process; a sample check is shown below.) These directions work regardless of which type of Linux you are using (e.g., CentOS/RedHat/Fedora, Debian/Ubuntu, or SUSE). The whole process can take less than 15 minutes if there are not too many nodes and the nodes are not too big.
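As a minimal sketch of that early port check, the commands below use a hypothetical NameNode hostname (coolnamenodeserver) and assume the commonly used HDFS port 9000; substitute your own hostnames and whatever ports you end up configuring in core-site.xml.

# Run from a prospective DataNode to confirm the NameNode is reachable across clouds
nmap -Pn -p 22,9000 coolnamenodeserver
# -Pn skips the ping probe (useful for Azure, which blocks ICMP by default)
# Port 22 must be open for SSH; 9000 is an assumed HDFS port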
You need just two Linux servers: one for the master server (referred to as the NameNode) and one for the slave node (referred to as the DataNode). These directions were designed for open-source Hadoop. The relevant security groups (for AWS, or NSGs for Azure, or firewall rules for Google Cloud Platform) need to allow connectivity between the servers, including across clouds if applicable. If you are going to deploy it using two clouds, see the "Tips for deploying a multi-node cluster" at the bottom.
1. Install Hadoop on at least two different Linux servers. If you need directions, there is an article for each of the three major distributions of Linux (CentOS/RedHat/Fedora, Debian/Ubuntu and SUSE). For the directions below you will need a group called 'hadoop' and a user called 'hduser' on every NameNode and DataNode. (These names could be different depending on your specific needs. If you follow the hyperlinked article above, you will create this group and user so that the rest of these directions will work.) Remember your hduser account passwords. Mentally designate one Linux server as the NameNode (previously referred to as the "master"). The other server(s) will be the DataNode(s) (previously known as the slave(s)).
2.a. Configure the /etc/hosts file on each NameNode (master) server and DataNode (slave) server. The NameNode server's /etc/hosts file must have an entry for each DataNode, and each DataNode's /etc/hosts file must have an entry for the NameNode server. The IP address you use depends on whether the server being added to the /etc/hosts file is in the same cloud. If the NameNode and DataNode are in the same cloud, use the internal IP address of the server whose entry you are adding. For heterogeneous clouds (e.g., Azure hosting the NameNode and AWS hosting a DataNode), make sure you use the external IP address whenever the /etc/hosts entry refers to a server in a different cloud; see the sample entries below.
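The sample /etc/hosts entries below are only an illustration; the hostnames (coolnamenodeserver, coolnameofserver) and IP addresses are hypothetical placeholders for your own values.

# /etc/hosts on a DataNode that is in a different cloud than the NameNode
203.0.113.10    coolnamenodeserver    # external IP of the NameNode (other cloud)
# /etc/hosts on the NameNode when the DataNode is in the same cloud/VPC
10.0.1.25       coolnameofserver      # internal IP of the DataNode (same cloud)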
2.b. Ensure that the AWS Security Group, Azure NSG, or Google Cloud Platform firewall rule does not restrict traffic between the NameNode and DataNode(s). For an AWS Security Group, the IP addresses in the rules will be internal if the servers are all in the same AWS VPC and governed by the same Security Group. If the NameNode is in a different public cloud than a DataNode, the rules will need the external IP addresses so the servers can communicate; one example is sketched below.
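As one hedged example (you can accomplish the same thing in the AWS console), the AWS CLI command below opens a single TCP port to one external address; the security group ID, port, and IP address are placeholders for your own values.

# Allow the other cloud's NameNode external IP to reach an assumed HDFS port on the AWS DataNodes
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 9000 \
  --cidr 203.0.113.10/32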
3. The NameNode must be able to (1) SSH to itself as hduser without a password and (2) SSH to the DataNode(s) as hduser without a password. For the second requirement, run these commands from the NameNode server:
su hduser
cat /home/hduser/.ssh/id_rsa.pub
# If the file above is not found, you may need to run: ssh-keygen -t rsa -P ""
# (press Enter at the next prompt to accept the default location, then rerun the cat command)
4. Copy the output of the last command above and, on each DataNode server, append it to /home/hduser/.ssh/authorized_keys. After you save the file, ensure that /home/hduser/.ssh/authorized_keys has rw------- permissions; use chmod 600 authorized_keys if necessary. A sample sequence of commands is shown below.
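Here is one hedged way to do step 4 on a DataNode; it assumes the hduser paths used above and that you have the NameNode's public key ready to paste. The hostname in the final command is a placeholder.

# Run on each DataNode as hduser; paste the public key when prompted, then press Ctrl+D
mkdir -p /home/hduser/.ssh
cat >> /home/hduser/.ssh/authorized_keys
chmod 700 /home/hduser/.ssh
chmod 600 /home/hduser/.ssh/authorized_keys
# Back on the NameNode, verify that passwordless SSH now works:
ssh hduser@coolnameofserver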
5. On the Hadoop NameNode server, modify this file /usr/local/hadoop/etc/hadoop/workers
Add a new line with the domain name of each DataNode server. It will look like this:
coolnameofserver
Remember to save the file. (A sample command for appending the entry is shown below.)
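If you prefer the command line, the sketch below appends a hypothetical DataNode hostname to the workers file; adjust the path if your Hadoop installation lives elsewhere, and prefix with sudo if your user cannot write to the file.

# Run on the NameNode; repeat for each DataNode hostname
echo "coolnameofserver" >> /usr/local/hadoop/etc/hadoop/workers
cat /usr/local/hadoop/etc/hadoop/workers    # verify the entries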
6. Modify the core-site.xml file (e.g., often found at /usr/local/hadoop/etc/hadoop/core-site.xml). Find the <property> tag and the <value> tag inside it. Change "localhost" to the domain name of the NameNode server itself; the result will look something like the snippet below.
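As a sketch of what the edited property might look like, this assumes a recent Hadoop release where the property is named fs.defaultFS (older releases used fs.default.name) and keeps the commonly used port 9000; your existing file may differ, so only change the hostname portion. The hostname shown is a placeholder.

<property>
<name>fs.defaultFS</name>
<value>hdfs://coolnamenodeserver:9000</value>
</property>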
7. Modify /usr/local/hadoop/etc/hadoop/mapred-site.xml
In between the <configuration> and </configuration> tags, place this text with "NameNode" being replaced by the domain name of your NameNode server:
<property>
<name>mapred.job.tracker</name>
<value>NameNode:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
8. Check that the hdfs-site.xml file (e.g., /usr/local/hadoop/etc/hadoop/hdfs-site.xml) has a correct dfs.replication element. This value is the number of copies HDFS keeps of each block, and it is computed here by adding the number of NameNode and DataNode servers that will be in the Hadoop cluster once the new node has been added.
Find the hdfs-site.xml file and, inside it, locate the "dfs.replication" element; underneath it will be a <value> number. Change it to 2 (with no quotes) if you are configuring one NameNode and one DataNode. The number should equal the total count of NameNode and DataNode servers in the Hadoop cluster you are configuring. (Keep in mind that blocks are stored only on DataNodes, so a value larger than the number of DataNodes will leave blocks reported as under-replicated.)
If you find no such element, copy this text inside the <configuration> and </configuration> tags (but replace the "2" with the replication value you determined above):
<property>
<name>dfs.replication</name>
<value>2</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>
9. Repeat steps 6 through 8 for each remaining DataNode server.
10. Start the multi-node cluster. Log into the NameNode as hduser (e.g., su hduser). Then run this command (a quick way to verify the cluster is sketched after it):
bash /usr/local/hadoop/sbin/start-all.sh
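To check that the cluster actually came up, one simple verification is to list the running Java daemons with jps and ask HDFS for a cluster report. The exact daemon names vary by Hadoop version, but you should see a NameNode process on the master and a DataNode process on each worker.

# On the NameNode (as hduser)
jps                                           # expect a NameNode process (and usually SecondaryNameNode)
/usr/local/hadoop/bin/hdfs dfsadmin -report   # lists the live DataNodes
# On each DataNode (as hduser)
jps                                           # expect a DataNode process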
Tips for deploying a multi-node cluster of Hadoop with both AWS and Azure
By default you will not be able to ping Azure instances because ICMP traffic is blocked by default in Azure. This does not mean SSH will not work.
The DataNodes can be in AWS as long as the AWS instances are governed by a Security Group that allows inbound connections originating from the Azure Hadoop NameNode. To find the NameNode's external IP address, log into the NameNode and run "curl http://icanhazip.com" (with no quotes). The other requirement is that the Network Security Group in Azure for the Hadoop NameNode must allow communication from the AWS DataNodes.
Tips for cluster deployments to hybrid clouds involving both AWS and Google Cloud Platform
A DataNode's DataNode service (as shown with the jps command) may be running and controlled by the NameNode, yet the node may still not join the cluster unless the NameNode accepts connectivity over the port configured in the core-site.xml file; a quick check is sketched below.
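A minimal sketch of that connectivity check, assuming the NameNode hostname and port below are placeholders for whatever you put in core-site.xml:

# Run from a DataNode in the other cloud; the hostname and port 9000 are assumptions
nc -zv coolnamenodeserver 9000    # "succeeded" or "open" means the NameNode port is reachable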
You may want to see this posting for troubleshooting tips when adding a node to the cluster.