Problem scenario
You have a multi-node Hadoop cluster. You want to add a new DataNode. What do you do?
Solution
1. a) Log into the server that will be the new DataNode. Complete steps 1b through 1f on this server before moving on to step 2.
b) Install Hadoop on the new DataNode. If you do not know how, see this posting.
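If you only need a rough outline, a tarball installation looks something like the following (the version number is hypothetical; substitute the release you actually downloaded):
tar -xzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop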
c) The DataNode server must be configured to resolve the NameNode's hostname. If you do not have a DNS server on your network that resolves hostnames for this server, add an entry for the NameNode server to the new DataNode's /etc/hosts file.
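For example, an /etc/hosts entry might look like this (the IP address and hostnames are hypothetical; substitute your own):
10.0.0.10   namenode.example.com   namenode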
d)i. You need to modify three .xml files on the DataNode: core-site.xml, mapred-site.xml, and hdfs-site.xml. To make these modifications, see steps 6 through 8 in this set of instructions.
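As an illustration of one such modification, the core-site.xml property that points the DataNode at the NameNode typically looks like this (the hostname and port are hypothetical; use the values your cluster uses):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>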
d)ii. Look at the port used in core-site.xml. Verify that this port is open from the DataNode to the NameNode. If you do not know how to test and the DataNode is running Windows, see this posting. If you do not know how to test and the DataNode is running Linux, install nmap and use this command: nmap -p xxxxx FQDNofNameNode # where xxxxx is the port number and FQDNofNameNode is the FQDN of the NameNode.
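If installing nmap is not an option, Bash's built-in /dev/tcp pseudo-device can perform a rough equivalent of the same test (the hostname and port 9000 here are hypothetical):
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/namenode.example.com/9000' && echo open || echo closed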
e) Create an hduser account (you will need it for step 2e below) and a hadoop group. Ensure that hduser can ssh into the local machine. If you do not know how to do all of these things, see steps 2 and 3 in this posting.
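A minimal sketch of those steps on a Debian/Ubuntu system (an assumed distribution; adapt the user and group commands for other distributions):
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
su - hduser
mkdir -p ~/.ssh
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh hduser@localhost # verify the local ssh login works, then type exit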
f) Run these two commands:
cd /app
sudo chown hduser:hadoop hadoop
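If /app/hadoop already contains subdirectories (for example, a tmp directory holding HDFS data), you may want the ownership change to be recursive; an assumed variant:
sudo chown -R hduser:hadoop /app/hadoop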
2. a) Log into the NameNode server. Complete steps 2b through 2g on this server.
b) The NameNode server must be configured to resolve the DataNode's hostname. If you do not have a DNS server on your network that resolves hostnames for the NameNode, add an entry for the new DataNode server to the NameNode's /etc/hosts file (mirroring the example in step 1c, but with the DataNode's IP address and hostname).
c) On the NameNode server, append the new DataNode's hostname to the file /usr/local/hadoop/etc/hadoop/workers
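One way to append it from the command line (the hostname is hypothetical):
echo "datanode2.example.com" | sudo tee -a /usr/local/hadoop/etc/hadoop/workers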
d) Modify the hdfs-site.xml file: vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml # Update the dfs.replication integer value to be the sum of the servers, including the NameNode and all DataNodes (including the one you just added).
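For example, if the cluster now consists of the NameNode plus two DataNodes, the property would look like this (the value 3 is illustrative; use your own node count):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>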
e) Configure passwordless SSH authentication from the NameNode's hduser account to the server that will be the DataNode. See this posting if you do not know how. Then test that passwordless SSH works (step 2f shows how).
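A common way to set this up, run as hduser on the NameNode (the hostname is hypothetical):
ssh-copy-id hduser@datanode2.example.com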
f) From the NameNode, run ssh hduser@DataNodeHostName # where DataNodeHostName is the hostname of the DataNode. This verifies that passwordless SSH will work.
g) As the hduser (or the designated Linux user for running such commands), run these commands. Be warned that reformatting the NameNode deletes all the data in your cluster:
hdfs namenode -format
# When prompted with this:
# "Re-format filesystem in Storage Directory /app/hadoop/tmp/dfs/name ? (Y or N)"
# choose "Y" (with no quotes) and press enter
bash /usr/local/hadoop/sbin/start-dfs.sh
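Once HDFS is back up, you can verify that the new DataNode registered with the NameNode; run this as the hduser:
hdfs dfsadmin -report # the new DataNode should be listed among the live datanodes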