How Do You Troubleshoot a Problem with Adding a DataNode to a Hadoop Cluster?

Problem scenario
You are trying to add a DataNode to an existing Hadoop cluster, but the DataNode will not join.  What do you do to troubleshoot the process?

Possible solutions
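1.  Hadoop 3.x uses a "workers" file, not the "slaves" file that Hadoop 2.x uses.  Confirm that the hostname of the new DataNode is listed in whichever file your version reads.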

2.  Do you have a DNS solution in place for your DataNode and NameNode to resolve each other?  If you do not have a DNS server, does the /etc/hosts file of the DataNode server have an entry for the NameNode?  Can the NameNode server resolve the hostname of each DataNode server?
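The name-resolution checks above can be sketched in Python.  This is a minimal sketch, not a definitive tool: the function names and the hostnames "namenode" and "datanode1" are hypothetical, and only the standard library is used.

```python
import socket


def parse_hosts(text):
    """Map each hostname or alias in hosts-file text to its IP address."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            entries.setdefault(name, ip)  # keep the first entry for a name
    return entries


def resolves(hostname):
    """Return the resolved IPv4 address, or None if the lookup fails."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None
```

On the DataNode you might call resolves("namenode"), and on the NameNode call resolves() once per DataNode hostname.  To inspect a hosts file directly, pass the contents of /etc/hosts to parse_hosts() and look up the hostname you expect.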

3.  Can the "hduser" (or whichever user starts the Hadoop cluster) SSH into the DataNode from the NameNode without being prompted for a password?  This is generally necessary for a multi-node cluster.  If you need directions for setting up passwordless SSH, see this article.
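One way to test passwordless SSH without getting stuck at a password prompt is to run ssh in batch mode.  The sketch below assumes the OpenSSH client is installed on the NameNode; the function names and the hostname "datanode1" are hypothetical.

```python
import subprocess


def ssh_check_command(user, host, timeout=5):
    """Build an ssh command that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never prompt for a password
        "-o", f"ConnectTimeout={timeout}",   # give up quickly on dead hosts
        f"{user}@{host}",
        "true",                              # harmless remote command
    ]


def passwordless_ssh_works(user, host, timeout=5):
    """Return True if key-based login succeeds without any prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(user, host, timeout),
            capture_output=True,
            timeout=timeout + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Run it on the NameNode as the user that starts the cluster, for example passwordless_ssh_works("hduser", "datanode1").  A result of False does not say why the login failed; running the same ssh command by hand with -v usually shows the reason.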

4.  Warning: This will delete all the data from your cluster. If you are having a problem adding a DataNode to a cluster, you may want to try deleting all directories and files in /app/hadoop/tmp/ (or whichever directory hadoop.tmp.dir points to) on the individual DataNodes that are not being added when you run the start script on the NameNode.  Stale data in this directory can carry a cluster ID that no longer matches the NameNode, which prevents the DataNode from registering.  After you delete the data in /app/hadoop/tmp/ on the DataNodes, you may want to run hdfs namenode -format on the NameNode.  Formatting deletes all the data in your Hadoop cluster; however, it can help with troubleshooting.  You can then run the start script again to see if the DataNodes will be added to the cluster.

5.  When you run the jps command on the NameNode and DataNode, do you see the expected Hadoop daemon running on each (for example, "NameNode" on the NameNode server and "DataNode" on the DataNode server)?  You may need to shut down all node services first and restart them.  To install jps, see this article.
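If you check several servers, it can help to parse the jps output rather than read it by eye.  The sketch below assumes the usual jps format of one "PID name" pair per line; the function name is hypothetical and the sample output in the usage note is illustrative only.

```python
def running_daemons(jps_output):
    """Return the set of Java process names listed in jps output."""
    daemons = set()
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        # Expect "<pid> <name>"; skip lines with no name or no numeric PID.
        if len(parts) == 2 and parts[0].isdigit():
            daemons.add(parts[1].strip())
    daemons.discard("Jps")  # the jps tool lists itself
    return daemons
```

For example, running_daemons("2101 NameNode\n2999 Jps\n") gives {"NameNode"}.  On a real server you could feed it the text from subprocess.run(["jps"], capture_output=True, text=True).stdout and then test whether "DataNode" is in the result.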

6.  Check the /usr/local/hadoop/etc/hadoop/core-site.xml files on the DataNodes.  Do they have "localhost" or the hostname of the NameNode?  They need the hostname of the NameNode.  If you forgot to modify these files, the respective DataNode will not join the cluster as normal.  The "jps" command will still show the "DataNode" service starting and stopping as you control it via the NameNode, but the DataNode will not make its storage capacity available.
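The core-site.xml check can also be scripted.  The sketch below assumes the NameNode address is stored in the fs.defaultFS property (or the older, deprecated fs.default.name); the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def default_fs(core_site_xml):
    """Return the default filesystem URI from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml)
    values = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        values[name] = prop.findtext("value", default="").strip()
    # fs.default.name is the deprecated name for the same setting.
    return values.get("fs.defaultFS") or values.get("fs.default.name")


def points_to_localhost(core_site_xml):
    """Return True if the default filesystem URI names the local machine."""
    uri = default_fs(core_site_xml)
    if uri is None:
        return False
    return urlparse(uri).hostname in ("localhost", "127.0.0.1")
```

On a DataNode you could read the file in binary mode, for example open("/usr/local/hadoop/etc/hadoop/core-site.xml", "rb").read(), and pass the result to points_to_localhost().  A result of True on a DataNode suggests the file was never changed to name the NameNode.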

7.  If the DataNode service starts and stops when you run the start and stop scripts on the NameNode, but you do not see the DataNode in the output of the "hdfs dfsadmin -report" command, the problem could be a firewall rule protecting the NameNode that blocks port 54310 (or whichever port your core-site.xml specifies).

You may want to use nmap on the DataNode to determine whether the NameNode port defined in /usr/local/hadoop/etc/hadoop/core-site.xml (the path may be different) is reachable.  If this port is blocked, you can still start and stop the DataNode service from the NameNode, but the DataNode will not actually work in the cluster.
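If nmap is not installed on the DataNode, a plain TCP connection attempt gives a similar answer.  This is a minimal sketch using only the standard library; the function name and the hostname "namenode" are hypothetical, and 54310 is simply the port mentioned above.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or unresolved
        return False
```

Run it on the DataNode, for example port_open("namenode", 54310).  A result of False only says the connection failed; it does not distinguish a firewall from a NameNode that is not listening on that port.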

8.  Are you starting the cluster with a single script run on the NameNode?  Some Hadoop administrators start services with a combination of commands on both the NameNode and the DataNode.  It may be easier to use one script from the NameNode.  This approach is not advisable in production environments; it is just for troubleshooting.

9. Check whether the dfs.hosts.exclude property in hdfs-site.xml points to an exclude file. Is the server that will not join the cluster listed in that file?

10. Do you have an include file (for example, dfs.include, referenced by the dfs.hosts property)? If so, could you try adding the server to that file and restarting the services on the NameNode?
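To check the include and exclude files quickly, you can test whether a hostname appears in each.  The sketch below assumes the files list hostnames separated by whitespace or newlines, with "#" starting a comment; the function names and hostnames are hypothetical, and it reports membership only without deciding whether Hadoop would admit the node.

```python
def listed_hosts(file_text):
    """Return the set of hostnames found in include/exclude file text."""
    hosts = set()
    for line in file_text.splitlines():
        line = line.split("#", 1)[0]  # ignore trailing comments
        hosts.update(line.split())
    return hosts


def host_file_status(host, include_text="", exclude_text=""):
    """Report whether host appears in the include and exclude lists."""
    return {
        "in_include": host in listed_hosts(include_text),
        "in_exclude": host in listed_hosts(exclude_text),
    }
```

For example, host_file_status("datanode1", include_text="datanode1\ndatanode2\n", exclude_text="") gives {"in_include": True, "in_exclude": False}.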

11.  Warning: This will delete all the data from your cluster. You may want to read these step-by-step directions to start over with your deployment (i.e., you may want to reinstall and reconfigure from the beginning).

12. Optional reading on Apache's website.
