How Do You See the Storage Usage on a Datanode in an HDFS Cluster?

Problem scenario
You want to see which datanodes are active in a given HDFS cluster.  You also want statistics about their storage usage.  If you regularly add data to HDFS, you want to stay below 70% utilization.  If you are not regularly adding new files but still want HDFS to perform well, you want to stay under 80%.  How do you find the storage usage of the datanodes that support your HDFS cluster?
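
As a rough sketch of one way to check this: the "hdfs dfsadmin -report" command prints a block of lines per datanode, including "Hostname:" and "DFS Used%:" lines. The Python below parses text of that shape and applies the 70% and 80% guidelines from the scenario. The sample report, the hostnames, and the function names are hypothetical, and the exact report layout may differ between Hadoop versions.

```python
import re

# Hypothetical excerpt shaped like "hdfs dfsadmin -report" output.
SAMPLE_REPORT = """\
Live datanodes (2):

Name: 10.0.0.11:50010 (datanode1)
Hostname: datanode1
DFS Used%: 65.20%

Name: 10.0.0.12:50010 (datanode2)
Hostname: datanode2
DFS Used%: 83.75%
"""


def datanode_usage(report_text):
    """Return {hostname: percent_used} for each datanode block in the report."""
    usage = {}
    hostname = None
    for line in report_text.splitlines():
        line = line.strip()
        if line.startswith("Hostname:"):
            hostname = line.split(":", 1)[1].strip()
        elif line.startswith("DFS Used%:") and hostname is not None:
            # The cluster-wide summary line is skipped because no hostname is set yet.
            match = re.search(r"([\d.]+)%", line)
            if match:
                usage[hostname] = float(match.group(1))
            hostname = None
    return usage


def classify(percent_used, adding_data_regularly=True):
    """Apply the 70% (regular writes) or 80% (mostly static) guideline."""
    limit = 70.0 if adding_data_regularly else 80.0
    return "ok" if percent_used < limit else "over"


for host, percent in datanode_usage(SAMPLE_REPORT).items():
    print(host, percent, classify(percent))
```

On a real cluster, the report text could come from subprocess.run(["hdfs", "dfsadmin", "-report"], capture_output=True, text=True).stdout instead of the sample string.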

How Do You Troubleshoot the Hadoop Error “Could not find or load main class jar”?

Problem scenario
You are trying to run a Hadoop operation.  You issue your “hadoop jar” or “hdfs jar” command.  You get one of these errors:

“Error: Could not find or load main class jar”

or this error

“Error: Could not find or load main class hadoop-streaming-2.8.1.jar”

What is wrong?

Solution
1.  Use “hadoop…” instead of “hdfs”.

2. 

What is MapReduce in Hadoop?

Problem scenario
You want to learn more about MapReduce.  You want to learn what it does so you can grasp Hadoop more thoroughly.  What is MapReduce?

Solution
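MapReduce is Hadoop’s framework for distributed data processing.  With multi-node deployments of Hadoop, HDFS stores the data in blocks on different datanodes (servers that are controlled by a master server), and MapReduce splits a job into tasks that run in parallel on those nodes, close to the data.  The results of those tasks are then combined into the final output.  Conceptually there are two main components to MapReduce: the mapper and the reducer.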
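
To make the mapper and reducer roles concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: the record format ("year,temperature"), the function names, and the in-memory shuffle step are all assumptions made for illustration. In a real cluster, the framework performs the grouping between the map and reduce phases across many machines.

```python
from collections import defaultdict


def mapper(record):
    """Turn one input record into (key, value) pairs."""
    year, temperature = record.split(",")
    yield year, int(temperature)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(year, temperatures):
    """Collapse all values for one key into a single result."""
    return year, max(temperatures)


def run_job(records):
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


print(run_job(["1950,22", "1950,31", "1951,18"]))
```

The mapper only ever sees one record and the reducer only ever sees one key, which is what lets the work be split across nodes.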

How Do You Solve the Problem of Being Prompted for Credentials after You Run the start-dfs.sh Script?

Problem scenario
You try to start Hadoop’s dfs in a multi-node deployment.  All the Hadoop nodes are running Linux.  You run this:

bash /usr/local/hadoop/sbin/start-dfs.sh

You see this:

Starting namenodes on [hadoopmaster]
hadoopmaster: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-hadoopmaster.out
root@hadoopdatanode’s password: localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-hadoopmaster.out
hadoopmaster: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-hadoopmaster.out
localhost: ulimit -a for user root
localhost: core file size         

How Do You Set up a Multi-Node Cluster of Hadoop with Linux Servers?

Updated 1/30/18

Problem scenario
You want to deploy open-source Hadoop as a multi-node cluster to some Linux servers (e.g., two or more servers as opposed to a single-server deployment).  You want to use AWS, Azure and/or Google Cloud Platform Linux servers, possibly in a cross-cloud configuration (using two or three public cloud services).  What do you do to install and configure Hadoop to leverage two or more computers in a potentially hybrid cloud environment?

How Do You Run a MapReduce Job (with Python)?

Problem scenario
You have studied what a MapReduce job is conceptually.  But you want to try it out.  You have a Linux server with the Hadoop namenode (aka master node) installed on it.  How do you run a MapReduce job [to understand what it is all about]?

Solution
This is just an example.  We tried to make this as simple as possible.  This assumes that Python has been installed on the Hadoop namenode (aka master node) running on Linux.
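
As a hedged illustration of what Python code for such a job can look like: Hadoop Streaming runs a mapper and a reducer as ordinary programs that read lines on standard input and write tab-separated "key<TAB>value" lines on standard output, with the reducer receiving its input sorted by key. The sketch below is a word count written in that style. The function names are hypothetical, and the sort step is simulated locally with sorted() rather than by Hadoop.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word; input must be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort -> reduce on two lines of text.
mapped = list(map_lines(["hadoop runs jobs\n", "hadoop stores data\n"]))
for result in reduce_lines(sorted(mapped)):
    print(result)
```

To use these as streaming scripts, each function would be placed in its own file that passes sys.stdin to it and prints every yielded line.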

How Do You Get All of Your Live Nodes to Appear in The “hdfs dfsadmin -report” Results?

Problem scenario
You use the “hdfs dfsadmin -report” command on your Hadoop NameNode server.  You see only a small amount of “DFS Remaining” when you expected many more GB.  You also see that one datanode is not listed under the “Live datanodes” section.  Not all of your live datanodes are appearing in HDFS.

Your /usr/local/hadoop/etc/hadoop/slaves file has the DNS name of a server you expect to be a datanode. 
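
A quick diagnostic, sketched in Python under some assumptions, is to compare the hosts named in the slaves file with the hostnames listed under “Live datanodes” in the report. The sample file contents, sample report, and function names below are hypothetical. The comparison is a plain string match, so a short name in one place and a fully qualified name in the other would show up as a mismatch.

```python
def expected_datanodes(slaves_text):
    """Hosts listed in the slaves file, ignoring blank lines and # comments."""
    hosts = set()
    for line in slaves_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.add(line)
    return hosts


def live_datanodes(report_text):
    """Hostnames that appear in the 'Live datanodes' section of the report."""
    live = set()
    in_live_section = False
    for line in report_text.splitlines():
        line = line.strip()
        if line.startswith("Live datanodes"):
            in_live_section = True
        elif line.startswith("Dead datanodes"):
            in_live_section = False
        elif in_live_section and line.startswith("Hostname:"):
            live.add(line.split(":", 1)[1].strip())
    return live


def missing_datanodes(slaves_text, report_text):
    """Hosts that are expected to be datanodes but are not reported as live."""
    return sorted(expected_datanodes(slaves_text) - live_datanodes(report_text))


# Hypothetical inputs for illustration only.
SLAVES = "hadoopmaster\nhadoopdatanode\n"
REPORT = "Live datanodes (1):\n\nName: 10.0.0.5:50010 (hadoopmaster)\nHostname: hadoopmaster\n"
print(missing_datanodes(SLAVES, REPORT))
```

Any host this prints is a candidate for closer inspection, for example by reading the datanode log on that server.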

How Do You Know If Yarn or Hadoop’s NameNode Services Are Running?

Problem scenario
How do you know if YARN or Hadoop’s NameNode services are running or not?

Solution
Use the “sudo jps” command.  Some Linux users may not have sufficient permissions to install it.  
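
If you want to check this from a script, the following Python sketch parses jps-style output, where each line is a process ID followed by a Java class name, and reports whether the usual daemon names are present. The sample output and function name are hypothetical. NameNode is the HDFS master daemon, and ResourceManager and NodeManager are the YARN daemons.

```python
def running_daemons(jps_output):
    """Return the set of process names found in jps-style output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split(None, 1)
        # Keep only lines of the form "<pid> <name>".
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1].strip())
    return names


# Hypothetical jps output for illustration only.
SAMPLE_JPS = """\
4821 NameNode
5177 ResourceManager
6002 Jps
"""

found = running_daemons(SAMPLE_JPS)
status = {name: name in found for name in ("NameNode", "ResourceManager", "NodeManager")}
print(status)
```

On a real server, the text could come from subprocess.run(["sudo", "jps"], capture_output=True, text=True).stdout.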

If the command is not found, go to the option below for your distribution of Linux:

If you are running a RedHat derivative (e.g.,