How Do You Troubleshoot an Empty Multi-node Hadoop Cluster?

Problem scenario
One or more of the following is happening:
1) There are 0 DataNodes in your Hadoop cluster according to an error message
2) The configured capacity is 0 B (as shown by the “hdfs dfsadmin -report” command).
3) There is one fewer DataNode in your Hadoop cluster than you expect.
4) You run “hdfs dfsadmin -report | grep Hostname” and a node is missing from the output, even though its DataNode service (as seen with the jps command) starts and stops when you run the start-dfs.sh and stop-dfs.sh scripts from the NameNode.
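
Symptoms 3 and 4 can be checked mechanically. The following is a minimal sketch, assuming the report prints one "Hostname:" line per DataNode (which is what the grep above relies on). The function name missing_datanodes and the host names in the usage line are hypothetical.

```shell
# Print every expected hostname that is absent from an
# "hdfs dfsadmin -report" listing read from standard input.
# missing_datanodes is a hypothetical helper name, not a Hadoop command.
missing_datanodes() {
    # Collect the second field of each "Hostname: <name>" line.
    reported=$(awk '$1 == "Hostname:" {print $2}')
    for host in "$@"; do
        # -x matches the whole line, -F treats the name as a literal string.
        printf '%s\n' "$reported" | grep -qxF -- "$host" || echo "$host"
    done
}
```

A possible invocation, with placeholder host names: hdfs dfsadmin -report | missing_datanodes datanode1 datanode2. Any name it prints is a DataNode the NameNode does not currently list.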

How Do You Add a New Node to a Hadoop Cluster?

Problem scenario
You have a multi-node Hadoop cluster.  You want to add a new DataNode.  What do you do?

Solution
1. a)  Log into the server that will be the new DataNode.  Perform the following substeps on that server until you reach step 2.

b)  Install Hadoop on the new DataNode.  If you do not know how, see this posting.

c) 

 » Read more..

What Are the Different Acronym Stacks in I.T.?

Question
What are the different acronym stacks in I.T.?

Answer
There are many open source combinations of technologies that are in wide use.  These acronyms, referred to as “full stacks” or “stacks,” appear in articles and job descriptions.  A full stack is a bundle of software, including an OS, that can create a complete and functional product when properly configured.

 » Read more..

How Do You Install Hadoop with a Script for Any Type of Linux Server?

Updated on 1/22/19

Problem scenario
You want to install open source Hadoop.  You may want a single-node or multi-node deployment with CentOS/RedHat/Fedora, Debian/Ubuntu, and/or SUSE Linux distributions.  You want to have most of it scripted and have the same script work on any variety of Linux.  How do you install Hadoop quickly with a script that works on almost any type of Linux?
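
One common way to make a single script work across these distribution families is to detect which package manager is present before installing anything. The following is a minimal sketch of that idea, not the script from the full posting. The function name first_available_command is hypothetical.

```shell
# Print the first command from the argument list that exists on this host.
# Returns non-zero if none of the candidates is installed.
# first_available_command is a hypothetical helper name.
first_available_command() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}
```

Calling first_available_command apt-get dnf yum zypper would print apt-get on Debian/Ubuntu, dnf or yum on the RedHat family, and zypper on SUSE. A script can then branch on that value to pick the right install command.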

Solution
1. 

 » Read more..

What is Apache Parquet?

Question
What is Apache Parquet?

Answer
Apache Parquet is a columnar data storage format for the Hadoop ecosystem.  Data in a given column is largely uniform (e.g., a long string of characters, a single character, or an integer): each cell repeats the same type and format of data, whereas two cells in the same row may hold very dissimilar types of data.

 » Read more..

How Do You Install Apache Parquet?

Problem scenario
You want to install Apache Parquet on the Hadoop NameNode.  What do you do?

Solution
Prerequisite
This assumes that you have installed Hadoop.  For directions, see this posting.

Procedure
Run these commands:

sudo su -
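# On Debian/Ubuntu the pip package is named python-pip (python3-pip on newer releases, where the command may be pip3).
apt-get -y install python-pip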
pip install thriftpy
pip install snappy
exit

sudo apt-get -y install libsnappy-dev thrift-compiler

curl https://pypi.python.org/packages/74/b5/bc459aab0566fc3cf3397467922c37411ab6e3361bab9e0ca165e1089ce8/parquet-1.2.tar.gz#md5=05aacec0620ac63ecd7dd77bf7fb9fee >

 » Read more..

What Is a “data swamp”?

Question
What is a “data swamp”?

Answer
A data swamp is best defined as a severely degraded data lake.  The term connotes poor governance and negligent management that caused a data lake to gradually lose its value.  A data swamp is a data lake that was once useful but, through negligent use, can no longer be used by even highly talented analytics professionals.

 » Read more..

How Do You Configure Maven to Use an Apache Parquet Plugin?

Problem scenario
You want to use Maven’s Apache Parquet plugin with Hadoop.  How do you use these Apache technologies together?

Solution
1.  Install HDFS.  See this link if you are using Ubuntu.  See this link if you are using a RedHat distribution of Linux.  If you have more than one server and want a multi-node Hadoop cluster, see this link for directions on how to deploy and configure it.

 » Read more..

How Do You Learn More About “Low Latency Programming”?

Problem scenario
You want to learn more about low latency programming for working at a hedge fund company.  Where do you begin?

Solution
An Amazon search for “low latency programming” brings up books that could also be very useful for other high-paying I.T. jobs in big data, such as an Apache Thrift book or a book titled Building a Columnar Database on RAMCloud

 » Read more..

How Do You Troubleshoot the Hadoop NameNode Service Not Running Properly?

Problem scenario
The Hadoop NameNode service won’t start.  You run start-dfs.sh and it starts the SecondaryNameNode and the DataNode.  It won’t start the NameNode.
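
To confirm which HDFS daemons actually came up after start-dfs.sh, you can filter the output of jps. This is a minimal sketch, assuming jps prints one "<pid> <ClassName>" line per Java process. The function name missing_daemons is hypothetical.

```shell
# Print each expected daemon that is absent from "jps" output on stdin.
# missing_daemons is a hypothetical helper name, not a Hadoop command.
missing_daemons() {
    # The second field of each jps line is the class name of the process.
    running=$(awk '{print $2}')
    for daemon in "$@"; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode.
        printf '%s\n' "$running" | grep -qxF -- "$daemon" || echo "$daemon"
    done
}
```

For example, jps | missing_daemons NameNode SecondaryNameNode DataNode would print only NameNode in the scenario described above.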

Solution
Assumption: This solution only works if you are running start-dfs.sh as a user with sudo privileges or as the root user itself.

Procedures
Verify that you can SSH to localhost as root.  Run this command twice without exiting from the first session: 

 » Read more..