What Is a “Data Swamp”?

Question
What is a “data swamp”?

Answer
A data swamp is best defined as a severely degraded data lake.  The term connotes poor governance and negligent management that have caused a data lake to gradually lose its value.  A data swamp is a data lake that was once useful but, through negligent use, can no longer be used effectively even by highly talented analytics professionals.

How Do You Configure Maven to Use an Apache Parquet Plugin?

Problem scenario
You want to use Maven’s Apache Parquet plugin with Hadoop.  How do you use these Apache technologies together?

Solution
1.  Install HDFS.  See this link if you are using Ubuntu.  See this link if you are using a Red Hat distribution of Linux.  If you have more than one server and want a multi-node Hadoop cluster, see this link for directions on how to deploy and configure it.
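Once HDFS is in place, the Maven side is typically handled by declaring a Parquet dependency in your project’s pom.xml.  Below is a minimal sketch, assuming your project uses the standard parquet-hadoop artifact from Maven Central (the version number shown is only an example; check for the current release):

<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-hadoop</artifactId>
  <!-- Example version; verify the latest release on Maven Central. -->
  <version>1.12.3</version>
</dependency>

With this dependency declared, mvn compile pulls in the Parquet classes (e.g., org.apache.parquet.hadoop.ParquetWriter) so your Hadoop code can read and write Parquet files.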

How Do You Learn More About “Low Latency Programming”?

Problem scenario
You want to learn more about low latency programming because you want to work at a hedge fund.  Where do you begin?

Solution
An Amazon search for “low latency programming” brings up books that could be very useful for other high-paying I.T. jobs in big data, such as a book on Apache Thrift or Building a Columnar Database on RAMCloud.

How Do You Troubleshoot the Hadoop NameNode Service Not Running Properly?

Problem scenario
The Hadoop NameNode service won’t start.  You run start-dfs.sh, which starts the SecondaryNameNode and the DataNode, but it does not start the NameNode.

Solution
Assumption: This solution only works if you are running start-dfs.sh as a sudo user or as the root user.

Procedures
Verify that you can ssh as root to localhost.  Run this command twice without exiting from the first session:
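The command in question is presumably the following (an assumption based on the preceding sentence; it presumes root logins over SSH are permitted on the machine):

ssh root@localhost

Running it twice without exiting the first session confirms both that the host key has been accepted and that no password or prompt will interrupt the scripted logins that start-dfs.sh performs.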

How Do You Troubleshoot the Error “WARN conf.Configuration: bad conf file: element not <property>”?

Problem scenario
You are trying to start HDFS.  But you get this error: “WARN conf.Configuration: bad conf file: element not <property>”

What should you do?

Solution
Look at the mapred-site.xml, hdfs-site.xml, and core-site.xml files.  Verify that you have a <property> tag in each file:

grep -i property /path/to/mapred-site.xml
grep -i property /path/to/hdfs-site.xml
grep -i property /path/to/core-site.xml

If there is no <property> element in one of these files, add one, nested inside the file’s <configuration> element.
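For reference, here is a minimal, hedged example of a well-formed core-site.xml (the fs.defaultFS value is illustrative; use the host and port appropriate to your deployment):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- Example value; point this at your own NameNode. -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

The warning indicates that the configuration parser found an XML element where it expected a <property> element inside <configuration>.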

How Do You Troubleshoot the Problem “ImportError: Entry Point (‘console_scripts’, ‘Parquet’) Not Found”?

Problem scenario
You are trying to run Apache Parquet commands.  But each command gives this error:

Traceback (most recent call last):
  File "/usr/local/bin/parquet", line 11, in <module>
    load_entry_point('parquet==1.2', 'console_scripts', 'parquet')()
  File "/home/ubuntu/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 570, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/home/ubuntu/.local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2750, in load_entry_point
    raise ImportError("Entry point %r not found" % ((group, name),))
ImportError: Entry point ('console_scripts', 'parquet') not found
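A common remedy for this class of error, offered as a hedged suggestion rather than a definitive fix, is to reinstall the package so that its console-script entry points are regenerated:

# Reinstall the parquet package for the invoking user; this rewrites the
# entry-point metadata that the /usr/local/bin/parquet wrapper script loads.
pip install --user --force-reinstall parquet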

How Do You Troubleshoot a Problem with Adding a DataNode to a Hadoop Cluster?

Problem scenario
You are trying to add a DataNode to an existing Hadoop cluster.  There are numerous problems.  What do you do to troubleshoot the process?

Possible solutions
1.  New versions of Hadoop use a “workers” file, not a “slaves” file; make sure your DataNodes’ hostnames are listed in the correct file (see the sketch after this list).

2.  Do you have a DNS solution in place for your DataNode and NameNode to resolve each other?  If you do not have a DNS server, you can add static entries to the /etc/hosts file on each node so that the hostnames resolve.
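Here is a hedged sketch of both checks, assuming Hadoop 3.x installed under /usr/local/hadoop and hypothetical hostnames and addresses (namenode1, datanode1, 10.0.0.1, 10.0.0.2):

# 1. List each DataNode's hostname in the workers file (Hadoop 3.x);
#    older releases used a file named "slaves" in the same directory.
echo "datanode1" | sudo tee -a /usr/local/hadoop/etc/hadoop/workers

# 2. Without a DNS server, add static name resolution on every node.
echo "10.0.0.1 namenode1" | sudo tee -a /etc/hosts
echo "10.0.0.2 datanode1" | sudo tee -a /etc/hosts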

How Do You Get the Hadoop YARN Web UI to Work?

Problem scenario
You want to get Hadoop’s web UI to work.  You have access to the back end of a Linux server.  How do you get the front end of Hadoop (or HDFS) to work?

Solution
1.  Deploy Hadoop.  See this link for directions for any type of Linux. 
2.  Start the Hadoop services.  For a single-node deployment of Hadoop, use sudo find / -name start-dfs.sh to find it; then run that script (and the matching start-yarn.sh) to bring the services up, as sketched below.
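Here is a hedged sketch of the remaining steps, assuming a default single-node configuration (the script paths are illustrative, since find will report the real locations; the YARN ResourceManager web UI listens on port 8088 by default):

# Start HDFS and YARN (substitute the paths that find reported).
sudo /usr/local/hadoop/sbin/start-dfs.sh
sudo /usr/local/hadoop/sbin/start-yarn.sh

# Confirm the YARN web UI responds, then browse to http://<server>:8088/
curl -s http://localhost:8088/ | head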

How Do You See the Storage Usage on a DataNode in an HDFS System?

Problem scenario
You want to see which DataNodes are active under a given HDFS system.  You also want statistics about their storage usage.  If you are regularly adding data to your HDFS system, you want to stay below 70% utilization.  If you want your HDFS system to perform well but are not regularly adding new files, you want to stay under 80%.  How do you find out about the storage usage of the DataNodes that support your HDFS system?
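One common way to get this information, offered as a hedged suggestion rather than the definitive method, is the dfsadmin report, which lists each live DataNode along with its configured capacity, DFS used, and percentage used:

# Run as a user with HDFS superuser privileges (often the hdfs user).
hdfs dfsadmin -report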