One or more of the following is happening:
1) An error message reports that there are 0 DataNodes in your Hadoop cluster.
2) There is 0 B configured as capacity (as shown from a “hdfs dfsadmin -report” command).
3) There is one fewer DataNode in your Hadoop cluster than you expect.
4) You run “hdfs dfsadmin -report | grep Hostname” and do not see a node whose DataNode service (visible with the jps command) is started and stopped from the NameNode by the corresponding start-dfs.sh and stop-dfs.sh script runs.
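A quick way to check symptoms 1), 2), and 4) is to count how many DataNodes appear in the report. The sketch below parses a hypothetical sample of “hdfs dfsadmin -report” output (the hostnames and capacity shown are made up); on a real cluster, pipe the live command output instead, e.g. `hdfs dfsadmin -report | grep -c '^Hostname:'`.

```shell
# Hypothetical sample of "hdfs dfsadmin -report" output for illustration.
sample_report='Configured Capacity: 98268229632 (91.52 GB)
Live datanodes (2):
Hostname: worker1
Hostname: worker2'

# Count the nodes that actually reported in.
live_count=$(printf '%s\n' "$sample_report" | grep -c '^Hostname:')
echo "DataNodes reporting in: $live_count"
```

If the count is lower than the number of DataNodes you expect, check the missing node with `jps` to see whether its DataNode process is running at all.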
You have a multi-node cluster of Hadoop. You want to add a new data node. What do you do?
1. a) Log into the server that will become the new DataNode. Run the remaining sub-steps of step 1 on this server.
b) Install Hadoop on the new DataNode. If you do not know how, see this posting.
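Once Hadoop is installed on the new server, the NameNode must be told about it by listing its hostname in the workers file (named “slaves” on Hadoop 2.x). This sketch uses an example directory and example hostnames; substitute your real configuration directory (e.g., /usr/local/hadoop/etc/hadoop) and the new node’s hostname.

```shell
# Example configuration directory and hostnames for illustration only.
HADOOP_CONF=./demo-conf
mkdir -p "$HADOOP_CONF"
printf 'worker1\n' > "$HADOOP_CONF/workers"

NEW_NODE=worker2
# Append the new DataNode's hostname only if it is not already listed,
# so rerunning this step is harmless.
grep -qx "$NEW_NODE" "$HADOOP_CONF/workers" || echo "$NEW_NODE" >> "$HADOOP_CONF/workers"
cat "$HADOOP_CONF/workers"
```

After updating the file, restarting HDFS (stop-dfs.sh, then start-dfs.sh) from the NameNode picks up the new worker.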
What are the different acronym stacks in I.T.?
Many open source combinations of technologies are in wide use. Their acronyms, referred to as “full stacks” or simply “stacks,” appear in articles and job descriptions. A full stack is a bundle of software (including an OS) that, when properly configured, can create a complete and functional product.
Updated on 1/22/19
You want to install open source Hadoop. You may want a single-node or multi-node deployment with CentOS/RedHat/Fedora, Debian/Ubuntu, and/or SUSE Linux distributions. You want to have most of it scripted and have the same script work on any variety of Linux. How do you install Hadoop quickly with a script that works on almost any type of Linux?
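A script that targets CentOS/RedHat/Fedora, Debian/Ubuntu, and SUSE alike needs to detect which package manager is present before installing prerequisites. This is a sketch of that detection step; the function name is illustrative, not part of any Hadoop installer.

```shell
# Detect the system's package manager so one installer script can run on
# Debian/Ubuntu (apt-get), RedHat/CentOS/Fedora (yum or dnf), or SUSE (zypper).
detect_pkg_mgr() {
  for mgr in apt-get dnf yum zypper; do
    if command -v "$mgr" >/dev/null 2>&1; then
      echo "$mgr"
      return 0
    fi
  done
  echo "unknown"
}

PKG_MGR=$(detect_pkg_mgr)
echo "Detected package manager: $PKG_MGR"
```

The rest of the script can then branch on `$PKG_MGR` to install Java and other dependencies with the right commands for that distribution.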
What is Apache Parquet?
Apache Parquet is a columnar data representation/manipulation tool for the Hadoop ecosystem. Data in a given column is largely uniform (e.g., a long string of characters, a single character, or an integer): it repeats a specific type and format of data, whereas two cells in the same row may hold very dissimilar types of data.
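The row-versus-column distinction can be sketched with plain files (this is a conceptual illustration only, not the actual Parquet on-disk format): a row store interleaves every field of a record, while a column store keeps each field contiguous, so a query over one column reads far less data. The file names below are illustrative.

```shell
mkdir -p layout-demo
# Row-oriented layout: one line per record, all fields interleaved.
printf '1,alice,42\n2,bob,17\n3,carol,99\n' > layout-demo/rows.csv
# Column-oriented layout: one file per column, values of one type together.
printf '1\n2\n3\n'            > layout-demo/col_id
printf 'alice\nbob\ncarol\n'  > layout-demo/col_name
printf '42\n17\n99\n'         > layout-demo/col_score
# Summing the "score" column touches only col_score, not the other fields.
total=$(awk '{s+=$1} END {print s}' layout-demo/col_score)
echo "sum(score) = $total"
```

Parquet builds on this idea, adding per-column encoding and compression, which work well precisely because values within a column are so uniform.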
You want to install Apache Parquet on the Hadoop NameNode. What do you do?
This assumes that you have installed Hadoop. For directions, see this posting.
Run these commands:
sudo su -
apt-get -y install python-pip
pip install thriftpy
sudo apt-get -y install libsnappy-dev thrift-compiler
pip install python-snappy
curl https://pypi.python.org/packages/74/b5/bc459aab0566fc3cf3397467922c37411ab6e3361bab9e0ca165e1089ce8/parquet-1.2.tar.gz#md5=05aacec0620ac63ecd7dd77bf7fb9fee >
What is a “data swamp”?
A data swamp is best defined as a severely degraded data lake. The term connotes poor governance and negligent management that have caused a data lake to gradually lose its value. A data swamp is a data lake that was once useful but, through negligent utilization, can no longer be used even by highly talented analytics professionals.
You want to use Maven’s Apache Parquet plugin with Hadoop. How do you use these Apache technologies together?
1. Install HDFS. See this link if you are using Ubuntu. See this link if you are using a RedHat distribution of Linux. If you have more than one server and want a multi-node cluster of Hadoop, see this link for directions on how to deploy and configure it.
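To pull Parquet into a Maven-built project, the usual approach is declaring a Parquet artifact as a dependency in the project’s pom.xml. The snippet below is an example only; the version shown is illustrative, so check Maven Central for the latest release of org.apache.parquet:parquet-avro.

```xml
<!-- Example dependency for a project pom.xml; the version is illustrative. -->
<dependency>
  <groupId>org.apache.parquet</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.12.3</version>
</dependency>
```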
You want to learn more about low latency programming for working at a hedge fund company. Where do you begin?
An Amazon search for “low latency programming” brings up books that could also be very useful for other high-paying I.T. jobs in big data, such as a book on Apache Thrift or Building a Columnar Database on RAMCloud.
The Hadoop NameNode service won’t start. You run start-dfs.sh and it starts the SecondaryNameNode and the DataNode, but not the NameNode.
Assumption: This solution only works if you are running start-dfs.sh as a sudo user or as root itself.
Verify that you can ssh as root to localhost. Run this command twice without exiting from the first session: