How Do You Install Hadoop with a Script for Any Type of Linux Server?

Updated on 1/22/19

Problem scenario

You want to install open source Hadoop as a single-node or multi-node deployment on CentOS/RedHat/Fedora, Debian/Ubuntu, and/or SUSE Linux distributions.  You want most of the process scripted, with one script that works on any of these varieties of Linux.  How do you install Hadoop quickly with a script that works on almost any type of Linux?

Solution
1.  Log into Linux.     
2.a.  This step involves manual commands that depend on your type of Linux.  To find your distribution type, use the command:  cat /etc/*-release

Do one of 2.b., 2.c., or 2.d. depending on the output of the above command.
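As a sketch of how a script can branch on that output, here is an illustrative helper (the classify function and the sample NAME strings are for demonstration only, not part of the installation):

```shell
# Illustrative helper: classify a distro family from an os-release NAME line.
# On a real host you would feed it the output of: grep '^NAME=' /etc/os-release
classify() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    *ubuntu*|*debian*)             echo deb ;;
    *"red hat"*|*centos*|*fedora*) echo rh ;;
    *suse*)                        echo suse ;;
    *)                             echo unknown ;;
  esac
}
classify 'NAME="Ubuntu"'         # prints deb
classify 'NAME="openSUSE Leap"'  # prints suse
```

The script in step 4 below uses the same grep-based idea, just with separate flag variables per family.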

2.b.  If you are running a RedHat distribution (e.g., CentOS or Fedora), run these two commands:
sudo adduser hduser
sudo passwd hduser
  #respond with the password of your choice

2.c.i.  If you are running Ubuntu or a Debian distribution, do the following:
sudo adduser hduser  # Answer how you wish to the next prompts.
# For example, you could enter a password twice according to the prompts, then just press enter five times and press "Y" (with no quotes).

2.c.ii.  (This only applies if you are running Ubuntu or Debian).  Open /home/hduser/.bashrc (e.g., with the vi command).  Add four new lines at the very bottom with this content:
PATH="$PATH":/usr/local/hadoop/bin/
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/"
export HADOOP_COMMON_LIB_NATIVE_DIR="/usr/local/hadoop/lib/native/"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
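If you prefer not to edit the file interactively, the same four lines can be appended with a quoted heredoc (quoting EOF keeps $PATH and the other variables literal). This sketch writes to a temporary file; on the real host the target would be /home/hduser/.bashrc:

```shell
# Append the four lines from step 2.c.ii without opening an editor.
target=$(mktemp)   # stand-in for /home/hduser/.bashrc
cat >> "$target" <<'EOF'
PATH="$PATH":/usr/local/hadoop/bin/
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/usr/local/hadoop/lib/"
export HADOOP_COMMON_LIB_NATIVE_DIR="/usr/local/hadoop/lib/native/"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
EOF
wc -l < "$target"   # 4
```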

2.d.i.  If you are running SUSE Linux, run these commands:
sudo useradd hduser
sudo passwd hduser
sudo mkdir -p /home/hduser/.ssh
sudo groupadd hadoop; sudo usermod -g hadoop hduser  
sudo chown -R hduser:hadoop /home/hduser/.ssh

sudo chown -R hduser /usr/local/hadoop

2.d.ii.   (This only applies if you are running SUSE Linux).  Create /home/hduser/.bashrc with this line (or add it as a new line at the very bottom of the existing file):
PATH="$PATH":/usr/local/hadoop/bin/

3.  The rest of these directions should work regardless of your Linux distribution type.  Run these commands:

sudo groupadd hadoop; sudo usermod -g hadoop hduser  # skip this step if running SUSE
su hduser # respond with the password entered earlier
ssh-keygen -t rsa -P "" # press enter to the prompt
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
ssh 127.0.0.1   # respond with 'yes' (with no quote marks) to accept the fingerprint
exit # end the SSH section
exit # return to your regular user account
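The key-generation part of step 3 can also be done with no prompts at all. This sketch uses a throwaway directory so it can run anywhere; a real run would target /home/hduser/.ssh as hduser, and you would still perform the ssh 127.0.0.1 fingerprint check afterward:

```shell
# Passwordless-SSH key setup from step 3, non-interactively.
keydir=$(mktemp -d)                              # stand-in for /home/hduser/.ssh
ssh-keygen -t rsa -N "" -f "$keydir/id_rsa" -q   # empty passphrase, no prompts
cat "$keydir/id_rsa.pub" >> "$keydir/authorized_keys"
chmod 600 "$keydir/authorized_keys"
```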

4.  Create a script (e.g., /tmp/install.sh) with the content below from "#!/bin/bash" to the final "echo ..." statement. 

#!/bin/bash
# Written by continualintegration.com

version=3.3.0   #change here if there is a higher version of Hadoop than 3.3.0 available

distro=$(cat /etc/*-release | grep NAME)

debflag=$(echo $distro | grep -i "ubuntu")
if [ -z "$debflag" ]
then   # If it is not Ubuntu, test if it is Debian.
  debflag=$(echo $distro | grep -i "debian")
  echo "determining Linux distribution..."
else
   echo "You have Ubuntu Linux!"
fi

rhflag=$(echo $distro | grep -iE "red ?hat")
if [ -z "$rhflag" ]
then   #If it is not RedHat, see if it is CentOS or Fedora.
  rhflag=$(echo $distro | grep -i "centos")
  if [ -z "$rhflag" ]
    then    #If it is neither RedHat nor CentOS, see if it is Fedora.
    echo "It does not appear to be CentOS or RHEL..."
    rhflag=$(echo $distro | grep -i "fedora")
    fi
fi

if [ -z "$rhflag" ]
  then
  echo "...still determining Linux distribution..."
else
  echo "You have a RedHat distribution (e.g., CentOS, RHEL, or Fedora)"
  yum install -y java-1.8.0-openjdk-devel
fi

if [ -z "$debflag" ]
then
  echo "...still determining Linux distribution..."
else
   echo "You are using either Ubuntu Linux or Debian Linux."
   apt-get -y update # This is necessary on new AWS Ubuntu servers.
   apt-get -y install openjdk-8-jdk-headless default-jre
fi

suseflag=$(echo $distro | grep -i "suse")
if [ -z "$suseflag" ]
then
  if [ -z "$debflag" ]
  then
    if [ -z "$rhflag" ]
      then
      echo "*******************************************"
      echo "Could not determine the Linux distribution!"
      echo "Installation aborted. Nothing was done."
      echo "*******************************************"
      exit 1
    fi
  fi
else
   groupadd hadoop
   zypper -n install java java-1_8_0-openjdk-devel
fi

cd /tmp
curl -fL http://apache.claz.org/hadoop/common/hadoop-$version/hadoop-$version.tar.gz -o hadoop-$version.tar.gz
tar xzf hadoop-$version.tar.gz
mv hadoop-$version /usr/local/hadoop
echo '
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin' >> ~/.bashrc
source ~/.bashrc
sed -i '/export JAVA_HOME/c\export JAVA_HOME=/usr' /usr/local/hadoop/etc/hadoop/hadoop-env.sh
echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri has a scheme that determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri has authority that is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
' > /usr/local/hadoop/etc/hadoop/core-site.xml

chown -R hduser:hadoop /usr/local/hadoop/

mkdir -p /app/hadoop/
chown -R hduser:hadoop /app/hadoop/

echo 'PATH="$PATH":/usr/local/hadoop/bin/' >> /etc/profile.d/hadoop.sh

echo "Hadoop should now be in /usr/local/hadoop/ directory.  You may want to double check."
hv=$(/usr/local/hadoop/bin/hdfs version)
echo "*****Hadoop version below*****"
echo $hv
echo "*****Hadoop version above*****"
echo "Proceed with the manual steps.  hdfs commands should be run as hduser."
echo "You are done installing Hadoop on this server."

if [ -z "$debflag" ]
then
  echo "You are done installing Hadoop on this server."
else
  rm /etc/profile.d/hadoop.sh  # Not needed for Ubuntu/Debian
  echo "If you do not reboot, and use 'su hduser' you will need to run this command:"
  echo 'PATH="$PATH":/usr/local/hadoop/bin/'
  echo 'Then hdfs commands should work.'
fi

chown -R hduser:hadoop /usr/local/hadoop/logs
echo "********************"
echo "You will have to log out and log back in for the environment variables to work.  Moreover this was configured for the hduser account."
echo "You may want to run 'su hduser' to run the 'hdfs version' command as a test."
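The sed line near the top of the script rewrites the JAVA_HOME assignment in hadoop-env.sh. To see exactly what it does, here is a sketch run against a temporary stand-in file (the real file is /usr/local/hadoop/etc/hadoop/hadoop-env.sh, and /usr is shown here as the Java prefix):

```shell
# Demonstrate the JAVA_HOME rewrite on a temporary stand-in file.
f=$(mktemp)
echo '# export JAVA_HOME=' > "$f"   # hadoop-env.sh ships this line commented out
sed -i '/export JAVA_HOME/c\export JAVA_HOME=/usr' "$f"
cat "$f"   # prints: export JAVA_HOME=/usr
```

The sed `c` command replaces the entire matching line, which is why the leading `#` disappears along with the old value.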

5.  Run this command to do the installation:
sudo bash /tmp/install.sh  # You must run script above as a sudoer or the root user.
# You can ignore an error about /usr/local/hadoop/logs not being found if you are using Ubuntu/Debian Linux.
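A quick, optional sanity check after running the script (the path below is the install location this guide assumes):

```shell
# Report whether the hdfs binary landed where the script puts it.
hadoop_bin=/usr/local/hadoop/bin/hdfs
if [ -x "$hadoop_bin" ]; then
  status="installed: $("$hadoop_bin" version 2>/dev/null | head -n 1)"
else
  status="hdfs not found at $hadoop_bin; re-check the curl and tar steps"
fi
echo "$status"
```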

#If you are following multi-node Hadoop deployment directions, skip step 6.  See the posting on setting up a multi-node Hadoop cluster for those directions.

6.  For a single-node deployment, log in as hduser (e.g., su hduser).  Run these four commands:

hdfs namenode -format   
bash /usr/local/hadoop/sbin/start-dfs.sh
bash /usr/local/hadoop/sbin/start-yarn.sh
jps # To see the node services that are running

FYI
With Hadoop 3.0.0 and later, the masters and slaves files no longer need to be configured; instead, a workers file is configured.  DataNodes do not need to be able to SSH to the NameNode, nor do the DataNodes need to be able to SSH to each other.
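For reference, a workers file is just a list of DataNode hostnames, one per line. The hostnames below (worker1, worker2) are placeholders, and this sketch writes to a temporary path rather than the real /usr/local/hadoop/etc/hadoop/workers:

```shell
# Sketch of a Hadoop 3 workers file for a hypothetical two-DataNode cluster.
workers_file=$(mktemp)   # real path: /usr/local/hadoop/etc/hadoop/workers
printf '%s\n' worker1 worker2 > "$workers_file"
cat "$workers_file"
```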

Do you want to configure a multi-node Hadoop cluster?  See this link for directions.
