Update 1/5/18: The directions below are outdated. For the newest directions to install open source Hadoop on an Ubuntu server, see this article. The directions below are left here as a reference for legacy purposes.
Previous update was on 11/7/17
Problem scenario
You want to deploy (set up and configure) Hadoop on an Ubuntu server. You are using Ubuntu in a public cloud like Azure or AWS. How do you install the open source version of Hadoop on an Ubuntu Linux server and get it working?
Solution
These "directions" to install Hadoop on Ubuntu Linux include a script and how to run it. The script was designed to install Hadoop 2.8 as a single-node (single server) deployment of Hadoop on an AWS instance of Ubuntu. (For a multi-node deployment, you can use these directions below for each Ubuntu node while skipping steps 8 and 9. For a multi-node deployment, you should be aware of this article.)
(These directions should work on an Ubuntu instance in Azure too.) For AWS this script requires that your AWS Ubuntu Linux server is in a security group that has access to the Internet. The script portion of these directions takes approximately three minutes to run in AWS or approximately nine minutes to run in Azure. But bandwidth and resources on your instance may vary.
1. Log into Ubuntu.
2. Run these commands interactively, as they are not easily or safely scripted:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser # Provide a password.
# Suggested name is Hadoop User.
# Accept the defaults thereafter and confirm the information if it is correct.
ssh-keygen -t rsa -P "" #Accept the default
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys # sudo is unnecessary here; both files belong to your user, and sudo would not apply to the redirection anyway
ssh localhost #Accept the fingerprint (default)
exit
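Before proceeding, you may want to verify that key-based SSH now works without a password prompt. This quick check uses the -o flag to disable password authentication for this one connection, so it fails cleanly instead of prompting:
ssh -o PasswordAuthentication=no localhost 'echo key-based SSH to localhost works'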
3. In a directory other than /tmp (as /tmp can be emptied upon reboot), create a script. Use the content below from "#!/bin/bash" to "...hdfs version". Be aware that this script could overwrite some files; it was intended for a freshly installed OS with no data or special configuration on it.
#!/bin/bash
#Written by continualintegration.com
apt-get update
sleep 5
apt-get install -y python-software-properties # on newer Ubuntu releases, software-properties-common replaces this package
apt-get install -y default-jre
apt-get install -y openjdk-8-jdk-headless
sleep 2
echo '
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1' >> /etc/sysctl.conf
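# Load the new settings now; appending to /etc/sysctl.conf alone does not
# change the running kernel, so IPv6 would otherwise stay enabled until a reboot.
sysctl -p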
cd /tmp/
wget http://www-us.apache.org/dist/hadoop/core/hadoop-2.8.1/hadoop-2.8.1.tar.gz # if this mirror no longer hosts 2.8.1, older releases are kept at http://archive.apache.org/dist/hadoop/core/
tar -xvf hadoop-2.8.1.tar.gz
mv hadoop-2.8.1 /usr/local/hadoop
cd /usr/local
chown -R hduser:hadoop hadoop
echo 'Here is a backup of .bashrc' >> bashrc.bak # the working directory is now /usr/local, so the backup is written to /usr/local/bashrc.bak
date >> bashrc.bak
cat /root/.bashrc >> bashrc.bak
echo 'export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
echo '
export HADOOP_HOME=/usr/local/hadoop
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
export PATH=$PATH:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin' >> /root/.bashrc
mkdir -p /fs/hadoop/tmp
chown hduser:hadoop /fs/hadoop/tmp
chmod 750 /fs/hadoop/tmp
sed -i '/${JAVA_HOME}/c\export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' /usr/local/hadoop/etc/hadoop/hadoop-env.sh
echo '
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HEAPSIZE=272' >> /usr/local/hadoop/etc/hadoop/hadoop-env.sh
echo '<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/fs/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name> <!-- deprecated alias of fs.defaultFS; still honored by Hadoop 2.x -->
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri has a scheme that determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri has authority that is used to
determine the host, port, etc. for a filesystem.</description>
</property>
</configuration>
' > /usr/local/hadoop/etc/hadoop/core-site.xml
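# The core-site.xml written above points hadoop.tmp.dir at /fs/hadoop/tmp
# (created earlier with hduser as its owner) and sets the default file system
# to hdfs://localhost:54310, the NameNode address for this single-node setup.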
echo 'Run two more steps.
Run this command: PATH="$PATH":/usr/local/hadoop/bin/
Then try this: hdfs version
'
4. sudo bash nameOfscript.sh # You must run the script above as a sudoer or as the root user.
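If you want a quick sanity check that the script finished its main work (these are the paths the script itself uses), you can run:
ls /usr/local/hadoop/bin/hdfs # the Hadoop binaries should be in place
grep JAVA_HOME /usr/local/hadoop/etc/hadoop/hadoop-env.sh # should show the OpenJDK 8 path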
5. Run this command: PATH="$PATH":/usr/local/hadoop/bin/
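Note that the PATH change in step 5 lasts only for the current shell session. To make it persist for your user (an optional convenience; step 7 below gives a system-wide alternative), you could append the same line to your ~/.bashrc:
echo 'PATH="$PATH":/usr/local/hadoop/bin/' >> ~/.bashrc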
6. Enter this command and press enter: hdfs version
7. Optional step. If you do not mind any user being able to run hadoop commands, do the following (a non-interactive equivalent appears after this list):
i. Go to /etc/profile.d/
ii. vi hadoop.sh
iii. Add this line: PATH="$PATH":/usr/local/hadoop/bin/
iv. Save the changes.
v. If you are deploying a multi-node cluster of Hadoop, do not do the steps below because you are done with these directions. There is more to configure for a multi-node cluster, but that is beyond the scope of these directions.
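If you prefer not to use vi, sub-steps i through iv can be done with one command; this sketch writes the same single line to the same file:
echo 'PATH="$PATH":/usr/local/hadoop/bin/' | sudo tee /etc/profile.d/hadoop.sh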
8. This step is only necessary for a single-node deployment of Hadoop. To get all the hdfs commands to work, one way is to allow root to ssh to the local server and run "sudo bash" to kick off scripts that start the Hadoop daemons. There are other ways, but here is how to do it with a reliance on the sudo user:
# This assumes you are the ubuntu user. The first directory in the "cat" command below would be different if you are not the "ubuntu" user.
sudo su -
ssh-keygen -t rsa -P "" # Press Enter to accept the default prompt.
cat /home/ubuntu/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
ssh root@localhost # Accept the fingerprint if prompted.
ssh root@localhost #do it again
exit # the second SSH session
exit # the first SSH session
exit # Go back to a non-root user
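At this point you may want to verify that root can ssh to localhost without being prompted. BatchMode makes ssh fail rather than prompt, so this is a clean pass-or-fail test:
sudo ssh -o BatchMode=yes root@localhost 'echo root can ssh to localhost'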
9. This step is only necessary for a single-node deployment of Hadoop. Complete the configuration and start the NameNode service:
sudo /usr/local/hadoop/bin/hdfs namenode -format
sudo bash /usr/local/hadoop/sbin/start-dfs.sh
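Once start-dfs.sh finishes, you can check that the HDFS daemons are running. The jps tool (included with the OpenJDK JDK the script installed) lists Java processes; for a single-node deployment you should typically see NameNode, DataNode, and SecondaryNameNode. sudo is used because the daemons were started as root:
sudo jps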