How Do You Copy a File into HDFS without the Error “No such file or directory”?

Problem scenario
You want to add a file to Hadoop. You run a basic Hadoop command to copy a file into HDFS, and you get this error:

copyFromLocal: `hdfs://localhost:54310/user/…': No such file or directory

How do you copy a file from your OS into HDFS?

Solution
Do one of the following:
Option 1. Run this command to create the destination directory in HDFS (replace “jdoe” with your username):

hdfs dfs -mkdir -p /user/jdoe/contint
# Now repeat your copy command
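
The error typically means the destination directory does not exist in HDFS yet. The create-then-copy sequence above could be wrapped in a small script, sketched here. The function names `hdfs_target_dir` and `copy_into_hdfs` and the file name `data.txt` are hypothetical; only `hdfs dfs -mkdir -p` and `hdfs dfs -copyFromLocal` are actual Hadoop commands.

```shell
#!/bin/sh
# Build the HDFS directory path for a user and an optional subdirectory.
# (Hypothetical helper; the /user/<name> layout follows the command above.)
hdfs_target_dir() {
    user="$1"
    subdir="${2:-}"
    if [ -n "$subdir" ]; then
        printf '/user/%s/%s\n' "$user" "$subdir"
    else
        printf '/user/%s\n' "$user"
    fi
}

# Create the directory if needed, then copy a local file into it.
copy_into_hdfs() {
    local_file="$1"
    target_dir="$2"
    if ! command -v hdfs >/dev/null 2>&1; then
        echo "hdfs command not found on PATH" >&2
        return 127
    fi
    hdfs dfs -mkdir -p "$target_dir" &&
        hdfs dfs -copyFromLocal "$local_file" "$target_dir/"
}

# Example usage (commented out so that sourcing this file has no side effects):
# copy_into_hdfs data.txt "$(hdfs_target_dir jdoe contint)"
```

Building the path in one place keeps the directory passed to `-mkdir` and the destination passed to `-copyFromLocal` identical, which is the mismatch that usually triggers the error.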

Option 2. 

How Do You Troubleshoot a Fatal HDFS Error?

Problem scenario
You run an hdfs command and you get this:

[Fatal Error] core-site.xml:2:6: The processing instruction target matching "[xX][mM][lL]" is not allowed.
17/09/25 04:21:00 FATAL conf.Configuration: error parsing conf core-site.xml
org.xml.sax.SAXParseException; systemId: file:/home/hadoop/hadoop/etc/hadoop/core-site.xml; lineNumber: 2; columnNumber: 6; The processing instruction target matching "[xX][mM][lL]" is not allowed.
        at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
        at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2531)
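
This parser message generally means that something precedes the XML declaration. The `<?xml ...?>` line must be the very first thing in the file, and here the parser met it on line 2. Below is a minimal sketch of a check, assuming a shell where `head -c` is available; the function name `starts_with_xml_decl` is hypothetical.

```shell
#!/bin/sh
# Succeeds only if the first five bytes of the file are "<?xml".
starts_with_xml_decl() {
    [ "$(head -c 5 "$1")" = "<?xml" ]
}

# Report on a config file passed as the first argument, if any.
if [ "$#" -gt 0 ]; then
    if starts_with_xml_decl "$1"; then
        echo "$1: begins with the XML declaration"
    else
        echo "$1: does not begin with <?xml (blank line, space, or BOM first?)"
    fi
fi
```

If the check fails for the core-site.xml named in the error, removing any blank line or whitespace before `<?xml` is a reasonable first thing to try.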
       

How Do You Install R on a RHEL Instance of AWS?

Problem scenario
You have a Red Hat Enterprise Linux (RHEL) instance. You want to install the R programming language. What do you do?

Solution
1.  Run these six commands:

sudo yum -y install wget
cd /tmp
wget ftp://rpmfind.net/linux/epel/7/x86_64/e/epel-release-7-10.noarch.rpm
sudo rpm -ivh epel-release-7-10.noarch.rpm
sudo yum-config-manager --enable rhui-REGION-rhel-server-extras rhui-REGION-rhel-server-optional
sudo yum -y install R

2.  This is an optional step to complete these instructions. 
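
To confirm that the installation worked, a quick check like the following may help. It is only a sketch: `require_cmd` is a hypothetical helper name, while `R --version` is the usual way to print the installed R version.

```shell
#!/bin/sh
# Return success if the named command is available on PATH.
require_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if require_cmd R; then
    # Print only the first line of the version banner.
    R --version | head -n 1
else
    echo "R is not on PATH; re-check the EPEL and yum steps above" >&2
fi
```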

How Do You Connect to Your Apache Spark Deployment in AWS?

Problem scenario
You have recently deployed Apache Spark to AWS. You can see that the EC2 instances were created, but you cannot reach them through the web UI (even over ports 4140, 8088, or 50070), and you cannot access the instances via PuTTY. You changed your normal Security Group to allow TCP communication from your workstation’s IP address. What should you do to connect to your new Spark instance for the first time?

How Do You Deploy an Apache Spark Cluster in AWS?

Problem scenario
You want to deploy Apache Spark to AWS.  How do you do this?

Solution
1.  Log in to the AWS management console.  Once in, go to this link.

2.  Click “Create cluster” and then “Quick Create”

3. For Software Configuration, choose “Spark:…”

4. For “Security and access”, for EC2 key pair, choose the key pair you desire.
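
For readers who prefer the command line, the console steps above might correspond to an AWS CLI call along the following lines. This is a sketch under assumptions: it assumes the console page in question is Amazon EMR, and the cluster name, key pair name, release label, instance type, and instance count are all placeholders rather than values from this article. The function only prints the command so it can be reviewed before anything is run.

```shell
#!/bin/sh
# Print (not run) an AWS CLI command roughly equivalent to the console steps.
# Arguments: cluster name, EC2 key pair name, EMR release label (all placeholders).
emr_spark_command() {
    cluster_name="$1"
    key_pair="$2"
    release_label="$3"
    printf 'aws emr create-cluster --name %s --release-label %s --applications Name=Spark --ec2-attributes KeyName=%s --instance-type m4.large --instance-count 3 --use-default-roles\n' \
        "$cluster_name" "$release_label" "$key_pair"
}

# Example: show the command for a hypothetical cluster and key pair.
emr_spark_command my-spark-cluster my-key-pair emr-5.8.0
```

Printing the command first makes it easy to adjust the placeholders before creating billable resources.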

A List of Hadoop Books

Advanced Analytics with Spark: Patterns for Learning from Data at Scale by Sandy Ryza, Uri Laserson, Sean Owen and Josh Wills
Agile Data Science: Building Data Analytics Applications with Hadoop by Russell Jurney
Apache Drill: The SQL query engine for Hadoop and NoSQL by Ted Dunning, Ellen Friedman, Tomer Shiran and Jacques Nadeau
Apache Flume: Distributed Log Collection for Hadoop - Second Edition by Steve Hoffman
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 (Addison-Wesley Data &

A List of Apache Spark Books

99 Apache Spark Interview Questions for Professionals by Yogesh Kumar
Advanced Analytics with Spark: Patterns for Learning from Data at Scale by Juliet Hougland, Uri Laserson, Sean Owen, Sandy Ryza and Josh Wills
Apache Spark 2 for Beginners by Rajanarayanan Thottuvaikkatumana
Apache Spark in 24 Hours, Sams Teach Yourself by Jeffrey Aven
Apache Spark for Data Science Cookbook by Padma Priya Chitturi
Apache Spark for Java Developers by Sumit Kumar and Sourav Gulati
Apache Spark Graph Processing by Rindra Ramamonjison
Apache Spark Interview Question &