Big Data Quiz with Answers

1.  What does EDH stand for?

a.  Enterprise Data Hub
b.  Extract Develop Hadoop
c.  Extract Decide Haul
d.  Extract Data Hadoop

Answer:  a.  Sources:
http://searchbusinessanalytics.techtarget.com/feature/Hadoop-2-YARN-set-to-shake-up-data-management-and-analytics (This link used to work but no longer does.)
https://vision.cloudera.com/practical-uses-of-an-edh/

2.  Gartner, Informatica and MapR think "data lakes" should be referred to as what?

a.  data warehouses
b.  data dams
c.  data mills
d.  data reservoirs

Answer:  d.  Sources:
https://blogs.informatica.com/2015/02/11/data-streams-data-lakes-data-reservoirs-large-data-bodies/
https://mapr.com/solutions/enterprise/marketing-optimization/
https://infocus.emc.com/william_schmarzo/data-lake-data-reservoir-data-dumpblah-blah-blah/

3.  MapReduce is to Hadoop as ___________ is to Spark

a.  Storm
b.  Vertice Algorithm
c.  Directed Acyclic Graph
d.  RDD
e.  Memory

Answer: c.

The DAG is integral to Spark, just as MapReduce is integral to Hadoop.  The quote "MapReduce™ is the heart of Apache™ Hadoop®." was found on IBM's site.  The quote "Each Spark job creates a DAG of task stages to be performed on the cluster." was found on a Hortonworks site.

See these links for more information:
https://www.quora.com/What-are-the-Apache-Spark-concepts-around-its-DAG-Directed-Acyclic-Graph-execution-engine-and-its-overall-architecture
http://data-flair.training/blogs/dag-in-apache-spark/
http://data-flair.training/blogs/apache-spark-vs-hadoop-mapreduce/

4.  RDD stands for what in Spark?

a.  Really Different Data
b.  Resilient Distributed Dataset
c.  Real Developed Data
d.  Reliable Data Distribution

Answer:  b.  Source:  http://data-flair.training/blogs/apache-spark-rdd-tutorial/

5.  Which three local file systems are recommended for use underneath HDFS?

a.  cifs
b.  ext3
c.  ext4
d.  gfs
e.  hfs
f.  JFS
g.  nfs
h.  reiserfs
i.  vfat
j.  XFS

Answers:  b, c and j.  For more information, see these sources:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/install_cdh_file_system.html
https://community.hortonworks.com/articles/14508/best-practices-linux-file-systems-for-hdfs.html

6.  If a Hadoop cluster had nodes that cost $15,000 each, would an HP Vertica or a Teradata solution cost more or less?  Choose two.

a.  HP Vertica would be cheaper
b.  HP Vertica would be more expensive
c.  Teradata would be cheaper
d.  Teradata would be more expensive

Answer:  a and c.  Source:  page 27 of Managing Big Data Workflow for Dummies by Joe Goldberg and Lillian Pierson, published by John Wiley & Sons, Inc. in 2016.

7.  What is "a scalable and fault-tolerant stream processing engine built on the Spark SQL engine"?

a.  Structured streaming
b.  Beam
c.  Continual application
d.  Storm

Answer:  a.  For more information, see this link: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

8.  What is a framework that allows you to implement streaming and batch data processing jobs that can run on any execution engine?

a.  Apache Apex
b.  Apache Beam
c.  Apache Cassandra
d.  Apache Flink
e.  Apache Storm

Answer: b.  See this link https://beam.apache.org/ for more information.

9.  Which of the following do not need Hadoop?  Choose two.

a.  Apache Apex
b.  Apache Flink
c.  Apache Spark
d.  Apache Tez

Answers: b. and c.

Why not Apex? "Apex is designed to run in your existing Hadoop ecosystem, using YARN to scale up or down as required and leveraging HDFS for fault tolerance."  https://www.infoworld.com/article/3059284/application-development/look-out-spark-and-storm-here-comes-apache-apex.html

To understand why b. is one correct answer, read this:  "Flink is independent of Apache Hadoop and runs without any Hadoop dependencies."  This quote was taken from an external page (https://flink.apache.org/faq.html#how-does-flink-relate-to-the-hadoop-stack) that is no longer up.  This link corroborates the answer: https://issues.apache.org/jira/browse/FLINK-4315

To understand why c. is one correct answer, read the following:

Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.

This question and answer were taken from the FAQ on the Apache Spark website.

Why not Apache Tez?  It requires Hadoop YARN, according to this site: https://tez.apache.org/

10.  Which of the following is a "Hadoop YARN native platform" (thus dependent on Hadoop) and a type of "unified stream and batch processing engine"?

a.  Apache Apex
b.  Apache Beam
c.  Apache Cassandra
d.  Apache Delta
e.  Apache Flink

Answer: a.  See this link http://apex.apache.org/ for more information.

11.  What company, founded by the people who invented Apache Spark, provides a commercial version of it?

a.  Data Pipeline Gurus, LLC
b.  Databricks
c.  Hotfire Software
d.  Zephyr Data

Answer: b.  The founders of Apache Spark started Databricks (https://www.washingtonpost.com/news/the-switch/wp/2016/06/09/this-is-where-the-real-action-in-artificial-intelligence-takes-place/?utm_term=.ac9e0cea115f).

12.  What is Microsoft's version of Hadoop?

a.  MS Knowledge
b.  BigTable
c.  HDInsight
d.  Datica
e.  Kinesis

Answer: c.  Sources:  http://www.itprotoday.com/microsoft-sql-server/use-ssis-etl-hadoop and https://azure.microsoft.com/en-us/services/hdinsight

13. Which of the following are examples of a Directed Acyclic Graph?

a. A typical ETL process
b. The npm package manager
c. YARN
d. Spark operating on RDDs via stages, each of which is made up of tasks
e. Apache Airflow's pythonic schedule of phases for dynamic processing
f. All of the above
g. None of the above

Answer: f. All of the above.

a. because of 1) http://www.cs.uoi.gr/~pvassil/publications/2009_DB_encyclopedia/Extract-Transform-Load.pdf and 2) https://www.d-one.ai/documents/Topological-sorting-and-the-ETL-process-Joonas-Asikainen-D1-Solutions-Zuerich.pdf
b. because of https://medium.com/basecs/spinning-around-in-cycles-with-directed-acyclic-graphs-a233496d4688
c. because of https://medium.com/basecs/spinning-around-in-cycles-with-directed-acyclic-graphs-a233496d4688
d. because of https://data-flair.training/blogs/dag-in-apache-spark/
e. because of https://bigdata-etl.com/apache-airflow-create-dynamic-dag/
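
As a rough illustration of why an ETL process fits the DAG model, here is a minimal Python sketch that topologically sorts a small set of ETL steps with the standard library's graphlib module (Python 3.9+). The step names and the `execution_order` helper are hypothetical and are not taken from any of the sources above.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical ETL steps: each key maps to the set of steps it depends on.
etl_dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
}

def execution_order(dependencies):
    """Return one valid run order; raise CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dependencies).static_order())

order = execution_order(etl_dependencies)
print(order)

# Adding an edge back from the last step to the first creates a cycle,
# so the graph is no longer a DAG and no valid run order exists.
cyclic = dict(etl_dependencies, extract_orders={"load_warehouse"})
try:
    execution_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

Every step runs only after the steps it depends on, which is only possible because the dependency graph has no cycles.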

14. At what stage in the MapReduce process does the "shuffle" phase happen?

a. Before the map stage
b. After the map stage and before the reduce stage
c. After the reduce stage
d. None of the above

Answer: b. Source: page 643 of Cracking the Coding Interview.
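
To make the ordering concrete, here is a toy, single-process Python sketch of a word count with the three phases written out as separate functions. It is only an assumption-laden illustration of the idea: the function names are hypothetical, and real Hadoop performs the shuffle across the network between many mapper and reducer processes.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values by key (sorted) so each key reaches one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Reduce: combine the list of values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big clusters", "data lakes"]
mapped = list(map_phase(lines))    # runs first
shuffled = shuffle_phase(mapped)   # runs after map, before reduce
counts = reduce_phase(shuffled)    # runs last
print(counts)
```

The reduce step cannot start on a key until the shuffle has gathered every value for that key from the map output, which is why the shuffle sits between the two.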

15. How does Hadoop support high availability for your namenode?

a. Via the secondary namenode
b. A standby namenode only in proprietary Hadoop versions
c. A standby namenode in open source or proprietary Hadoop versions
d. N/A. There is no native Hadoop support for highly available namenodes

Answer: c. See https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
For why a. is wrong, see http://hadooptutorial.info/secondary-namenode-in-hadoop/
