Big Data Quiz
1. What does EDH stand for?
a. Enterprise Data Hub
b. Extract Develop Hadoop
c. Extract Decide Haul
d. Extract Data Hadoop
Answer: a. Sources:
http://searchbusinessanalytics.techtarget.com/feature/Hadoop-2-YARN-set-to-shake-up-data-management-and-analytics (This link previously worked but is no longer live.)
https://vision.cloudera.com/practical-uses-of-an-edh/
2. Gartner, Informatica and MapR think "data lakes" should be referred to as what?
a. data warehouses
b. data dams
c. data mills
d. data reservoirs
Answer: d. Sources:
https://blogs.informatica.com/2015/02/11/data-streams-data-lakes-data-reservoirs-large-data-bodies/
https://mapr.com/solutions/enterprise/marketing-optimization/
https://infocus.emc.com/william_schmarzo/data-lake-data-reservoir-data-dumpblah-blah-blah/
3. MapReduce is to Hadoop as ___________ is to Spark.
a. Storm
b. Vertex Algorithm
c. Directed Acyclic Graph
d. RDD
e. Memory
Answer: c.
The DAG (directed acyclic graph) is as integral to Spark as MapReduce is to Hadoop. The quote "MapReduce™ is the heart of Apache™ Hadoop®." was found on IBM's site, and the quote "Each Spark job creates a DAG of task stages to be performed on the cluster." was found on a Hortonworks site. A small sketch of how Spark builds a DAG follows this question's links.
See these links for more information:
https://www.quora.com/What-are-the-Apache-Spark-concepts-around-its-DAG-Directed-Acyclic-Graph-execution-engine-and-its-overall-architecture
http://data-flair.training/blogs/dag-in-apache-spark/
http://data-flair.training/blogs/apache-spark-vs-hadoop-mapreduce/
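For illustration (not from the sources above), here is a minimal PySpark sketch, assuming PySpark is installed and run in local mode, of how chained transformations are only recorded as a DAG of stages until an action triggers execution:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("dag-sketch").getOrCreate()
lines = spark.sparkContext.parallelize(["hadoop mapreduce", "spark dag", "spark rdd"])

# Each transformation below only extends the DAG; nothing runs yet.
counts = (lines.flatMap(lambda line: line.split())  # same stage
               .map(lambda word: (word, 1))         # same stage
               .reduceByKey(lambda a, b: a + b))    # shuffle boundary -> new stage

print(counts.toDebugString().decode())  # the lineage (the DAG Spark has built)
print(counts.collect())                 # action: now the DAG is actually executed
spark.stop()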
4. RDD stands for what in Spark?
a. Really Different Data
b. Resilient Distributed Dataset
c. Real Developed Data
d. Reliable Data Distribution
Answer: b. Source: http://data-flair.training/blogs/apache-spark-rdd-tutorial/
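Purely as an illustration (assuming PySpark is installed and run in local mode), a Resilient Distributed Dataset can be created and inspected like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-sketch").getOrCreate()

# An RDD is a partitioned, immutable collection; lost partitions can be
# recomputed from their lineage, which is what makes it "resilient".
rdd = spark.sparkContext.parallelize(range(10), numSlices=4)
print(rdd.getNumPartitions())          # 4
print(rdd.map(lambda x: x * x).sum())  # 285
spark.stop()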
5. Which three local file systems are recommended to run HDFS on top of?
a. cifs
b. ext3
c. ext4
d. gfs
e. hfs
f. JFS
g. nfs
h. reiserfs
i. vfat
j. XFS
Answers: b, c, and j. For more information, see these sources:
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/install_cdh_file_system.html
https://community.hortonworks.com/articles/14508/best-practices-linux-file-systems-for-hdfs.html
6. If a Hadoop cluster had nodes that cost $15,000 each, would an HP Vertica or a Teradata solution cost more or less than that cluster? Choose two.
a. HP Vertica would be cheaper
b. HP Vertica would be more expensive
c. Teradata would be cheaper
d. Teradata would be more expensive
Answer: a and c. Source: page 27 of Managing Big Data Workflow for Dummies by Joe Goldberg and Lillian Pierson, published by John Wiley & Sons, Inc., in 2016.
7. What is "a scalable and fault-tolerant stream processing engine built on the Spark SQL engine."?
a. Structured streaming
b. Beam
c. Continual application
d. Storm
Answer: a. For more information see this link https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html.
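As an illustrative sketch (assuming PySpark is installed; the built-in "rate" source is used so no external system is needed), a Structured Streaming query on the Spark SQL engine can look like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# The "rate" source simply emits (timestamp, value) rows at a fixed rate.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# A running count over the stream, written to the console in complete mode.
query = (stream.groupBy().count()
               .writeStream
               .outputMode("complete")
               .format("console")
               .start())

query.awaitTermination(10)  # let it run for roughly 10 seconds
query.stop()
spark.stop()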
8. What is a framework that allows you to implement streaming and batch data processing jobs that can run on any execution engine?
a. Apache Apex
b. Apache Beam
c. Apache Cassandra
d. Apache Flink
e. Apache Storm
Answer: b. See this link https://beam.apache.org/ for more information.
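A minimal sketch with the Apache Beam Python SDK (assuming apache-beam is installed); the same pipeline code can be handed to different runners, such as the local DirectRunner used here or the Flink, Spark, and Dataflow runners:
import apache_beam as beam

# With no runner specified, Beam falls back to the local DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["hadoop", "spark", "beam", "beam"])
        | "PairWithOne" >> beam.Map(lambda word: (word, 1))
        | "CountPerWord" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )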
9. Which of the following do not need Hadoop? (Choose two.)
a. Apache Apex
b. Apache Flink
c. Apache Spark
d. Apache Tez
Answers: b. and c.
Why not Apex? "Apex is designed to run in your existing Hadoop ecosystem, using YARN to scale up or down as required and leveraging HDFS for fault tolerance." https://www.infoworld.com/article/3059284/application-development/look-out-spark-and-storm-here-comes-apache-apex.html
To understand why b. is one correct answer, read this: "Flink is independent of Apache Hadoop and runs without any Hadoop dependencies." The quote came from the Flink FAQ (https://flink.apache.org/faq.html#how-does-flink-relate-to-the-hadoop-stack), which is no longer up. This link corroborates the answer: https://issues.apache.org/jira/browse/FLINK-4315
To understand why c. is one correct answer, read the following Q&A, taken from Apache's website (a sketch of Spark running without Hadoop follows this question):
Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have this type of filesystem, you can just deploy Spark in standalone mode.
Why not Apache Tez? It requires Hadoop YARN according to this site https://tez.apache.org/.
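To make the Spark answer concrete, here is a minimal sketch assuming only PySpark is installed and no Hadoop is present:
from pyspark.sql import SparkSession

# Local mode needs only a JVM and the local filesystem: no YARN, no HDFS.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("spark-without-hadoop")
         .getOrCreate())

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
print(df.count())  # 2
spark.stop()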
10. Which of the following is a "Hadoop YARN native platform" (thus dependent on Hadoop) and a type of "unified stream and batch processing engine"?
a. Apache Apex
b. Apache Beam
c. Apache Cassandra
d. Apache Delta
e. Apache Flink
Answer: a. See this link http://apex.apache.org/ for more information.
11. Which company, founded by the people who invented Apache Spark, provides a commercial version of Spark?
a. Data Pipeline Gurus, LLC
b. Databricks
c. Hotfire Software
d. Zephyr Data
Answer: b. The creators of Apache Spark founded Databricks (https://www.washingtonpost.com/news/the-switch/wp/2016/06/09/this-is-where-the-real-action-in-artificial-intelligence-takes-place/?utm_term=.ac9e0cea115f).
12. What is Microsoft's version of Hadoop?
a. MS Knowledge
b. BigTable
c. HDInsight
d. Datica
e. Kinesis
Answer: c. Sources: http://www.itprotoday.com/microsoft-sql-server/use-ssis-etl-hadoop
and https://azure.microsoft.com/en-us/services/hdinsight
13. Which of the following are examples of a directed acyclic graph (DAG)?
a. A typical ETL process
b. The npm package manager
c. YARN
d. Spark operating on RDDs via stages, which involve sub-tasks
e. Apache Airflow's Python-defined schedule of phases for dynamic processing
f. All of the above
g. None of the above
Answer: f. All of the above.
a. because of 1) http://www.cs.uoi.gr/~pvassil/publications/2009_DB_encyclopedia/Extract-Transform-Load.pdf and 2) https://www.d-one.ai/documents/Topological-sorting-and-the-ETL-process-Joonas-Asikainen-D1-Solutions-Zuerich.pdf
b. because of https://medium.com/basecs/spinning-around-in-cycles-with-directed-acyclic-graphs-a233496d4688
c. because of https://medium.com/basecs/spinning-around-in-cycles-with-directed-acyclic-graphs-a233496d4688
d. because of https://data-flair.training/blogs/dag-in-apache-spark/
e. because of https://bigdata-etl.com/apache-airflow-create-dynamic-dag/
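As a sketch of why an ETL process (option a.) is a DAG, the hypothetical tasks below are nodes, their dependencies are directed edges, and a topological sort yields a valid run order precisely because there are no cycles (plain Python, 3.9+ for graphlib):
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL tasks; each key maps to the set of tasks it depends on.
etl_dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform_join": {"extract_orders", "extract_customers"},
    "load_warehouse": {"transform_join"},
}

# static_order() raises CycleError if the graph is not acyclic.
print(list(TopologicalSorter(etl_dag).static_order()))
# e.g. ['extract_orders', 'extract_customers', 'transform_join', 'load_warehouse']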
14. At what stage in the MapReduce process does the "shuffle" phase happen?
a. Before the map stage
b. After the map stage and before the reduce stage
c. After the reduce stage
d. None of the above
Answer: b. Source: page 643 of Cracking the Coding Interview.
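A toy, plain-Python word count (illustrative only, not how Hadoop itself is implemented) showing where the shuffle sits: map emits key-value pairs, the shuffle groups them by key after the map stage and before the reduce stage, and reduce aggregates each group:
from collections import defaultdict

lines = ["big data", "big hadoop", "data data"]

# Map stage: emit (word, 1) for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group mapped pairs by key (after map, before reduce).
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce stage: sum the values for each key.
reduced = {key: sum(values) for key, values in shuffled.items()}
print(reduced)  # {'big': 2, 'data': 3, 'hadoop': 1}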
15. How does Hadoop support high availability for your name node?
a. Via the secondary namenode
b. A standby namenode only in proprietary Hadoop versions
c. A standby namenode in open source or proprietary Hadoop versions
d. N/A. There is no native Hadoop support for highly available namenodes
Answer: c. See https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
For why a. is wrong, see http://hadooptutorial.info/secondary-namenode-in-hadoop/