DevOps ETL Quiz Answers
1. What is the devops tool for databases?
Answer: D. DBMaestro
2. What does mung mean?
Answer: "Mash until no good." It refers to a permanent operation of data wherein changes will make the data irretrievable. An example would be to remove the commas from a .csv file. With no substituted symbol, it will not be trivial to restore the file to its original location. Puppet error messages use the word "mung." Data cleansing in ETL operations can lead to files being unrestorable.
3. What does idempotent mean?
Answer: It is an adjective describing an action to have no cumulative affect. For example, running an idempotent operation one time will be the same as running it ten times. Sometimes ETL operations are performed multiple times. If the job has idempotency, no harm can be done in running it again. In the DevOps world, whether or not a configuration management tool operation is idempotent will determine if DevOps engineer must be mindful of inadvertent accumulation of multiple runs of the operation itself.
Appending a line to an /etc/fstab (a file system table file) may be desirable one time, but additional appends could be undesirable. Such an operation is not idempotent because the appending is cumulative. Enforcing a configuration in an /etc/hosts file (with erasure and substitution) may happen one million times with no negative consequences. Such an operation is idempotent because there is erasure.
4. What is the name of the process of actively preparing data for serialization (e.g., data that was not otherwise logically contiguous on disk for a buffer) called? This process may include modifying data from one programming language or interface so it is compatible with a different programming language or different interface.
a. Almquist variation
c. scrum transition
5. How is an imperative process different from a declarative process?
Answer: ETL processes are almost always imperative. Imperative processes can involve math and be procedurally intensive. An imperative ETL process would involve data being cleansed, organized, filtered, extracted, transformed and loaded into a database. Imperative configurations rely on how a given server is found and involve an order of execution. The starting point is relevant to the configuration process and programming with intermediate steps may be part of an imperative process.
Declarative processes usually avoid math and computation. Many configuration management tools have desired states. The end configuration is declared and idempotent operations make changes are made on destination servers. Declarative configurations do not ordinarily rely on how a given server is configured. A declarative process may map an object to its ultimate state. Some ETL solutions are advertised as being declarative; in 2016, it seems impossible that the mapping could truly be ETL (extract, transform, and load) and declarative.
6. What is a common tool that both ETL Developers and DevOps Engineers use?
In Linux, they use the crontab and Bash.
In Windows, they use Task Scheduler and PowerShell.
Jenkins can be a tool for DevOps Engineer and ETL Developers (scheduling and monitoring jobs). Jenkins is more commonly associated with Build Engineering or Release Engineering.
7. Which of the following can you not create an AWS Data Pipeline with?
a. AWS Management Console
b. AWS Command Line Interface
c. AWS SDKs
d. AWS APIs
e. None of the above
Answer: E. Taken from http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html
8. Mesos Clusters cannot work with both HDFS and Digital Ocean?
Answer: False. See this link for more info: https://www.digitalocean.com/customers/rockerbox/
9. Hadoop YARN cannot act as a scheduler for OpenShift?
Answer: False. See this link. http://hortonworks.com/blog/openshift-v3-kubernetes-docker-hadoop-yarn/
10. Which of the following Apache products can create ETL jobs?
Answer: B. See this link: http://hortonworks.com/apache/pig/
11. Which of the following is not an ETL product?
a. IBM InfoSphere Datastage
b. Oracle Warehouse Builder
c. Business Objects XI
d. SAS Enterprise ETL server
g. Apache Hadoop
h Talend Big Data Integration
Answer: E. Apache Stratos is not an ETL product. See this for more info about Hadoop:https://www.datanami.com/2014/09/01/five-steps-to-running-etl-on-hadoop-for-web-companies/
12. In Informatica are mapplets only able to be used once without logic?
Answer: No. "Mapplets are reusable objects with transformations and logic very similar to a traditional mapping." Taken from http://www.howtointegratedata.com/mapplet-in-informatica/ (as of 4/8/20, this link no longer works)
13. Which of the tools below are designed to aide ETL process testing and validating data warehouses themselves?
a. QuerySurge by Real-Time Technology Solutions
c. Apache Cassandra
d. Apache Stratos
e. ETL Validator by datagaps inc.
Answer: A. QuerySurge. To see more info, see this link:
14. What is an example of cooked data in the context of ETL/Devops?
a. Machine-corrupted data (e.g., from disk failure)
b. Content that was corrupted maliciously
c. Cleansed data
d. Intentionally masked data (to hide identities)
15. What is the technique that divides a table of a database into different subcomponents, such as partitioning columns, to improve read and write performance?
a. data marting
b. impedance matching
16. What tool allows you to designate when Docker containers process ETL jobs without manual configuration?
Answer: B. For more information, see http://www.midvision.com/blog/5-must-see-docker-big-data-use-cases-that-show-dockers-processing-power
17. Which of the following can readily be used as a superior ETL platform?
d. Note Beak
Answer: B. The source that had more information was http://devops.sys-con.com/node/2079447 (but as of 4/8/20 it no longer works).
18. There is consensus that small companies should use Informatica or a supported, proprietary ETL tool as opposed to an in-house developed tool.
Answer: False. For more information, see https://oraclesponge.wordpress.com/2006/12/20/a-list-ten-reasons-not-to-use-an-external-etl-tool/
19. Which of the following has an open source version:
a. Talend Integration Suite
b. Pentaho Kettle Enterprise
d. All of the above
e. None of the above
20. What is a data lake?
a) A synonym of data warehouse
b) A buffer of streamed data
c) An archive of metadata about previous real-time data streams
d) A pool of unstructured data
21. What is a data swamp?
a. A dense data lake
b. A severely degraded data lake
c. A synonym of a data warehouse
d. A pool of unstructured data
e. An archive of metadata about previous real-time data streams
Answer: B. See this link for more information.
22. Snappy is the name of which two concepts?
a. The REST API for SnapChat
b. A data compression and decompression library with bindings for several languages
c. A Linux package management system
d. An automation scheduler for Informatica
e. An open source component to migrate SSIS packages to PostgreSQL
Answer: B and C. For B, this link discusses python-snappy (something related to [Python] development). For B, this link delves into compression and decompression for the same tool which could pertain to an ETL task. For C, this link explains how systems administrators could use a different tool called Snappy (not related to the one in B).
23. In a SQL database you have a left table with four rows and a right table with seven rows, what is the highest number of rows that can be returned with an inner join?
d. More than 11
Answer: D. If the join is on a keys that do not have a unique constraint, the inner join could produce more than 11 rows. You can try this as a demonstration.
24. Which of the following provide Sqoop based connectors (choose all that apply)?
b. Talend Open Studio
c. Informatica (modern versions)
Answer: A, B, C, D
To see the reason behind each option, click on the link corresponding to the letters:
25. What is a continuous application?
a. The namesake of CA traded on the Nasdaq as CA
b. An application that encompasses data streaming (e.g., ETL processes) from start to finish that adapts itself to the data stream(s) in real-time
c. An application that leverages ETL processing
d. An application receiving continuous integration (or continual integration)
e. An application receiving continous delivery (or continual delivery)
f. An application receiving continuous deployments (or continual deployments)
g. An application that is always available through fault tolerance and load balancing
Answer: B. See this link for more information.
26. DevOps expert Gene Kim got his start with a security product called Tripwire, known for its emphasis on changes to files. There is a tool that keeps track of changes to a database. Which product below concerns itself with tracking changes of database schemas?
Answer: D. To learn more, see this external site.
27. Which product enables you to quickly make copies of SQL Server databases for your Test, QA or development environments? Choose the most accurate answer.
a. Canonical's Juju
b. RedGate's SQL Provision
c. Apache Hamster
d. Apache Numa
Answer: B. To read more, see this external site.
28. The SQL Server database back ups are not working or you get false positives that your back up solution is successfully backing them up. What solution should you for a practical back up solution?
a. Write you own PowerShell script that backs up the database
b. Implement AlwaysOn Availability Groups
c. Implement RedGate's Toolbelt
d. Implement Apache Impala
Answer: C. You might not want to implement live clustering depending on your needs and budget. Plus for backups you want a manageable solution for the files to put on to tape drive or to store in Amazon Glacier. So we believe the canonical answer to this question is C. You could write your own PowerShell scripts, but this would not be supported; it could be very involved to make sure it was reliable. To read more about RedGate's Toolbelt, see this page.