What is Apache Parquet?

Question
What is Apache Parquet?

Answer
Apache Parquet is a columnar storage format for the Hadoop ecosystem.  The data in a given column is largely uniform (e.g., every value is a long string of characters, a single character, or an integer): a column repeats one type and format of data, whereas two cells in the same row may hold very dissimilar types of data.  Columnar data may be held in memory or on disk, just as relational rowset data is.  Because of this underlying uniformity, columns tend to be more compressible (in RAM or on disk) than rows.  If you want to learn more about columnar databases, see this link.
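To make the compressibility point concrete, here is a minimal sketch in Python. It is an illustration only: the table contents are made up, and the toy run-length encoder below is not Parquet's actual encoding. It simply counts how many runs of identical values appear when the same table is laid out row by row versus column by column.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical table with three columns: id, country, status.
rows = [
    (1, "FR", "active"),
    (2, "FR", "active"),
    (3, "FR", "active"),
    (4, "US", "active"),
    (5, "US", "inactive"),
    (6, "US", "inactive"),
]

# Row-oriented layout: values from different columns are interleaved.
row_layout = [value for row in rows for value in row]

# Column-oriented layout: each column's values are stored contiguously.
columns = [list(column) for column in zip(*rows)]
column_layout = [value for column in columns for value in column]

row_runs = run_length_encode(row_layout)
column_runs = run_length_encode(column_layout)

print("runs in row layout:   ", len(row_runs))
print("runs in column layout:", len(column_runs))
print("country column encoded:", run_length_encode(columns[1]))
```

In the row layout, neighboring values almost never match, so the encoder finds nothing to collapse. In the column layout, the repeated country and status values sit next to each other and collapse into a handful of runs. Real columnar formats rely on more sophisticated encodings, but the underlying idea is the same.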

The word parquet comes from French, where it originated as a diminutive (roughly, "a small compartment"). The open-source Apache Parquet project was indeed created by a French-speaking person. In English it is pronounced "par-kay" ("par" as in "par for the course" and "kay" like the "Cay" in "Cayman Islands").  In English, parquet can also refer to flooring made of long wooden strips (which physically resemble vertical columns).  If you want to learn more about Parquet, see this PDF file.

Literature about Apache Parquet (like the PDF linked above or Apache's website) refers to the Dremel paper.  What is the Dremel paper?  Based on citations in a Wired article, we believe that "the" Dremel paper is this one.  Dremel is not a person: "Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data."  (This quote is taken from the 10-page Dremel PDF linked above.)  If you want to install Apache Parquet, which is open source, see this article.
