For Storing Data for with Few Transactions and Little Loading with a Need for Compression, Would a Column-Oriented Database Be Better than a Row-Oriented Database?

Problem scenario
You are trying to select a database that is right for your needs. You will not be doing many transactions. The database does not need to support regular ETL or OLTP. You are going to have a large database, and you want it to compress to the smallest size possible. In some cases a column-oriented database better than a row-oriented database. What type of database should you use for this situation?

Solution
A columnar database. Here are the three reasons:

  1. A single disk-read to retrieve a row is achievable with smaller tables. With very large amounts of data however, some tables in row-oriented databases need to be partitioned. Columnar databases scale to large sizes more easily *.
  2. Columns have uniform data types; this is optimal for compression. Column-oriented databases leverage run-length encoding **. Run-length encoding is like deduplication conceptually. Run-length encoding will store a sequence (or series) of data, and such sequences can be stored by their count.
  3. As there are no ETL jobs (per the problem scenario), a columnar database is superior. ETL jobs involve adding rows. Adding rows is inefficient for column-oriented databases because you have to write to many different columns. OLTP activity is better supported by row-oriented databases ***. If you will not be doing many ETL jobs or online transactional processing, column-oriented databases have benefits that row-oriented databases do not have.

* https://www.predictiveanalyticstoday.com/top-wide-columnar-store-databases/

** https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/AnalyzingData/Optimizations/UsingRunLengthEncodingRLEToImproveQueryPerformance.htm

*** https://www.omnisci.com/technical-glossary/columnar-database


FFR

  • To learn more about run-length encoding, see https://stackabuse.com/run-length-encoding/
  • Low cardinality in columns (meaning that values were often repeated) would increase the propensity to choose a columnar database because of run-length encoding.
  • ClickHouse is a column-oriented database, and it is very fast for many use cases (taken from clickhouse.com) . The company that maintains it recently (in 2021) obtained a large amount of funding (taken from this external website). As of January of 2022, Alibaba, Bloomberg, Microsoft, and Tesla have adopted ClickHouse (according to this external page).

Leave a comment

Your email address will not be published. Required fields are marked *