How Do You Optimize Data Warehouse Performance?

Problem scenario
You have a data warehouse that has not been performing as well as you would like.  What are some things that you can do to optimize the performance?

1.  Schedule processes such as back ups or ETL jobs at different times from hours of peak demand.  Rebuilding indexes can be a good idea or a bad one depending on when it is performed.  Be mindful of when statistics (meta data related to database transactions) are updated.  Transactions to a database involve locks.  In memory constrained situations, granular cells in tables cannot be locked so an "lock escalation" happens.  This involves more rows or even entire tables being locked and inoperable by other processes.  For a primer on avoiding database lock contention, see this external article.

It may make sense to have a time window for certain batch processing to take place.  Ad hoc tasks can happen inside or outside certain windows.

2.  Use indexes that are neither too complex nor too simple. Some indexes in an effort to make them "covered" (hold each row that a given query needs) become too complex for the associated queries.  Queries are not as fast as they otherwise would be.  The indexes should be made for the most demanding (e.g., frequent) queries.  This may mean creating a composite key for the uniqueness of the key.  If the indexes are general purpose, ensure the key is on a unique row.  Data warehouses are usually not subject to regular OLTP activity.  So they are good candidates for indexes.

3.  Ensure that where clauses filter out unnecessary rows.  Some analytical processing does not need to fetch any more rows than a few.  This is obvious, but before investing large amounts of money in optimization, it is worth mentioning.  There can be greater atomicity between queries when greater filtering is performed.

4.  Model the data in a way that makes sense.  Redesigning table schemas is not a small task.  Remember that for data warehouses, depending on how much analysis and ETL activity there is, you should normalize and denormalize as appropriate.  Some tables perform better with reads if they are denormalized.  OLAP presentations can do well with unnecessary tables (e.g., those that are redundant) that are denormalized.  Analysis can use the entire table.  This can make analytical development easier too.  But creating tables with redundant data can add to the disk space.   Normalization is useful if there are many different types of analysis.  OLAP queries may not cause disk contention if they draw data from different tables that are mutually exclusive.  Avoiding concurrency problems is key.  Some data warehouses span multiple disks and thus highly normalized tables are more likely to quickly support independent queries simultaneously.

5.  Use data types that are appropriate and no larger than they need to be.  While some data types or covariant (compatible) with another or contravariant (not compatible), for performance reasons, it is best to be consistent with data types for evaluation and comparison purposes.  "Mixing different data types can cause performance problems, and even if it doesn’t, implicit type conversions during comparisons can create hard-to-find errors. " taken from this external site.

SQL Server 2008 introduced a data type called datetime2.  This was strictly smaller on the disk compared to the previous data type called datatype.  Surprisingly datetime2 could hold greater accuracy of subseconds.  Some database tables use suboptimal data types for the columns because they started as a proof-of-concept or prototype.  Analyze the data types and the business' requirements to accommodate no more than what is needed.  In other words for choosing a column's data type support the longest string that is required but no longer than that.  Find the most simple as possible numeric data type for the business' required level of precision.  Reads and writes to such a column will be faster than if you chose a data type that supported many decimals and negative values when you only needed positive integers.  This practice of choosing smaller (few characters in a string) and simpler data types will lead to the most efficient operations down the road.

6.  Use the NOT NULL constraint on columns when you can.  "It’s harder for MySQL to optimize queries that refer to nullable columns," taken from this external site.

7.  Adjust and possibly reduce the logging*. Some transactions are logged in such detail that the transactions slow things down.  Tools like Change Data Capture can be great, but not for sheer speed.  Transaction logging is more complex than "either a SQL statement is logged or not."  TRUNCATE is not logged the same was as DELETE.  You can eliminate all the rows in a table quickly with the TRUNCATE command.  To learn about the differences between DELETE and TRUNCATE as people find TRUNCATE to be faster, see this external posting

DDL commands can be logged.  You may want to determine if this is necessary.  We recommend  keeping logging settings the same in all environments (from development to production) because parity in the development pipeline is important.

8.  Reconfigure the underlying storage RAID that supports the data warehouse.  Sometimes  redundancy is not necessary given the transaction logging*.  Disk controllers and disk writing algorithms can be bottlenecks for importing data in the warehouse.

9.  Convert the underlying file system to something that is more minimalistic and fast.  Journaled file systems do their own transaction logging.  This is not necessary if the transaction logging the database does is on a different hard disk from the actual database or data warehouse itself.  The closer to bare metal you place the data warehouse on, the greater speed the I/O activity will be.

10.  Leverage views.  These can provide all the rows that a given query or group of queries will need.  Views exist in memory, so they are fast.  This can free up the disk to serve other I/O activity.

11.  Consider upgrading the database engine.  As new versions of databases are released, the database engines can be perform much faster.

12.  Consider using data compression.  This can enhance reads and writes.  It may be appropriate for a data warehouse.  Not every database can supports compression.  This can also reduce the footprint on the disk too.

13.  Real-time replication of the data warehouse may be unnecessary.  Regularly scheduled back ups may be enough.  Sometimes data warehouses are an after-thought and combined with OLTP databases on a given instance.  Depending on your business' needs, replicating other databases may make sense but the data warehouse could be spared.  On the other hand, you want want to balance a load by having a highly-available data warehouse solution in different geographic locations.

14.  Rewrite SQL queries and remember big O notation.  Some SQL statements use O (n^2) computational complexity.  If you do not know what this is, do two things: One, make sure your SQL statements to not do more comparisons and intermediate processing than absolutely necessary.  Two, consider worst case scenarios where large numbers of rows are returned and then compared.  You may need to rewrite logic to reduce the number of rows being compared or manipulated.  Some I.T. professionals prefer to process data with a high level programming language and use databases and data warehouses as basic data store.  If your SQL statements do the processing themselves, you may want to analyze your SQL code carefully.  It may make sense to use Java or Python for the data processing instead of SQL.  If you must stick with rewriting SQL statements, which can be faster than using a traditional, high level programming language, you may want to read this external guide to identifying which SQL statements you will target in your optimization process.

15.  Can data warehouse operations be performed in-memory?  In-memory databases are becoming more common.  Some public clouds have RAM that is available at a low price.  Memory input and output is faster than disk input and output.

16.  Most databases support various types of prioritization.  Designations called "degree of parallelism" specify a dedicated number of CPUs for a process.  Database engines can take hints which can remove a SQL statement from the native optimizer leading to undesirable results.  This is related to scheduling of back ups and reindexing.  But it is a different process altogether.
The trade-offs in prioritizing can be more clear than other optimization techniques.  Keep a global perspective if you are not the only administrator.

17.  Be aware of what is cached and what is not cached in a database.  See this external link about Postgres caching if you have time.  The cache behavior will vary from database system to database system.  Manually manipulating what is cached can help you.  But keep a global perspective as some processes automatically cached and predicted by increasingly sophisticated tools with every new generation of database version.  

*** It is highly recommended to analyze the queries and use other methods above before making hardware upgrades.  ***

18.  Optimize your network.  Using multiple NICs and multiple network paths (through cabling and routers) for the data warehouse server.  This can ensure that there is no TCP/IP bottleneck.  TCP/IP network collisions degrade the performance of a network.  Spikes in bandwidth attributable to business activity or DDoS attacks can leave you susceptible to reduced performance.  You may want to analyze your entire network for optimal performance and reliability.

19.a.  Add more RAM or use faster RAM (with a higher bus speed).  Available memory has a way of getting consumed.  This can enable you to use in-memory databases or keep operations from paging or swapping (writing to disk via virtual memory).  
19.b. In some cases you have enough disk space to warrant increasing the virtual memory at your disposal. To increase the swap space of the operating system, see this link.

20.  Upgrade the CPUs to have more cores, faster speeds (higher GHz), larger caches and faster caches (e.g., L2 instead of L3). If you can afford the higher costs, click here if your data warehouse is running on a server in AWS or click here if the server is in GCP.

21.  Migrate to a faster storage device (e.g., some SANs and NASes perform better than others).

22. From a pragmatic perspective, many SQL databases include data warehouses. If you want advice on optimizing a SQL database, see this posting.

*  SQL Server has Transaction Logs, Oracle Databases have Redo Logs, Postgres has Write-Ahead-Logs, and MariaDB has binary logs.  To learn more about transaction logging, see this Wikipedia article.

Leave a comment

Your email address will not be published. Required fields are marked *