Problem scenario
The primary purposes of logging include troubleshooting (root cause analysis of poor performance, debugging unintended behavior, or resolving catastrophic failures). In some cases logging is used for monitoring resource utilization and planning changes. What are the patterns or characteristics of a good logging system (consistent with what some may phrase as "best practices")?
Solution
Here are 13 traits of good logging.
1. Use appropriate file systems and tune them based on your needs. You will want to use a journaled file system; ext4 is the canonical best choice.*
How you adjust the journaling can depend on a variety of factors. For a high-performance OLTP database, the transactions will likely be logged in the database itself, so verbose file system journaling is probably unnecessary if the server will not be tuned or its data is being backed up. Sometimes journaling a file system can help determine the root cause of server crashes.
ext4 is an ideal file system for databases because it handles big files well and uses extents rather than block mapping (according to Red Hat's website). A relational database will often be one big file on a hard disk.
With JSON files, XML files, and key-value pairs becoming enormously popular, you may want to adjust the journaling so every change to a file is tracked. If there is a failure, there may then be some record of the intent of the change to the file.
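As a small, Linux-specific sketch of checking how journaling is currently configured, the snippet below parses /proc/mounts and reports the mount options for each ext4 file system (the function name is illustrative, not from any library):

```python
# Hypothetical helper: report the mount options (including any data=journal,
# data=ordered, or data=writeback setting) for each mounted ext4 file system
# by parsing /proc/mounts. Linux-specific.
def ext4_mount_options(mounts_file="/proc/mounts"):
    results = {}
    with open(mounts_file) as f:
        for line in f:
            _device, mount_point, fs_type, options = line.split()[:4]
            if fs_type == "ext4":
                # If no data= option appears, the file system uses the
                # compiled-in default (ordered mode on most distributions).
                results[mount_point] = options.split(",")
    return results

if __name__ == "__main__":
    for mount_point, options in ext4_mount_options().items():
        print(mount_point, options)
```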
2. Use NTP. (Subsecond timestamps are a feature of ext4.) If the clocks of different servers disagree, correlating events can be impossible. Good logs on different servers are logs that can be temporally correlated.
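One way to make logs correlatable across NTP-synchronized servers is to timestamp every entry in UTC with subsecond precision. Below is a minimal sketch using Python's standard logging module (the logger name and format are illustrative):

```python
import logging
import time

# Log in UTC with millisecond, ISO-8601-style timestamps so entries from
# different servers (all synchronized via NTP) can be lined up by time.
logging.Formatter.converter = time.gmtime  # use UTC rather than local time
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    fmt="%(asctime)s.%(msecs)03dZ %(levelname)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S"))
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("an event that can be correlated across servers")
```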
3. Use disk redundancy for log files. Ideally logs will be on two different disks. With public cloud offerings, logs should be kept in different data centers. Everything from separate data centers to separate disks on one server can be useful; this is a cost-benefit decision that should be thought out. Unfortunately some theoretical aspects of RAID do not always apply in practice. We have found that RAID 5 can be appropriate for frequent but small I/O activity -- despite the theoretical overhead of striping across several disks. (RAID 5 is widely accepted for its performance benefit when reading and writing large files.) But some fundamentals of rapid, small reads and writes to disks still apply. If you have the resources, test different RAID configurations for performance (a rough timing sketch follows). You do not want to create a performance bottleneck because the logging process is too slow.
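For such a test, the sketch below times many small, synchronous appends -- the typical logging pattern -- against a file on the volume under evaluation. The path is a hypothetical placeholder; dedicated tools such as fio are more thorough, but this illustrates the idea:

```python
import os
import time

TARGET = "/tmp/log_io_test.log"  # hypothetical path; point at the RAID volume under test

def time_small_writes(path, count=1000, size=256):
    """Return synchronous small-append throughput in writes per second."""
    payload = b"x" * size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    start = time.perf_counter()
    for _ in range(count):
        os.write(fd, payload)
        os.fsync(fd)  # force each write to disk, like a strict logger would
    elapsed = time.perf_counter() - start
    os.close(fd)
    return count / elapsed

print(f"{time_small_writes(TARGET):.0f} synchronous writes/second")
```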
4. Ideally logs will be handled as streams of events; this is one of the twelve factors of The Twelve-Factor App. Remotely storing the logs (e.g., asynchronously) is ideal for postmortems and disaster recovery. Loggly.com recommends writing logs asynchronously.
Asynchronous logging (without the constraint of an intent log) can enable faster performance of a given operation. Operations bound by intent logs will be hindered, but whether to accept that tradeoff is something only a specific professional or team can decide. Correlating the event data on an ad hoc basis can help your organization diagnose and correct problems as well as tune your system for performance.
Databases, middleware, sidecar containers, and Apache streaming tools (such as Flink) can all help process logs as streams of events. If someone is developing a new application, the designer may consider designing the log data to be written to stdout in JSON format (a minimal sketch follows). Support for the serial processing of JSON as an interchange format appears to be growing in popularity. To read about how The Twelve-Factor App recommends logs be written, view this page. Loggly recommends using the JSON format. If you do not choose JSON, you may want to make the logs readable as delimited text so AWK and grep can rapidly parse them.
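Here is a minimal sketch of that recommendation in Python: one JSON object per line, written to stdout for a downstream log router or stream processor to collect. The logger name and chosen fields are illustrative:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # Twelve-Factor style: stdout only
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order placed")  # emits one JSON object per line
```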
Handling logs as a stream lends itself to redundancy (as the source of the logs can be piped to two different servers). Streams of events can enable statistical operations for anomaly detection (see the sketch below). Sophisticated triggering can help a system automatically scale to address an increased workload or protect the underlying data by denying service. Apache Metron leverages telemetry data from network traffic sensors to enable responses powered by big data and AI.
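As an illustrative sketch only (not how Metron or any particular tool works), the snippet below flags a minute's event count as anomalous when it falls more than three standard deviations from the trailing window's mean:

```python
from collections import deque
from statistics import mean, stdev

def detect_anomalies(counts_per_minute, window=30, threshold=3.0):
    """Return (minute, count) pairs that deviate sharply from recent history."""
    history = deque(maxlen=window)
    anomalies = []
    for minute, count in enumerate(counts_per_minute):
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(count - mu) > threshold * sigma:
                anomalies.append((minute, count))
        history.append(count)
    return anomalies

# Example: a steady rate of ~100 events/minute with one sudden spike.
print(detect_anomalies([100, 102, 98, 101, 99, 500, 100, 97]))  # [(5, 500)]
```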
There can be a blurry line between logging and monitoring. The Twelve-Factor App mentions active alerting coming from "user-defined heuristics" when it describes how applications should log. To learn about event stream processing, you may want to read this posting.
5. The verbosity of the logs should be adjusted to balance your needs for diagnostics against performance, your monetary budget, and disk space constraints. Logs that contain rich detail may need to be rotated more frequently. Finding the proper verbosity and log rotation settings means balancing your budget for disk space against your need for archived data to remain readily accessible; the sketch below shows one way to cap both.
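With Python's standard library, for example, a size-based rotating handler bounds total disk usage while the level controls verbosity (file name and limits below are illustrative):

```python
import logging
from logging.handlers import RotatingFileHandler

# Rotate at 10 MB and keep five archives: the active file plus its archives
# will stay near 60 MB total, whatever the application does.
handler = RotatingFileHandler("app.log", maxBytes=10_000_000, backupCount=5)
handler.setLevel(logging.INFO)  # raise to WARNING to reduce verbosity
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("this log's disk footprint is bounded by rotation settings")
```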
If you need to meet specific SLOs, you will be concerned with the details of SLIs, so you will want detailed logs. According to Google Cloud's blog, "You will also need a precise measurement of compliance, usually from logs analysis."
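A hypothetical sketch of such logs analysis: given access-log lines of the form "<timestamp> <status> <latency_ms>" (an assumed format, not a standard one), compute the fraction of requests that were both successful and under a latency threshold:

```python
def sli_from_log(lines, latency_slo_ms=300):
    """Return the fraction of requests that succeeded within the SLO."""
    good = total = 0
    for line in lines:
        _timestamp, status, latency_ms = line.split()
        total += 1
        if status.startswith("2") and float(latency_ms) <= latency_slo_ms:
            good += 1
    return good / total if total else 1.0

sample = ["t1 200 120", "t2 200 450", "t3 500 80", "t4 200 90"]
print(f"SLI: {sli_from_log(sample):.2%}")  # 2 of 4 requests were good -> 50.00%
```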
6. Characteristics of good logging include compressing some of the logs and using the correct device to physically store the log files. Compressing the logs can make them impossible to index, but it can save your budget for storage capacity. You may want to use a NAS or SAN to store the logs; these devices are usually commodious and ideal for persistence because logs are not accessed as frequently as regular application data. If you use the public cloud, Amazon Web Services S3, Azure Blob Storage, or GCP's Cloud Storage are other options for storing the logs. Whichever solution you choose to store the log files can normally be configured with disk or some other form of redundancy.
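A minimal sketch of the compression step using Python's standard library (the rotated-log file name is hypothetical):

```python
import gzip
import shutil

# Gzip a rotated log file. Note the compressed copy cannot be searched or
# indexed until it is decompressed (or read with zcat/zgrep-style tools).
def compress_log(path):
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)

compress_log("app.log.1")  # hypothetical rotated-log name
```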
7. High-quality logging solutions give professionals powerful tools for manipulating the logs. Consider using CloudWatch, Splunk, Fluentd, rsyslog, Elastic Stack (with Logstash and Kibana), Sentry**, or vRealize Log Insight for centralized logging. These are popular in the modern I.T. industry (as of 2019). Splunk and Elastic Stack, for example, allow you to index the logs to learn more about common events. Good software tools related to logging, such as indexers and dashboards, can help you search the logs, make sense of them, correct problems, or introduce performance enhancements. Good logging is reliable, scalable, and maintainable; these non-functional requirements are described in this posting.
8. Consider implementing GitOps. GitOps is the practice of driving operations through Git's file versioning capabilities. Because every change is committed, you necessarily create a log as you issue discrete commands, and by looking at older versions of individual files you can see every command that was issued (a sketch follows). Amazon's five pillars of architecture recommend that you perform operations as code. (The source of this last sentence is page 6 of this PDF, which is labeled page 2.)
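Assuming git is installed and the file lives in a repository, the sketch below prints every change ever made to one operations file, with diffs, so an auditor can reconstruct what was done (the file name is hypothetical):

```python
import subprocess

def file_history(path):
    """Return the full change history of one file, diffs included."""
    return subprocess.run(
        ["git", "log", "--follow", "-p", "--", path],
        capture_output=True, text=True, check=True).stdout

print(file_history("deploy/settings.yaml"))  # hypothetical file name
```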
9. When users use PuTTY, make it a company-wide business policy to configure logging. Open PuTTY. On the left, expand "Session" and click "Logging." On the right, click "All session output" or some option besides "None." Then enter a destination for the log in the "Log file name" field.
10. Large businesses have endured tarnished reputations as the result of hackings. Europe's General Data Protection Regulation punishes those responsible for the release of private data. Therefore we recommend that good logging involve removing sensitive data early on. Some logging mechanisms can identify sensitive data (e.g., users' passwords or social security numbers). The sooner this data is obfuscated in the logs, the greater the chance you will prevent a catastrophe: the logs may be transmitted elsewhere or archived for a long period of time, and it can be easy to forget how verbose the logging was. To learn about how Jenkins filters data before it is logged, see this posting.
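Below is an illustrative sketch of obfuscating at the earliest point, using a Python logging filter. The two regular expressions are simplistic examples of sensitive-data patterns, not a complete catalog:

```python
import logging
import re

class RedactingFilter(logging.Filter):
    """Mask SSN-like numbers and password assignments before records are written."""
    PATTERNS = [
        (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "***-**-****"),
        (re.compile(r"(password\s*=\s*)\S+", re.IGNORECASE), r"\1[REDACTED]"),
    ]

    def filter(self, record):
        message = record.getMessage()
        for pattern, replacement in self.PATTERNS:
            message = pattern.sub(replacement, message)
        record.msg, record.args = message, None  # replace with redacted text
        return True

logger = logging.getLogger("app")
logger.addHandler(logging.StreamHandler())
logger.addFilter(RedactingFilter())  # runs before any handler sees the record
logger.setLevel(logging.INFO)
logger.warning("login failed for ssn 123-45-6789 password=hunter2")
# -> login failed for ssn ***-**-**** password=[REDACTED]
```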
11. Test production-style logging in lower environments, along with your development changes and upcoming infrastructure patches. Logging can create disk I/O, CPU, or network loads that may be detrimental to performance. As you make changes, remember that one factor in The Twelve-Factor App is dev/prod parity. Good logging is tested and approved by modern quality assurance standards. I/O can be constrained by intent log restrictions. Ideally production loads should not differ from those in lower environments.
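One rough way to quantify logging's cost in a lower environment is to time a burst of log writes before relying on that configuration in production; a sketch (file name and volume are illustrative):

```python
import logging
import time

def measure_overhead(n=100_000):
    """Return the wall-clock seconds consumed by n log writes."""
    logging.basicConfig(filename="loadtest.log", level=logging.INFO)
    start = time.perf_counter()
    for i in range(n):
        logging.info("simulated event %d", i)
    return time.perf_counter() - start

print(f"{measure_overhead():.2f} seconds for 100,000 log writes")
```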
12. Increase observability to the extent that you do not increase your attack surface, and log the observable features. This adds traceability to your logging solution for post-mortems. For security investigations it is advantageous to know which account performed which actions. Restrict traceability and observability only as far as necessary to maintain a hardened environment. The influential book Continuous Delivery (on page 320) says that auditability is very important for an application.
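As a sketch of making actions traceable to accounts, the snippet below stamps every record with the acting user and a request id via Python's LoggerAdapter; the field names and values are illustrative:

```python
import logging

# Every record emitted through this adapter carries the account and request
# id, so post-mortems can trace which account performed which action.
logging.basicConfig(format="%(asctime)s %(user)s %(request_id)s %(message)s",
                    level=logging.INFO)
logger = logging.LoggerAdapter(logging.getLogger("audit"),
                               {"user": "alice", "request_id": "req-42"})
logger.info("deleted snapshot snap-123")
```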
13. This is not a recommendation. This is just an observation. There is no out-of-the-box solution that will be perfect for your logging needs. Every environment is different, and every business' goals are different. You will have to decide for your business what needs to be logged and how it should be logged based on your budget and performance goals.
*To read articles that explain why ext4 is a good all-around file system, see these hyperlinked articles at howtogeek.com or lifehacker.com.
**Packt Publishing (on page 271 of the 3rd Edition of Expert Python Programming) says that Sentry is well-suited for "…tracking exceptions and collecting crash reports." There is a newer edition of this book, however.