Question
AWS's Well-Architected Framework recommends monitoring as a component of three of its five pillars (operational excellence, reliability, and performance efficiency).
There is no question that monitoring is important inside or outside of a public cloud. What patterns or traits might a good centralized monitoring system have (in other words, what are the "best practices")?
Answer
Here are 16 characteristics of good monitoring.
1. Alert thresholds for various metrics should follow a Goldilocks approach rather than a cry-wolf approach. You do not want to be notified of every small disturbance. Normally, disk utilization of 50% is not a concern, but if disk space on a server has reached 90% utilization, you may want to begin to take action. In one study of I.T. professionals, 79% of those surveyed said that reducing false alarms in monitoring was a challenge.
You must define or ascertain SLIs (service level indicators) and SLOs (service level objectives). The SLIs are the metrics themselves; the SLOs are closely related to the thresholds you will be notified about. Alerts should fire at quantifiable levels before an SLO is breached so you have time to react.
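As a minimal sketch of the Goldilocks idea, using the 50% and 90% figures from the example above (the thresholds and the mount point are assumptions to tune for your environment):

```python
import shutil

# Hypothetical warning/critical thresholds mirroring the example above.
WARN_PCT = 50.0   # below this, stay quiet
CRIT_PCT = 90.0   # above this, page someone

def disk_used_percent(path="/"):
    """Return the percentage of disk space used for the given mount point."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

def evaluate(path="/"):
    used = disk_used_percent(path)
    if used >= CRIT_PCT:
        return "critical"   # SLO at risk: act now
    if used >= WARN_PCT:
        return "warning"    # worth watching, not worth waking anyone up
    return "ok"

if __name__ == "__main__":
    print(evaluate("/"))
```

The gap between the warning level and the critical level is what buys you time to react before the SLO is actually breached.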
2. Ensure redundancy for the monitoring solution. A single point of failure can be a problem. If there is only one centralized monitoring server, you will want it to have two NICs, two power supplies connected to different UPSes, two CPUs, two RAM chips, and a RAID array to keep the server up. If the monitoring system itself goes down, will you be notified of that? If not, the absence of alerts may give you false confidence that everything is healthy.
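One common way to answer that question is a dead man's switch: a small watchdog running somewhere independent of the monitoring server that expects a periodic heartbeat and raises an alarm when the heartbeat stops. A minimal sketch, where the heartbeat file path and the tolerated silence are assumptions:

```python
import os
import time

HEARTBEAT_FILE = "/var/run/monitoring-heartbeat"  # assumed path the monitoring server touches
MAX_SILENCE_SECONDS = 300                         # assumed tolerance before alarming

def monitoring_is_alive():
    """Return True if the monitoring server has checked in recently."""
    try:
        age = time.time() - os.path.getmtime(HEARTBEAT_FILE)
    except OSError:
        return False  # no heartbeat file at all
    return age < MAX_SILENCE_SECONDS

if __name__ == "__main__":
    if not monitoring_is_alive():
        # In practice, send this through a channel that does not depend
        # on the monitoring server (e.g., a direct email or SMS gateway).
        print("ALERT: monitoring server has not sent a heartbeat recently")
```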
3. Test the monitoring solution itself. From time to time you should manually create an artificial event to verify that emails/pages/text messages are actually being sent (one harmless way to do this is sketched after the next paragraph). To generate large amounts of web traffic, we recommend Gatling. To generate a CPU or RAM load on a server, we recommend using Bash. While some people prefer Ruby or Python for infrastructure automation, those languages include tools to keep a program from consuming too many resources, so higher-level languages are not as well equipped as Bash to generate an artificial load.
Chaos engineering is the practice of robustly testing resilience and high availability. The term comes from a tool Netflix designed and used called Chaos Monkey. This program deliberately terminates random servers -- in production -- on an ongoing basis. This tests the monitoring and alerting and it keeps people on their toes. The professionals do not necessarily know whether a problem arose from Chaos Monkey or from ordinary operations. Management could also manually trigger some chaos to see if protocols are followed correctly.
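As one harmless way to create the artificial event mentioned above, you could temporarily push disk utilization past its warning threshold and confirm that the page or email actually arrives. A minimal sketch, where the file size, directory, and hold time are assumptions to adjust to your own thresholds:

```python
import os
import tempfile
import time

FILL_BYTES = 5 * 1024**3   # assumed: enough throwaway data to cross the warning threshold
HOLD_SECONDS = 600         # assumed: long enough for at least one check cycle

def trigger_disk_alert(directory="/tmp"):
    """Create a large temporary file, wait for the alert, then clean up."""
    chunk = b"\0" * (1024 * 1024)   # write in 1 MiB chunks to avoid huge memory use
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as f:
        for _ in range(FILL_BYTES // len(chunk)):
            f.write(chunk)
        path = f.name
    try:
        time.sleep(HOLD_SECONDS)    # confirm the page/email/text actually arrives
    finally:
        os.remove(path)

if __name__ == "__main__":
    trigger_disk_alert()
```

After the file is removed, confirm that the alert clears as well; recovery notifications deserve testing too.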
4. Have a method of deploying new servers that ensures they will be monitored from the beginning. When you deploy a new server, it must receive configuration (e.g., an agent installed and configured) so the monitoring system will work with it, and certain firewall rules must allow for this communication. Once the agent is installed and configured, it will alert you to monitored events as they happen. With public clouds this is less of a concern because alerts often come from PaaS offerings.
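A simple post-deployment smoke test can confirm that the firewall rules actually allow the new server's agent to reach the central monitoring server. A minimal sketch, where the host name and port are hypothetical placeholders for your own monitoring endpoint:

```python
import socket

MONITORING_HOST = "monitoring.example.internal"  # hypothetical central server
MONITORING_PORT = 10051                          # hypothetical agent port

def can_reach_monitoring_server(timeout=5):
    """Return True if a TCP connection to the monitoring server succeeds."""
    try:
        with socket.create_connection((MONITORING_HOST, MONITORING_PORT), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    if not can_reach_monitoring_server():
        raise SystemExit("New server cannot reach the monitoring server; check firewall rules")
```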
5. Consider performance with your frequent checks. Some automated SQL commands can lock tables or otherwise generate a load on the database server. The client agent itself can use RAM and CPU on the monitored server, depending on what it is checking for. Network communication between the client and server can contribute to congestion depending on how much is sent and how frequently it is sent. Setting sensible check intervals and retrieving only the minimum data necessary can help the servers perform well while still being monitored carefully. If monitoring involves reading logs from disk, consider placing those logs on a dedicated disk, since they may be written to frequently; where the logs are stored can prevent (or create) a performance bottleneck.
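A minimal sketch of a lightweight check loop, assuming a DB-API style database connection object; the interval and the query are assumptions, and the point is to poll infrequently and retrieve only what you need:

```python
import time

CHECK_INTERVAL_SECONDS = 300   # assumed: every five minutes is often enough

def lightweight_check(connection):
    """Ask the database a cheap question instead of a table-locking one."""
    cursor = connection.cursor()
    try:
        cursor.execute("SELECT 1")   # connectivity only; no table scans, no locks
        return cursor.fetchone() is not None
    finally:
        cursor.close()

def poll_forever(connection):
    while True:
        healthy = lightweight_check(connection)
        # Ship a single boolean per cycle rather than a bulky payload,
        # so the check itself does not add to network congestion.
        print("db_healthy=%s" % healthy)
        time.sleep(CHECK_INTERVAL_SECONDS)
```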
6. There should be multiple ways of viewing the monitoring system. You will want a website that the professionals can go to for visual checks; a dashboard can use colors to give an overview of the status of the critical systems. There should be a way to notify individuals via phone call or text message. It can also be desirable to place a physical flat-panel display in a common area so that employees and management can see the health of the systems they are responsible for. According to The DevOps Handbook, public telemetry helps create institutional learning and a productive atmosphere (page 203).
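For the phone or text channel, many teams route a short message through an ordinary mailbox or an email-to-SMS gateway. A minimal sketch using Python's standard library, where the SMTP relay and the addresses are hypothetical:

```python
import smtplib
from email.message import EmailMessage

SMTP_HOST = "smtp.example.internal"          # hypothetical mail relay
ALERT_FROM = "monitoring@example.internal"
ALERT_TO = "oncall@example.internal"         # could be an email-to-SMS address

def send_alert(subject, body):
    """Send a short alert message through the local mail relay."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content(body)
    with smtplib.SMTP(SMTP_HOST) as server:
        server.send_message(msg)

if __name__ == "__main__":
    send_alert("Disk critical on web01", "Disk utilization reached 90% on /var")
```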
7. Use monitoring in development and QA environments. The overhead and dependencies should be similar across environments for testing purposes anyway. Parity between lower and upper environments is one of the factors of the twelve-factor app methodology.
8. Document procedures for when problems occur and methods for turning off alerts. (While we technical bloggers are biased toward favoring documentation, we are not alone. You can read more about this here.)
Even the Agile Manifesto, which prioritizes working software, acknowledges that there is value in comprehensive documentation. Having human operating procedures in place helps large and growing teams know what to expect. AWS's Well-Architected Framework recommends that you annotate documentation. Outsourcing and ambitious automation are also easier with good documentation.
9. Ensure that the group responsible for deploying infrastructure packages and the group responsible for the CI/CD pipeline can both leverage the monitoring solution. For security and practical productivity reasons, using one centralized monitoring tool for both infrastructure and code deployment can be advantageous. Some special work may go into configuring the monitoring of a build and release pipeline, but it is highly recommended.
10. Network latency can degrade SaaS performance, and packet collisions and retransmissions can degrade network performance significantly. Tuning and monitoring the network (e.g., with Cacti) can be one part of monitoring your servers. Some events are triggered by network congestion, and this congestion can happen independently of a server's OS, applications or databases.
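As a small complement to a dedicated network tool such as Cacti, you can time a TCP connection from the monitored host to spot rising latency. A minimal sketch, where the target service, port, and latency budget are assumptions:

```python
import socket
import time

TARGET = ("db.example.internal", 5432)   # hypothetical service to probe
LATENCY_BUDGET_MS = 50                   # assumed threshold worth alerting on

def tcp_connect_latency_ms(address, timeout=3):
    """Measure how long a TCP handshake takes, in milliseconds."""
    start = time.monotonic()
    with socket.create_connection(address, timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000

if __name__ == "__main__":
    latency = tcp_connect_latency_ms(TARGET)
    if latency > LATENCY_BUDGET_MS:
        print(f"WARNING: connect latency {latency:.1f} ms exceeds budget")
```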
11. High-quality monitoring involves SIEM (security information and event management). Access logs should be analyzed for certain patterns, such as HTTP status codes (according to page 271 of Expert Python Programming). IDSes and IPSes can change their threshold sensitivity rules in response to certain events. With large amounts of data, statistical techniques and machine learning can enable anomaly detection. Apache Metron is an ambitious project with multiple use cases; one of them is collecting monitoring data from network sensors that capture packet data. There can be a blurry line between logging and monitoring (see this posting on logging).
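As a small illustration of scanning access logs for HTTP status code patterns, here is a sketch that counts 5xx responses in a combined-format access log; the log path, the log format, and the alert threshold are assumptions:

```python
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"   # assumed location and format
ERROR_THRESHOLD = 100                    # assumed number of 5xx lines worth flagging

# In the combined log format, the status code follows the quoted request line.
STATUS_RE = re.compile(r'" (\d{3}) ')

def count_statuses(path):
    """Tally HTTP status codes found in the access log."""
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = STATUS_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    counts = count_statuses(LOG_PATH)
    server_errors = sum(n for code, n in counts.items() if code.startswith("5"))
    if server_errors > ERROR_THRESHOLD:
        print(f"Possible incident: {server_errors} 5xx responses in the log")
```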
12. Remember to do black-box monitoring. New Relic monitors at the application level, which is quite different from obsessing over abstract, host-level SLIs. Application-level monitoring focuses on business performance metrics rather than theoretical CPU/memory consumption levels. Sometimes host-centric monitoring tools report that everything is fine while a critical application is completely unusable to a customer. Robotic Process Automation can do pattern recognition to deal with images and legacy technologies. Selenium is not primarily intended for monitoring, but it can be used this way.
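A minimal black-box sketch that probes a user-facing page the way a customer would, instead of reading host metrics; the URL, the expected text, and the timeout are assumptions:

```python
import urllib.request

CHECKOUT_URL = "https://shop.example.com/checkout"  # hypothetical user-facing page
EXPECTED_TEXT = "Place your order"                  # something a real customer must see

def checkout_is_usable(timeout=10):
    """Return True only if the page loads and shows what a customer expects."""
    try:
        with urllib.request.urlopen(CHECKOUT_URL, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and EXPECTED_TEXT in body
    except OSError:
        return False

if __name__ == "__main__":
    if not checkout_is_usable():
        print("ALERT: checkout looks broken to customers, whatever the CPU graphs say")
```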
13. Evaluate different monitoring tools thoroughly before you choose one. Each option has advantages and disadvantages. Some companies want close integration with a ticketing system such as ServiceNow. In today's world of hybrid (private-public) clouds, one key SLO may be to stay under a certain budget, so certain monetary costs may need to be closely monitored. Many organizations like hosted, subscription services for monitoring such as Datadog or PagerDuty. You may want to use CloudWatch if you are using AWS. The big providers such as AWS, Azure, GCP, DigitalOcean and Rackspace all have their own solutions for monitoring their customers' clouds. These options cost money, and you could monitor services using more affordable methods. If you host your own solution, you will want to be notified if the monitoring solution itself fails. There are many considerations to weigh.
For monitoring servers you may consider AppDynamics, collectd, Dynatrace, Ganglia, Icinga, Instana, LogicMonitor, Monit, Nagios, Sensu, Service Assurance, Spotlight, Sumo Logic, Sysdig, Zabbix, or Zenoss. Open source options may have unsupported bugs, but they can scale without licensing costs, and they can be more flexible for your organization to modify and extend with specialized features. Whatever monitors your systems' non-functional requirements should itself be reliable, scalable and maintainable; ideally your choice will also let you analyze historical trends without being monetarily expensive. Every enterprise and use case has different priorities.
If you want to try out Nagios (an open source tool), click here if you are using a Red Hat derivative or click here if you are using a Debian distribution of Linux. If you want to try out Zabbix (an open source tool), click here. You will likely want to consider multiple options as many monitor servers but do not monitor network traffic. For monitoring network traffic you may want to look at Cacti or SolarWinds.
14. Remember to monitor containers as well. For Kubernetes and Docker container monitoring, we recommend Prometheus. You may also want to try OpVizor, a separate and sophisticated proprietary tool. In Kubernetes, a pod whose containers keep failing enters the CrashLoopBackOff state; configure the related readiness (and liveness) probes in your YAML manifests wisely, as they affect both availability and what your monitoring sees; see this posting for more information.
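If you go the Prometheus route, the application or container typically just exposes metrics over HTTP for Prometheus to scrape. A minimal sketch using the prometheus_client library, where the metric names and the port are assumptions:

```python
# pip install prometheus-client
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")         # assumed metric name
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently running")  # assumed metric name

def handle_request():
    REQUESTS.inc()
    with IN_FLIGHT.track_inprogress():
        time.sleep(random.random() / 10)  # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://<pod>:8000/metrics
    while True:
        handle_request()
```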
15. Architect ample observability (and possibly measurability) into your systems.
"You can't manage what you can't measure." -Peter Drucker
https://www.contractguardian.com/blog/2018/you-cant-manage-what-you-cant-measure.html
The book The Mythical Man-Month (on page 243) says "[m]ethods of designing programs so as to eliminate or at least illuminate side effects can have an immense payoff in maintenance costs." For coding (as opposed to systems administration and operations work), use metaprogramming to increase observability. "Metaprogramming is a technique of writing computer programs that can treat themselves as data, so they can introspect, generate, and/or modify itself while running." (This quote is taken from page 158 of Expert Python Programming; the newer 4th edition is available here.) Observability increases the number of "users," so to speak, of a given back-end service. The Mythical Man-Month (on page 242) says "[m]ore users find more bugs," and Linus's law as defined in The Cathedral and the Bazaar (on page 30) says that "[g]iven enough eyeballs, all bugs are shallow." Developers, operations professionals, QA, and system designers all benefit from observability during debugging sessions and root cause analysis. There is a cost to engineering fine-grained observability, but we recommend accepting some of that cost for the sake of monitoring.
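As a small metaprogramming-flavored illustration in Python, a decorator can make existing functions observable without rewriting them by recording call counts and durations; the in-process dictionary used as a sink here is an assumption (a real system would export these numbers to your monitoring tool):

```python
import functools
import time

CALL_STATS = {}   # simple in-process sink; a real system would export this

def observable(func):
    """Wrap a function so every call records a count and a duration."""
    stats = CALL_STATS.setdefault(func.__qualname__, {"calls": 0, "total_seconds": 0.0})

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            return func(*args, **kwargs)
        finally:
            stats["calls"] += 1
            stats["total_seconds"] += time.monotonic() - start

    return wrapper

@observable
def lookup_order(order_id):
    # Hypothetical back-end call used only to demonstrate the decorator.
    return {"order_id": order_id, "status": "shipped"}

if __name__ == "__main__":
    lookup_order(42)
    print(CALL_STATS)
```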
The allegory of the cave in Plato's Republic illustrates the limits of understanding and the nature of learning new things. The cave dwellers are accustomed to seeing shadows and hearing echoes because there is only one wall they can see and hear from; what they observe is different from what people outside the cave observe. Increasing observability is like going outside the cave. Do not forget that observability can come with false positives and false negatives from individual components. Remember that monitoring may not be a substitute for communication and for humans verifying that services are up or down; complete system tests are less theoretical than individual component tests.
16. Have a disaster recovery plan that does not rely on monitoring. One reason is that on December 7, 2021, AWS had a serious outage. AWS is arguably the most trusted public cloud in the world; if Amazon can have such a problem, it can surely happen to other companies.
You must have a way to fix systems without monitoring, and your system needs a way to communicate with customers when an outage happens. It is not cheap to have such DR measures in place, but they are advisable and complementary to good monitoring. Here are some quotes from Amazon's statement about the outage:
...
This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.
...
Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST.
https://aws.amazon.com/message/12721/
Finally, you may want to read The Art of Monitoring by James Turnbull because it was cited in The DevOps Handbook. You may also want to read Practical Monitoring by Mike Julian.