AWS's Five Pillars of Well-Architected Framework recommend monitoring as a component of three of the pillars (operational excellence, reliability and performance efficiency).
There is no question that monitoring is important inside or outside of a public cloud. What patterns or traits might a good centralized monitoring system have?
Here are 12 characteristics of good monitoring.
1. Alert thresholds of various metrics should follow a Goldilocks approach as opposed to a cry-wolf approach. You do not want to be notified of small disturbances. Normally disk space utilization above 50% of what is available is not a concern. But if disk space on a server has reached 90% utilization, you may want to begin to take action. A study of I.T. professionals indicated that reducing false alarms in monitoring was a challenge for 79% of those questioned.
You must define or ascertain SLIs and SLOs. SLOs are closely related to the threshold that you will be notified about. SLIs will be the metrics themselves. Certain quantifiable levels should notify you before your SLOs are lost so you have time to react.
2. Ensure redundancy for the monitoring solution. A single point of failure can be a problem. If there is only one centralized monitoring server, you will want it to have two NICs, two power supplies connected to different UPSes, two CPUs, two RAM chips, and a RAID to ensure the server is up. If the monitoring system itself goes down, will you be notified of that? If not you will rely on false positives.
3. Test the monitoring solution itself. From time to time you should manually create an artificial event to trigger emails/pages/text messages are being sent. To generate large amounts of web traffic, we recommend Gatling. To generate a CPU or RAM load on a server we recommend using Bash. While some people prefer Ruby or Python for infrastructure automation, these languages have useful tools to stop the program from consuming too many resources. Higher level languages are therefore not as equipped to generate an artificial load as Bash.
Chaos engineering is the practice of robustly testing resilience and high availability. The term comes from what NetFlix designed and used called "chaos monkey." This program deliberately corrupts random servers -- in production -- on an ongoing basis. This tests the monitoring and alerting and it guarantees people on their toes. The professionals do not necessarily know if the problem arose from chaos monkey or from human operations. Management could manually trigger some chaos to see if protocols are followed correctly.
4. Have a method of deploying new servers that ensures it will be monitored from the beginning. When you deploy a new server, it must receive configuration (e.g., an agent installed and configured) so the monitoring system will work with it. Certain firewall rules must allow for this communication. Once installed and configured it will alert you to monitored events as they happen. With public clouds this is less of a concern because alerts are often from PaaS offerings.
5. Consider performance with your frequent checks. Some automated SQL commands can lock tables or otherwise generate a load on the database server. The client agent by itself on the monitored server could use some RAM and CPU depending on what it is checking for. Network communication between the client and server can contribute to congestion depending on how much is sent and how frequently it is sent. Setting interval durations and the minimum data necessary can help ensure you allowing the servers to perform well while monitoring the precious servers carefully.
6. There should be multiple ways of viewing the monitoring system. You will want a website that the professionals can go to for visual checks. A dashboard can have colors to give an overview of the status of the critical systems. There should be a way to notify individuals via phone call or text message. It can be desirable to also have a physical flat panel monitor to be in a common area to ensure the employees and management can see the health of the systems that they are monitoring. According to The DevOps Handbook, public telemetry helps create institutional learning and a productive atmosphere (page 203).
7. Use monitoring in Development and QA environments. The overhead and dependencies should be similar for testing purposes anyway. Parity in lower environments with the upper environments is one of the factors in 12 factor apps.
8. Document procedures for when problems occur and methods for turning off alerts. (While we technical bloggers are biased toward favoring documentation, we are not alone. You can read more about this here.)
The Agile Manifesto says that the signatories found value in comprehensive documentation. Having human operating procedures in place can help with large and growing teams so people know what to expect. AWS's Five Pillars of Well-Architected Framework recommend you annotate documentation. Outsourcing and ambitious automation can be easier with good documentation.
9. Ensure that the group responsible for deploying infrastructure packages and the group responsible for the CI/CD pipeline can leverage a monitoring solution. For security and pragmatic professional productivity reasons, leveraging a centralized monitoring tool for both infrastructure and code deployment can be advantageous. Some special work may go into configuring the monitoring a build and release pipeline, but it is highly recommended.
10. Network latency can degrade SaaS performance. TCP/IP collisions can degrade network performance significantly. Tuning and monitoring the network (e.g., with Cacti) can be one part of monitoring your servers. Some events are triggered because of network congestion. This congestion can happen independent of the server's OS, application or databases.
11. High-quality monitoring involves SIEM (security information and event management). IDSes and IPSes can change their rules upon certain events. With large amounts of data, statistical operations can enable anomaly detection with artificial intelligence and big data. Apache Metron is an ambitious project with multiple applications. One such application is to collect monitoring data from network sensors (that collect network packet data). There can be a blurry line between logging and monitoring (see this posting for logging).
12. Evaluate different monitoring tools thoroughly before you choose one. There are advantages and disadvantages of many options. Some companies want close integration with a ticketing system such as ServiceNow. In today's world of hybrid (private-public) clouds, one key SLO may be to stay under a certain budget. Certain monetary costs may need to be closely monitored. Many organizations like hosted, subscription services for monitoring such as Datadog or PagerDuty. You may want to use CloudWatch if you are using AWS. The big offerings such as AWS, Azure, GCP, DigitalOcean and RackSpace all have their own solutions for monitoring their customer's cloud. But these options cost money, and you could monitor services using more affordable methods. If you host your own solution, you will want to be notified if the monitoring solution itself fails.
For monitoring servers you may consider collectd, Dynatrace, Ganglia, Icinga, Nagios, Sensu, Sumo Logic, Sysdig, Zabbix, or Zenoss. Open source options may have unsupported bugs, but they can scale without licensing costs. Open source tools can also be more flexible for your organization to modify and design specialized features.
If you want to try out Nagios (an open source tool), click here if you are using a Red Hat derivative or click here if you are using a Debian distribution of Linux. If you want to try out Zabbix (an open source tool), click here. New Relic monitors at the application level which is quite different from obsessing over abstract SLIs. You will likely want to consider multiple options as many monitor servers but do not monitor network traffic. For monitoring network traffic you may want to look at Cacti or SolarWinds.
Remember to ensure you monitor containers. For Kubernetes and Docker container monitoring, we recommend Prometheus. You may also want to try OpVizor a separate and sophisticated proprietary tool; if you are interested in learning more, email sales [at] americandevops.com.