How would a DevOps engineer define operational readiness? What is operational readiness in I.T.?
“All things are ready, if our minds be so.” -William Shakespeare
While some people would say it is more of a state*, others may say it is more of a journey**. Operational readiness is having sufficient staff and automation to maintain the minimum level of service that the business requires for a given product or service. Each business has different SLAs based on market conditions.
To crystallize these needs, we provide the following requirements to achieve operational readiness:
- There should be well-defined service level agreement (SLA) requirements. You should know exactly what availability you must maintain (e.g., for the SaaS product you support). To do this, you must define service level indicators (SLIs) and service level objectives (SLOs).
- You should have a list of the stakeholders for your product(s) or service(s) readily available. Communication with them could be critical to the business's objectives. Downtime can be mitigated with proper communication to customers and/or clients.
- While full-stack developers and DevOps generalists are often undervalued, there should still be clearly defined roles for employees and/or companies that supply talent or services. If the budget is large enough, you should have 24/7 staff. If around-the-clock staffing is not available, some people should be on call. They must know what to expect and how soon they must respond. Escalation procedures should be clear and available even if several key servers are unavailable. The DevOps Handbook says that 80% of outages result from a change someone made (page 203).
- There should be a list of the resources, assets and/or components that are necessary to maintain operations as defined in the service level agreement. Know who owns any external dependencies too (e.g., internet hosting facilities and outsourced functions). Documentation may be important, but physical assets and contracts with other businesses should also be available in catastrophic situations.
- There should be backups and a business continuity process in place. The disaster recovery plan should be physically printed out and available in different geographic regions. It should give you the tools to consistently recreate your environment from the backed-up media alone. Ideally the file backup media and/or data center colocations will not share a floodplain or a fault line. As tedious and costly as it may be, the business would be well served by testing the installation media with people who are not acquainted with your business. Rudy Giuliani recommends that people prepare relentlessly.
- The Twelve-Factor App lists "Dev/prod parity" as the tenth factor. We believe operational readiness would normally include keeping development and quality assurance environments substantially similar to production, including the hardware. Thorough testing must take place in these lower environments before code is ready for production.
- For a greenfield environment that is about to go live, the underlying systems (e.g., components of microservices) should be stable, with enough resources to resiliently handle spikes in demand (network traffic and server workloads). There are many load-testing tools for web applications. To generate artificial traffic to a website, you could try Gatling; to generate raw network traffic, a complementary tool is Ostinato.
- While it is ideal to know your environment is ready before you go live, there are things you must do on an ongoing basis to maintain operational readiness. For many environments there should either be testing in production or an exceptional reason why this would be a bad idea. It is our opinion that operational readiness is a journey. Chaos engineering is the practice of robustly testing resilience and high availability. The term comes from a tool Netflix designed and used called "Chaos Monkey," a program that deliberately terminates random servers on an ongoing basis. This tests the monitoring and alerting, and it keeps people on their toes. The professionals do not necessarily know whether a problem arose from Chaos Monkey or from human operations. Management could manually trigger some chaos to see if protocols are followed correctly.
- There should be sufficient logging (e.g., of system authentications, data access, data changes, system startups and shutdowns). Postmortems for diagnosis can be valuable. Centralized logging (or telemetry) can preserve the evidence even if a local disk fails. For performance improvement, security investigations, satisfying audits, and disaster recovery goals, adequate logging is critical to being ready.
- There should be sufficient monitoring to maintain any service level agreements. What you monitor should map to components the service actually depends on. Some metrics should have corresponding thresholds that trigger an alert, ideally warning someone of a future problem before the undesired contingency happens. The monitoring system itself should be monitored, with alerts of its own.
- Security (including confidentiality, integrity and availability) pre-mortems are important. There should be security mechanisms in place (e.g., firewalls, intrusion detection systems) that are separate and complementary; if one security component is compromised, other devices may still be able to protect your network. There should be approval from a penetration-testing consulting company that your network and servers are ready for potential hackers. Cutting-edge security protocols, practices, and devices should be analyzed on an ongoing basis. The website us-cert.gov is useful, as is Bruce Schneier's website. To secure individual containers, we recommend the posting "How do you secure a docker container, a docker host and their network." If you want to buy some books on security (which are more trustworthy than random websites), click here for a list covering a wide range of I.T. security topics.
- Have proper communication in place so the documentation, definitions, and lists associated with maintaining or restoring operational readiness cannot be institutionally forgotten. The onboarding of professionals should be methodical so new hires know what to expect. Some environments have a culture of "move fast and break stuff"; if policies and infrastructure permit rapid development, groups in an organization will want to leverage it. Alternatively, if people are not supposed to download files from the internet, or if they are expected to manually configure PuTTY logging on their own desktops, you cannot expect new hires to know the procedures without telling them. Make sure you have a bug-tracking system (such as Jira or ServiceNow) and a Confluence or SharePoint repository for documentation. Excessive communication and meetings can impair the performance, resilience and morale of an organization, but beware of having insufficient communication too.
- Manual processes are prone to error; maintaining products and services is more efficient with automation.
- Every enterprise network environment has its own idiosyncrasies, strengths and needs. Therefore, in most situations you will have other specific requirements, not listed above, for true operational readiness.
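To make the SLI/SLO requirement above concrete, here is a minimal sketch of a request-based availability SLI compared against an SLO. The function names, request counts, and the 99.9% target are our own illustrative assumptions, not prescriptions.

```python
# Sketch: an availability SLI, an SLO, and the remaining error budget.
# The request counts and the 99.9% target are illustrative only.

def availability_sli(successful: int, total: int) -> float:
    """The SLI: fraction of requests that succeeded."""
    return successful / total

def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget (1 - SLO) still unspent."""
    budget = 1.0 - slo
    spent = 1.0 - sli
    return max(0.0, (budget - spent) / budget)

if __name__ == "__main__":
    slo = 0.999  # a "three nines" availability target
    sli = availability_sli(successful=2_594_000, total=2_595_000)
    print(f"SLI: {sli:.5f}, error budget remaining: "
          f"{error_budget_remaining(sli, slo):.1%}")
```

When the remaining budget approaches zero, teams often slow feature releases and prioritize reliability work.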
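The backup bullet above stresses being able to recreate the environment from the backed-up media alone; a small part of that is verifying the media has not silently changed since the backup was taken. A minimal checksum sketch (the function names are our own):

```python
import hashlib

def digest(data: bytes) -> str:
    """SHA-256 hex digest of backup contents."""
    return hashlib.sha256(data).hexdigest()

def verify_backup(data: bytes, recorded_digest: str) -> bool:
    """True if the backup still matches the digest recorded when it was taken."""
    return digest(data) == recorded_digest
```

In practice the recorded digests would be stored separately from the media itself, and verification would be scheduled, not ad hoc.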
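For the load-testing bullet, purpose-built tools such as Gatling are the right choice; still, a standard-library sketch shows the core idea of issuing concurrent requests and summarizing latencies. The URL is a placeholder and the percentile math is deliberately naive.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

TARGET = "http://localhost:8080/health"  # placeholder endpoint

def timed_get(url: str) -> float:
    """Issue one GET and return its latency in seconds."""
    start = time.perf_counter()
    with urlopen(url, timeout=5) as resp:
        resp.read()
    return time.perf_counter() - start

def summarize(latencies: list) -> dict:
    """Naive p50/p95/max summary of a list of latencies."""
    ordered = sorted(latencies)
    return {
        "p50": ordered[len(ordered) // 2],
        "p95": ordered[min(len(ordered) - 1, int(len(ordered) * 0.95))],
        "max": ordered[-1],
    }

def load_test(url: str, requests: int = 100, concurrency: int = 10) -> dict:
    """Fire `requests` GETs with `concurrency` workers and summarize."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return summarize(list(pool.map(timed_get, [url] * requests)))
```

A real tool would also ramp load gradually and report error rates, not just latencies.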
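The chaos-engineering bullet can be illustrated with a dry-run "target picker" in the spirit of Chaos Monkey. The host names are placeholders, and a real implementation would actually terminate the chosen instance rather than report it.

```python
import random

def pick_chaos_target(hosts: list, rng: random.Random) -> str:
    """Choose one host to disrupt. Picking the victim is the easy part;
    surviving the disruption gracefully is the point of the exercise."""
    return rng.choice(hosts)

def chaos_round(hosts: list, seed=None) -> str:
    """One dry-run round: report which host would be terminated."""
    rng = random.Random(seed)
    target = pick_chaos_target(hosts, rng)
    # Dry run only: a real Chaos Monkey would terminate the instance here.
    return f"would terminate {target}"
```

Seeding the generator (as a manually triggered, reproducible drill) matches the suggestion above that management could deliberately trigger chaos to see if protocols are followed.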
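For the logging bullet, emitting structured (e.g., JSON) events makes centralized collection and later postmortems much easier than free-form text. A minimal sketch; the field names are our own convention, not a standard.

```python
import json
import time

def log_event(event: str, **fields) -> str:
    """Render one structured log line, suitable for shipping to a
    centralized log collector."""
    record = {"ts": time.time(), "event": event}
    record.update(fields)
    return json.dumps(record)

# e.g. print(log_event("user_login", user="alice", success=True))
```

Because every line is machine-parseable, a central collector can index, search, and alert on these events even after the originating disk fails.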
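The monitoring bullet's pattern of "metric, threshold, alert" can be sketched in a few lines; the metric names and limits below are illustrative assumptions, since real values depend on your SLOs.

```python
# Illustrative thresholds; real values depend on your SLOs.
THRESHOLDS = {"cpu_percent": 85.0, "disk_used_percent": 90.0}

def check_metrics(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return an alert message for each metric over its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"ALERT: {name}={value} exceeds {limit}")
    return alerts
```

A production system would also page on *missing* metrics, which is one way the monitoring system itself gets monitored.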
* The Project Management Institute defines (or defined) operational readiness as a state.
** The conclusion of this paper says operational excellence (something closely related to operational readiness) is an "ongoing effort."