Operational Resilience and the true value of Operational Acceptance Testing

Laurence O'Rourke, Sr. Project Manager

While we continually strive for ever more automation, and as CTOs increasingly look towards IaaS and PaaS to enable this shift, a pertinent question comes to mind: is this technology shift a detriment to our organization if incorrect assumptions lead us to overlook some of the traditional testing phases?

There is a continuing drive toward DevOps and BizOps models, increased automation and frequency of delivery via CI/CD pipelines, and integrated automated testing, alongside an ongoing shift to IaaS cloud-based technologies. However, it is vital to consider whether we are increasing our operational (including financial, regulatory, and reputational) risks by overlooking the importance of Operational Acceptance Testing (OAT).

We must ask ourselves, “Does this shift to managed technology give our organization a false sense of security, by leading us to assume that all operational resiliency is included in the cost whenever a service is cloud- or vendor-hosted and managed?”

So why this shift away from OAT? That might stem from several issues: 

In a swiftly changing IT landscape, the promise of instant improvements in resilience, scalability, time to market, and ease of maintenance for the customer base is the primary driver behind the increased adoption of microservices and API-style architecture.

This promise, alongside the shift from traditional waterfall or V-model methodologies to leaner, more agile ways of delivery, has meant that many technology managers see an assumed lengthy manual test cycle, run after final code and infrastructure delivery to validate the operational aspects of the system, as either a blocker on speed to market or a poor ROI. Operational requirements are often poorly presented in, or missing from, the features and stories, so insufficient time, budget, and resources are allotted to them until late in the development cycle.

There is also a perception that relying on IaaS brings a low probability of failure and high fault tolerance. Most cloud services within Microsoft Azure come with a 99.95% SLA guarantee, which is much higher than most on-premises data centers can hope to offer. However, that guarantee does not cover the configuration of the components or the code developed in-house.
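To put the 99.95% figure in perspective, an availability percentage can be converted into the downtime it still permits. The following is a minimal sketch, assuming a 365-day year and a 30-day month; the function name is illustrative, and the result covers only the provider's service, not in-house configuration or code.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year


def allowed_downtime_hours(sla_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime (in hours) that an availability SLA still permits."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_hours * (1 - sla_percent / 100)


if __name__ == "__main__":
    yearly = allowed_downtime_hours(99.95)
    monthly = allowed_downtime_hours(99.95, period_hours=30 * 24)
    print(f"99.95% allows about {yearly:.2f} hours of downtime per year")
    print(f"99.95% allows about {monthly * 60:.1f} minutes of downtime per 30-day month")
```

Even a high SLA leaves roughly four hours a year in which the service may legitimately be unavailable, so the systems built on top of it still need their own validated recovery behaviour.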

There is also a false sense of security, based either on historic operational stability (it hasn’t happened yet, so it never will) or on the trust that IaaS will automatically bring the service back to life. It is a cloud-provided database, so there is no requirement for us to validate our resiliency or data backups, right? Wrong! As we move to cloud-based technologies, this increasingly common fallacy must be avoided.

Real-world scenario

In August 2019, Nissan Group’s data centre in Denver crashed. The impacted system, known internally as NNAnet, has been called Nissan’s lifeblood: employees use it to order cars and parts, manage product rebate sales, get information on vehicle recalls, file the warranty claims needed to price and start service work, and get financing information. The system remained down for four days, impacting operations at many retailers and production systems at two factories.

The company, its retailers, and its customers were all impacted, in an instance where correctly validated high-availability systems would have mitigated, or at least minimized, the crash’s impact. The situation turned into a literal disaster for Nissan, as “commerce among consumers, retailers, distribution networks, manufacturing plants and finance companies” was affected. The total financial impact for Nissan and its dealers, retailers, and partners is still unclear.

Debunking the myths:  

So why and how should we look to increase OAT as we move ever further toward IaaS and microservices as our preferred technology strategy? And how can we debunk some of the myths raised earlier in this article?

Our systems’ probability of failure increases as we move to more complex and heavily integrated microservices. Yes, at the service level, IaaS does provide us with a level of certainty; however, cloud architecture is built on the premise that hardware will fail at some point. For this reason, identifying and validating the complexity of the various integrations of “x” as early as possible in the development cycle is a must.
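The effect of chaining services together can be illustrated with a little arithmetic. The sketch below assumes that every component on a request path must be up for the request to succeed, and that components fail independently, which real systems only approximate. The function name and the component counts are illustrative, and the 99.95% input is simply the SLA figure quoted earlier.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a request path that needs every listed component to be up.

    Assumes independent failures; correlated failures would change the result.
    """
    return prod(availabilities)


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        combined = serial_availability([0.9995] * n)
        print(f"{n:>2} components at 99.95% each -> {combined * 100:.3f}% combined")
```

Under these assumptions, ten components at 99.95% each combine to roughly 99.5%, about ten times the downtime allowance of a single component. This is why each integration needs to be validated and not inferred from the individual service guarantees.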

Far from being a blocker to your development cycles, the shift to IaaS and IaC (infrastructure as code) now allows the DevOps and BizOps teams to manage OAT in line with your agile framework. In the early iterations, components can be acceptance-tested in isolation from the broader system by quickly deploying the IaaS components and validating them through targeted user stories based on operational requirements.

True test automation and increased scope 

One of the key benefits of the shift to IaC is that OAT can now be fully automated and integrated with your CI/CD pipeline through several SaaS and open-source toolsets. Given the nature of IaC, these toolsets let us create real-world scenarios at the click of a button, making use cases easier to run and to repeat. Infrastructure defined as code can be quickly stopped, started, and failed without manual intervention.
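As an illustration of what an automated operational check might look like, here is a minimal, self-contained sketch of a failover test. Everything in it is hypothetical: the Node and Service classes are in-memory stand-ins for deployed infrastructure, and in a real pipeline the stop and start steps would be carried out by whichever IaC or cloud tooling the team already uses.

```python
class Node:
    """Stand-in for one deployed instance (hypothetical, in-memory only)."""

    def __init__(self, name):
        self.name = name
        self.running = True

    def stop(self):
        self.running = False

    def start(self):
        self.running = True


class Service:
    """Routes each request to the first running node, like a simple failover group."""

    def __init__(self, nodes):
        self.nodes = nodes

    def handle(self, request):
        for node in self.nodes:
            if node.running:
                return f"{node.name} handled {request}"
        raise RuntimeError("no healthy node available")


def oat_failover_check(service):
    """Stop the primary, confirm another node answers, then restore the primary."""
    primary = service.nodes[0]
    primary.stop()  # inject the failure
    try:
        response = service.handle("health-check")
    except RuntimeError:
        return False  # nothing took over: the operational requirement is not met
    finally:
        primary.start()  # always restore, even when the check fails
    return not response.startswith(primary.name)


if __name__ == "__main__":
    service = Service([Node("primary"), Node("replica")])
    print("failover check passed:", oat_failover_check(service))
```

Because the check returns a simple pass or fail, it could run as one more stage in the pipeline, alongside functional tests, each time the infrastructure definition changes.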

OAT Benefits 

  • Operational Acceptance Testing reduces downtime, limits the impact of any failed changes, and supports faster overall delivery at a reduced cost.
  • It allows us to establish an accurate baseline of operational metrics and to measure the potential impact of failures.

Typical scenarios we at Fulcrum Digital can help with:

  • Identifying operational requirements: Fulcrum Digital will support you in identifying and specifying operational requirements early in the lifecycle, and in verifying early in development that those requirements are met.
  • Understanding the single points of failure: Are your data and replication up to date and easily restorable? By providing a detailed impact assessment of your current architecture and/or solution design, Fulcrum Digital can quickly help to identify any single points of failure and validate the impact of any of these failures on your business.
  • Recovery from a major incident: Should a significant outage occur, do you have runbooks available? Are your support teams resilient and able to act quickly in this scenario? By planning and executing chaos game days with your technical or BizOps teams, Fulcrum Digital can quickly and easily identify where you have gaps in your knowledge, and work with the operations resources to create runbooks and identify training plans.
Contact us to learn more about how to identify business needs and determine solutions!