Here’s Why Optimizing for MTTR over MTBF Is Best for Companies

EC Cloud

6 Nov 2020

SaaS Enterprise: Release Frequency vs. Reliability and Availability

In Software as a Service (SaaS) enterprises, one of the ongoing debates is release frequency vs. reliability and availability. That is, are you Team MTTR (mean time to restore) or Team MTBF (mean time between failures)? In this blog post I argue for MTTR: push more often, welcome the uncertainty this introduces, and invest in training and resources to cope with the resulting outages. It comes down to constantly shipping minimum viable products (MVPs), testing in production, and accepting failure. Here’s why it works.
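
To make the trade-off concrete, here is a minimal sketch using the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR). The function name and the numbers are illustrative assumptions, not measurements from any real system. It shows that recovering twice as fast improves availability exactly as much as failing half as often.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative numbers only (assumed for this example).
baseline = availability(mtbf_hours=100, mttr_hours=1.0)
double_mtbf = availability(mtbf_hours=200, mttr_hours=1.0)  # fail half as often
halve_mttr = availability(mtbf_hours=100, mttr_hours=0.5)   # recover twice as fast

print(f"baseline:    {baseline:.4%}")
print(f"double MTBF: {double_mtbf:.4%}")
print(f"halve MTTR:  {halve_mttr:.4%}")
```

Both improvements land on the same availability figure, so the real question is which one is cheaper to achieve and which one keeps the product moving.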

More Testing Leads to Improved Quality

For engineering teams, optimizing for MTTR can seem counterintuitive because it feels overwhelming when things go wrong. But that attitude is precisely why teams that follow an MTBF approach struggle to address issues as they arise. It’s difficult to refine the response process when failures happen infrequently.

A continuous stream of releases brings more testing and more code review, because failure is both expected and encouraged. With that in mind, the team is ready to refine and iterate on its failure-recovery code, which eventually gives it greater familiarity with the product and improves reliability.

Changes often take longer to ship for teams that minimize how often code is deployed. The 80/20 rule applied to engineering says that the last 20 percent of the job takes 80 percent of the time. With SaaS, that last 20 percent is usually getting your feature deployed through development and staging. If you deploy only infrequently, say weekly or monthly, the “quantization” size for each feature cannot be smaller than a week.

Although this conservative strategy leads to a more stable site on paper, in practice it results in a stale product. The product that wins in the marketplace is not necessarily the best one. It’s the one that responds most readily to customer needs.

With an MTTR strategy, we generally do not spend all of our resources on creating the most available product. Instead, we put our effort into building the minimum product and tightening the feedback loop as much as possible. If the product does not behave correctly, and that will always happen at some point, we can quickly stand up a different, and even better, service that reflects the evolving needs of the customer.

Embracing the Outcomes Contributes to Stability

Unlike conventional business applications, SaaS can be released very regularly, as often as several times a day. This enables SaaS companies to adapt rapidly to evolving customer demands while placing no burden on their users (i.e., users do not need to continually update their software).

That said, during the holiday season there are usually no “releases” or changes to a website, for two reasons:

  • Customer traffic is high, and this is the most lucrative time of the year.
  • Many key employees take time off.

Therefore, many firms do not want to ship releases that could endanger profits and pull people away from their holiday plans. E-commerce websites go into code freeze, and many brands stop launching altogether. All of this shows teams minimizing change in order to maximize MTBF.

For a SaaS vendor that optimizes for MTTR, by contrast, the holiday season, like any holiday, is the most stable time of the year, because of the high pace at which the team deploys updates and the familiarity every developer has with the code base.

However, each new update carries the possibility of introducing new bugs and outages. Strategies used to decrease that risk include continuous deployment, blue-green deployments, and canarying. The principle is that releasing more regularly reduces the number of changes between any two releases. Consequently, there is less risk of unexpected interactions and a better chance of quickly determining which release triggered a problem, and therefore which change.
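
As one illustration of canarying, the sketch below compares the error rate of a small canary deployment against the current baseline and decides whether to promote the release, roll it back, or keep waiting for traffic. The function name `canary_verdict` and every threshold in it are hypothetical choices made for this example, not settings from any particular tool.

```python
def canary_verdict(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    max_ratio: float = 2.0,        # assumed: canary may be at most 2x the baseline rate
    error_floor: float = 0.005,    # assumed: always tolerate up to 0.5% errors
    min_requests: int = 100,       # assumed: minimum canary traffic before judging
) -> str:
    """Return "wait", "promote", or "rollback" for a canary release."""
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # The floor keeps a near-zero baseline from making any single error fatal.
    allowed_rate = max(baseline_rate * max_ratio, error_floor)
    return "promote" if canary_rate <= allowed_rate else "rollback"


print(canary_verdict(10, 10_000, 1, 1_000))
print(canary_verdict(10, 10_000, 50, 1_000))
print(canary_verdict(10, 10_000, 0, 50))
```

Because each frequent release contains only a few changes, a "rollback" verdict points at a short list of suspects, which is where the faster recovery comes from.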

Reinforcing the Teams

Optimizing for MTTR helps create a more resilient on-call team that does not flinch when there is an outage. When the team handles failures routinely, getting paged causes no panic and the response process becomes streamlined, which means quicker repairs. Uncertainty is welcome, because new versions of the code may fail in unpredictable ways.

These unpredictable issues help educate new members of the on-call team. With drills and training alone, it is difficult to instill confidence in engineers. Actual incidents teach them faster. One solution is to inject failures artificially. This helps you find things like single points of failure, but it works best with systems that do not undergo a high rate of change.

Another solution is to deploy new releases more regularly, which may introduce real issues to fix.

Here’s How to Optimize for MTTR

  • Recognize the trade-off: releasing less frequently optimizes for MTBF but results in a “stale” product that cannot adapt to changing consumer demands. Optimize for MTTR instead, for faster recovery.
  • Introduce a high release and deployment cadence to adapt quickly to customer needs and ship the features they want.
  • Adopt tools and technologies such as Kubernetes, which let you automate releases and deployments and run them frequently.
  • Ensure your application is well instrumented, for example with Prometheus, Grafana, and Loki, for strong observability. This builds the team's confidence and speeds up problem-solving.
  • Track release cadence and on-call load (incidents per shift), and balance the two. If there are too many incidents, focus on tech debt; if there are too few, take more risks to increase learning opportunities.
  • Promote iteration: evaluate which pages or features are most frequently used and aim to release more often in response to user demand.
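
The balancing rule for on-call load in the list above can be sketched as a small helper. The name `on_call_recommendation` and the `low` and `high` thresholds are assumptions chosen for illustration; each team would pick its own values for what counts as "too few" or "too many" incidents per shift.

```python
def on_call_recommendation(
    incidents: int,
    shifts: int,
    low: float = 0.5,   # assumed: fewer incidents per shift than this is "too few"
    high: float = 2.0,  # assumed: more incidents per shift than this is "too many"
) -> str:
    """Suggest where to focus, based on average incidents per on-call shift."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")

    load = incidents / shifts
    if load > high:
        return "focus on tech debt"
    if load < low:
        return "take more risks"
    return "keep the current cadence"


# Example: 30 incidents across 10 shifts is 3 per shift, above the assumed limit.
print(on_call_recommendation(incidents=30, shifts=10))
```

Reviewing this number alongside release cadence, for instance once per on-call rotation, keeps the two in balance rather than letting either one drift.
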

Summing Up

Some critics dismiss this approach as “worse is better.” In the end, it’s about being agile and getting something out there so that people can try it, even if the product is not yet best in class.

Teams starting from zero can move quickly even if the system is initially unstable. Outages or vulnerabilities teach lessons, and the team can pivot and improve the system. If necessary, they can even rebuild entirely and deploy a new product. As the saying goes, Rome wasn’t built in a day.