The Cost of Software Bugs: 5 Powerful Reasons to Get Upset

If you read the PossumLabs blog regularly, you know already that I am focused on software quality assurance measures and why we should care about implementing better and consistent standards. I look at how the software quality assurance process affects outcomes and where negligence or the effects of big data might come into play from a liability standpoint. I also consider how software testing methodologies may or may not work for different companies and situations.

If you are new here, I invite you to join us on my quest to improve software quality assurance standards.

External Costs of Software Bugs

As an automation and process improvement specialist, I am somewhat rare in my infatuation with software defects, but I shouldn't be. The potential repercussions of these bugs are enormous.

And yet you ask, why should YOU care?

Traditional testing focuses on where in the development lifecycle a bug is found and how to reduce costs. This is the debate of Correction vs. Prevention, and experience demonstrates that prevention tends to be significantly more budget-friendly than correction.

Most development teams and their management have a singular focus when it comes to testing: they want to deliver a product that pleases their customer as efficiently as possible. This self-interest, of course, focuses on internal costs. In the private sector profit is king, so this is not surprising.

A few people, but not many, think about the external costs of software defects. Most of these studies and the interested parties tend to be government entities or academic researchers.

In this article, I discuss five reasons that you, as a consumer, a software developer, or whoever you might be, should be concerned with the cost of software bugs to society.

#1 No Upper Limit to Financial Cost

The number one reason we should all be concerned is that, in reality, the costs of software defects, misuse, or crime likely have no upper limit.

In 2002, NIST compiled a detailed study looking at the costs of software bugs and what we could do to prevent them and reduce their costs, not only within our own companies but also as external societal costs. The authors attempted to estimate how much software defects cost different industries and, based on these estimates, proposed some general guidelines.

Although it is an interesting and useful paper, the most notable black swan events of the last 15 years demonstrate that its estimates provide a false sense of security.

For example, when a bug caused $500 million US in damage in the Ariane 5 rocket launch failure, observers treated it like a freak incident. At the time, little did we know that the financial cost of a "freak incident" would grow by a few orders of magnitude just a few years later.

This behavior goes by many names: Black Swans, long tails, etc. What it means is that there will be extreme outliers. These outliers will defy any bell curve model; they will be rare, they will be unpredictable, and they will happen.

Definitions:

A Black Swan is an unpredictable, high-impact event, so named by Nassim Nicholas Taleb in his book The Black Swan: The Impact of the Highly Improbable. It has been predicted that the next Black Swan will come from cyberspace.

Long tail refers to a statistical distribution in which most events occur within a narrow range while a few rare events occur far out at the end of the tail. https://en.wikipedia.org/wiki/Long_tail

Of course, it is human nature always to try and assemble the clues that might lead to predicting a rare event.

Let’s Discuss Some Examples:

4 June 1996

A 64-bit floating-point value was converted to a 16-bit signed integer, and $500 million went up in flames. As you can see in Table 6-14 (page 135) of the previously mentioned NIST study, the estimated annual cost of software defects for an aerospace company of this size was only $1,289,167. Five hundred million dollars blows that estimate right out of the water.

This single bug cost nearly 400 times the expected annual cost of defects for a company.
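To make the failure mode concrete, here is a minimal C sketch of the same class of bug: a wide value silently narrowed into a 16-bit integer. The variable names and numbers are purely illustrative; the actual Ariane 5 guidance code was written in Ada.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* A 64-bit floating-point value that has grown far beyond the
       range of a 16-bit signed integer (-32,768 .. 32,767). */
    double horizontalVelocity = 500000.0;

    /* Silent narrowing conversion. In C this is undefined behavior when the
       value does not fit; in the Ariane 5 software the equivalent conversion
       raised an unhandled exception that shut down the guidance system. */
    int16_t converted = (int16_t)horizontalVelocity;

    printf("original: %.1f  converted: %d\n", horizontalVelocity, converted);
    return 0;
}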

May 2007

A startup routine for engine warm-up is released with some new conditions. According to the same NIST study (Table 6-14, page 135, "Company Costs Associated with Software Errors and Bugs"), the estimated annual cost of software bugs for an automotive company with more than 10,000 employees was only $2,777,868. That is not even a dent in the cost to Volkswagen: this code cost the company roughly 22 billion dollars.

That equates to roughly 8,000 times the expected cost of defects per company per year.

Unfortunately, when it comes to liability, it seems only academics are interested in this type of prediction. But given the possibility of such outsized costs to a single company, shouldn't we all be concerned?

#2 Data Leaks: Individual Costs of Data Loss?

Data leaks of 10-100 million customer records are becoming routine. These leaks are limited by the size of the datasets and thus unlikely to grow much more, in large part because not many companies have enough data to suffer a breach of billions of records.

Facebook has roughly 2 billion users, so the theoretical upper limit of a single data breach is set by a dataset on that scale: Facebook, or perhaps a Chinese or Indian government system. We only have about 7.5 billion people on Earth, so for a breach of 10 billion users we would first need more people.

Security Breaches are limited by the Human Population Factor

That is what makes security breaches different: the population cap only tells us that we will keep approaching the theoretical limit of how bad a breach can be. The Equifax breach alone affected 143 million users.

When it comes to the monetary damages of a data breach, however, there is no such limiting factor as population size.

As we saw with Yahoo and more recently Equifax, cyber security software incidents show a similar pattern of exponential growth when it comes to costs. Direct financial costs are trackable, but the potential for external costs and risks should concern everyone.

#3 Bankrupt Companies and External Social Costs

From its inception, no one would have predicted that the simple code pasted below might cost VW $22 billion US:

if (-20 /* deg */ < steeringWheelAngle && steeringWheelAngle < 20 /* deg */)
{
    lastCheckTime = 0;
    cancelCondition = false;
}
else
{
    if (lastCheckTime < 1000000 /* microsec */)
    {
        lastCheckTime = lastCheckTime + dT;
        cancelCondition = false;
    }
    else cancelCondition = true;
}

Even if you argue that this is not an example of a software defect but rather deliberate fraud, it is unlikely you would have predicted the real cost. Certainly, this one was different: unexpected, not conforming to our expectations of a software defect. But that is the definition of a Black Swan. They do not conform to expectations, and as happened here, the software did not act according to expectations. The result cost billions.

How many companies can survive a 22 billion dollar hit? Not many. What happens when a company we rely on heavily suddenly folds? Say, the company that manages medical records in a fifth of US states? Or a web-based company that provides accounting systems to clients in 120 countries just turns off?

#4 Our National Defense is at Risk

It does not take much to grasp the significance of this one, and it is an issue currently in the limelight. Software defects, faults, and errors have the potential to produce extreme costs despite occurring infrequently, and the origins of those long tail costs are not always predictable.

After all, what possible liability would Facebook have for real-world damages from international tampering in an election? It is all virtual, just information; until that information channel is misused.

There is very little chance that, when actuaries for Facebook assessed the company's risks, they looked for this particular area of exposure. Sure, they considered liability, such as people live-broadcasting horrible and inhumane things, but did they contemplate foreign election interference? And even if they did consider the possibility, how would they have been able to predict or monitor the entry point?

And that is the long tail effect: it is not what we know or can imagine, it is the unexpected. It is the bug that cannot be patched because the rocket has already exploded, the criminal misuse of engine optimization routines, or the idea that an election could be swayed by misinformation. These events are so costly that we cannot assume we know how bad it could be, because the nature of software means that things will be as bad as they can possibly get.

#5 Your Death or Mine

Think of the movie It, based on Stephen King's book of the same name: a clown that deceives children and leads them to death and destruction. What happens when a piece of equipment goes haywire, masquerading as one thing while doing another? Software touches enough of our lives, from the hospital setting to self-driving cars, that a software defect could undoubtedly lead to death.

We’ve already had a case, presumably settled out of court, where a Therac-25 radiation therapy machine irradiated people to death. What happens when a cloud update to a control system removes fail-safes on hundreds or thousands of devices in hospitals or nursing homes? Who will be held liable for those deaths?

Mitigation is often an attempt at Prediction

A large part of software quality assurance is risk mitigation: overlapping safety nets that look for unexpected behaviors. Mitigation is an attempt to make it less likely that your company unintentionally finds the next "unexpected event."

A lot has been written about the optimal way to get test coverage on your application. Most of it comes down to testing the system at the lowest feasible level (the unit test), which has given us the testing pyramid. As a cost argument, this is mathematically sound. Unfortunately, the pyramid assumes that there are no gaps in coverage, and less overlap means that a gap in coverage at a lower level is less likely to be caught at a higher level.

The decision about test coverage and overlapping coverage can be approximated using Bernoulli trials, where each trial delivers one of two results: success or failure; a given layer of testing either catches a defect or misses it.
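As a rough sketch of that idea (the catch rates below are made-up numbers, not measurements from any real test suite), each test layer can be treated as an independent Bernoulli trial; a defect reaches production only if every layer misses it, so an overlapping layer lowers the escape probability:

#include <stdio.h>

/* Probability that a defect escapes every layer, assuming the layers are
   independent Bernoulli trials with the given catch probabilities. */
static double escapeProbability(const double catchRate[], int layers)
{
    double escape = 1.0;
    for (int i = 0; i < layers; i++)
        escape *= (1.0 - catchRate[i]);
    return escape;
}

int main(void)
{
    /* Illustrative catch rates: unit, integration, end-to-end. */
    double pyramidOnly[] = { 0.70, 0.50 };
    double withOverlap[] = { 0.70, 0.50, 0.30 };

    printf("Escape probability, two layers:   %.3f\n", escapeProbability(pyramidOnly, 2));
    printf("Escape probability, three layers: %.3f\n", escapeProbability(withOverlap, 3));
    /* 0.150 vs. 0.105: the overlapping layer catches some of what slipped through. */
    return 0;
}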

Prioritizing the Magnitude Of Errors and their Effects

When we multiply the expected chance of a defect by the cost of a defect, we can compare that to the chance of a defect slipping through overlapping coverage, multiplied by the same cost.

We are usually looking at the cost of reducing the chance of a defect slipping through and comparing that to our estimated cost of a defect.

Unfortunately, the likelihood that we underestimate the cost of a defect due to long tail effects is very high. Yes, it is improbable that your industry will have a billion dollar defect discovered this year; but how about in the next 10 years? Now the answer becomes a maybe. Let us call it a 10% chance, and let us say that there are 100 companies in your industry. What is the cost of one of those outlier defects per year?

$1,000,000,000 × 0.01 (1% chance per year) × 0.01 (1% chance of it hitting your company) = $100,000 expected cost of outlier defects per year.
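The same back-of-the-envelope arithmetic written out as a small C program; every input is just the assumption from the example above:

#include <stdio.h>

int main(void)
{
    /* Assumed figures from the example above. */
    double outlierCost   = 1000000000.0; /* a $1B outlier defect                 */
    double chancePerYear = 0.01;         /* ~10% over 10 years, so ~1% per year  */
    double chanceItsYou  = 0.01;         /* 1 company out of 100 in the industry */

    double expectedAnnualCost = outlierCost * chancePerYear * chanceItsYou;
    printf("Expected annual cost of outlier defects: $%.0f\n", expectedAnnualCost);
    /* Prints $100000: a real line item even though the event itself is rare. */
    return 0;
}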

The problem with outlier events is that, despite their rarity and the small probability that your company will be the victim, the real outliers are potentially so big and so expensive that it is worth your time to consider the possibility.

Enduring the Effect of a Black Swan

In reality, companies might use bankruptcy law to shield themselves from the full cost of one of these defects. VW bore the financial burden of its expensive defect only because it could afford to do so without going bankrupt. Most companies could not afford to pay the costs of this type of event and would ultimately be forced to dissolve.

We cannot continue to ignore that software defects, faults, errors, etc. have the potential to produce extreme costs, despite infrequent occurrences. Furthermore, the origins of the costs of long tail events may not always be predictable.

The problem with the rarity of an event as an insurance policy is that the risk of significant black swan bug events goes beyond the simple financial costs borne by individual companies. The weight of these long tail events is borne by society.

And so the question is, for how long and to what extent will society continue to naively or begrudgingly bear the cost of software defects? Sooner or later the law will catch up with software development. And software development will need to respond with improved quality assurance standards and improved software testing methodologies.

What do you think about these risks? How do you think we should address the potential costs?

Reference:

Tassey, G. (2002, May). Planning Report 02-3: The Economic Impacts of Inadequate Infrastructure for Software Testing [PDF]. Gaithersburg, MD: RTI for the National Institute of Standards and Technology.

