3 Risks to Every Team’s Progress and How to Mitigate Them

When looking to improve performance, the first thought is often to increase the size of our development team; however, a larger group is not necessarily the only or the best solution. In this piece, I suggest several reasons to keep teams small, as well as reasons to stop them from getting too tiny. I also look at several types of risk to consider when choosing a team size: how team size affects communication, individual risk, and systematic risk.

Optimal Team Size for Performance

The question of optimal team size is a perpetual debate in software organizations. To adjust, grow, and develop different products, we must rely on teams of various sizes and makeups.

We often assume that fewer people get less done, which results in the decision of adding people to our teams so that we can get more done. Unfortunately, this solution often has unintended consequences and unforeseen risks.

When deciding how big of a team to use, we must take into consideration several different aspects and challenges of team size. The most obvious and yet most often overlooked is communication.

Risk #1: Communication Costs Follow Geometric Growth

The main argument against big teams is communication. Adding team members results in a geometric growth of communication patterns and problems. This increase in communication pathways is most easily illustrated by a visual representation of team members and communication paths.

Geometric Growth of Communication Paths

Bigger teams increase the likelihood that we will have a communication breakdown.
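This growth is easy to quantify: a team of n members has n(n-1)/2 possible pairwise communication paths. A quick sketch, purely illustrative:

```python
def communication_paths(team_size: int) -> int:
    """Pairwise communication paths among team_size people: n * (n - 1) / 2."""
    return team_size * (team_size - 1) // 2

# Doubling a team from 5 to 10 people more than quadruples the paths.
for n in (2, 5, 10, 15):
    print(f"{n:>2} people -> {communication_paths(n):>3} paths")
```

A team of 5 already has 10 paths to keep healthy; a team of 15 has 105.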

From the standpoint of improving communication, one solution that we commonly see is the creation of microservices to reduce complexity and decrease the need for constant communication between teams. Unfortunately, the use of microservices and distributed teams is not a “one size fits all” solution, as I discuss in my blog post on Navigating Babylon.

Ultimately, when it comes to improving performance, keep in mind that bigger is not necessarily better. 

Risk #2: Individual Risk & Fragility

At first glance, a larger team seems less fragile: after all, a bigger team should be able to handle one member winning the lottery and walking out the door pretty well. This assumption is partially correct, but lottery tickets are usually individual risks (unless people pool tickets, something I have seen in a few companies).

When deciding how small to keep your team, make sure that you build in consideration for individual risk and be prepared to handle the loss of a team member.

Ideally, we want the smallest team possible while limiting our exposure to any risk tied to an individual. Unfortunately, fewer people tend to get less work done than more people (leaving skill out of it for now).

Risk #3: Systematic Risk & Fragility

Systematic risk relates to events that will affect multiple people in the team. Fragility is the concept of how well a structure or system can handle hardship (or changes in general). Systematic risks stem from aspects shared across the organization: leadership, shared space, or shared resources.

Let’s look at some examples:

  • Someone brings the flu to a team meeting.
  • A manager/project manager/architect has surprise medical leave.
  • An affair between two coworkers turns sour.

All of these events can grind progress to a halt for a week (or much more). Events that impact morale can be incredibly damaging as lousy morale can be quite infectious.

In the Netherlands, we have the concept of a Baaldag (roughly translated as an irritable day) where team members limit their exposure to others when they know they won’t interact well. In the US with the stringent sick/holiday limits, this is rare.

Solutions to Mitigate Risk 

Now there are productive ways to minimize risk and improve communication. One way to do this is by carefully looking at your structure and goals and building an appropriate team size while taking additional actions to mitigate risk. Another effective technique for risk mitigation is through training. You shouldn’t be surprised, however, that my preferred method to minimize risk is by developing frameworks and using tests that are readable by anyone on your team.

The case for continuing education

Do you have job security? The surprising value in continuing education.

If the last 2 decades have taught us anything about change, they’ve shown that while software development may be one of the most rapidly growing and well-paid industries, it can also be highly unstable.

You may already invest in professional development in your free time. In this piece, I’ll show you how to convince your employer to invest in professional development as part of your job.

My Personal Story

I started my first software job at the height of the dot-com boom. I’d yet to finish my degree, but this didn’t matter because the demand for developers meant that just about anyone who merely knew what HTML stood for could get hired. Good developers could renew contract terms 2 or 3 times per year. Insanity reigned and some developers financially made out like bandits.

Of course, then the crash came. The first crash happened just about the time I finished my degrees. By the time I graduated, I’d gone through three rounds of layoffs during my internships. By the time I actually started full-time work, things had stabilized a bit, with layoffs settling down to once-a-year events in most companies. In 2007 we saw an uptick to a twice-a-year layoff habit in some companies, but then it quieted down again.

Of late, in most companies and industries software developer layoffs are less frequent. The more significant problem is, in fact, finding competent brains and bodies to fill open positions adequately.

My initial move into consultancy stemmed from a desire to take success into my own hands. Contracting and fulfilling specific project needs leaves me nimble and in control of my own destiny. My success is the happiness of my customer, and that is within my power.  Indeed, I am not immune to unexpected misfortune, but I rarely risk a sense of false security.  And I particularly enjoy the mentoring aspect of working as a consultant.

Despite the growth, I’d say software is still a boom and bust cycle.

Despite the relative calm (for the developers, not the companies), I think that as software developers it is wise to accept that our work can vanish overnight or our salaries be cut in half next month. Some people even leave the industry in hopes of better job security, while others deny the possibility that misfortune will ever knock on their door.

Not everyone has the desire, personality, or aptitude to be a consultant. However, everyone does have the ability to plan for and expect change. I wager that in any field it is wise to always have your next move in the back of your mind. This need to be prepared is particularly true in the area of software development. And while some people keep their resumes fresh and may even make a habit of annual practice interviews, others have no idea which steps they’ll need to take to land their next job.

Landing that next job takes some steps, and while the most straightforward step may be to make sure your resume and your LinkedIn profile are fresh with the right keywords (and associated skills) sprinkled throughout, it is even more important to stay on top of your game professionally.

Position yourself correctly, and you will fly through the recruiters’ hands into your next company’s lap. For many companies, keywords are not enough; they also need to know that you have experience with the most current versions and recent releases. Recruiters may not be able to tell you the difference between .Net 3.5 and 4.0, but if their client asks for only 4.0, they will filter out the 3.5 candidates. Versions are tricky: Angular 1 to 2 is a pretty big change, while Angular 2 to 4 is tiny (and no, there is no Angular 3); it is not reasonable to expect recruiters to make heads or tails of these versions.

Constant Change Means Constant Learning

So how do you position yourself to leap if and when you need to? In the field of software development, new tools, methods, and practices are continually appearing. Software developers frequently work to improve and refine the trade and their products.

The result of this constant change is that software engineers who maintain legacy products are at risk of losing their competitive edge. Staying at one job often results in developers becoming experts in software that will eventually be phased out.

Not surprisingly, companies that rely on software to get their work done, but that are not actually software companies by trade, tend to overlook professional development for their employees. The decision makers at these companies concern themselves with their costs more than the competitiveness of their employees, and so they often remain entirely ignorant of the realities facing their software engineers.

In some companies, the decision makers don’t see any logic in investing in training their employees or upgrading their software when what they have works just fine. It’s easy to make a budget for a software upgrade; what is less evident is the cost of the reduced marketplace competitiveness of their employees. Even worse, some companies expect that instead of investing in training, they’ll simply hire new people with the skills they need when their existing staff gets dated.

I once met a brilliant mathematician in Indianapolis who had worked on a legacy piece of software. One day, after 40 years of loyal employment, he found himself without a job due to a simple technology upgrade. With a skill set frozen circa 1980, he ended up working the remainder of his career in his neighborhood church doing administrative tasks and errands. Most people do not want to find themselves in that position; they want to keep their economic prospects safe.

Maintain Your Own Competitive Edge

Another reason that many software engineers (and developers) move jobs every few years is to maintain their competitive edge and increase their pay. Indeed, earlier last year Forbes published a study showing that employees who stay longer than two years in a position tend to make 50% less than their peers who hop jobs.

“There is often a limit to how high your manager can bump you up since it’s based on a percentage of your current salary. However, if you move to another company, you start fresh and can usually command a higher base salary to hire you. Companies competing for talent are often not afraid to pay more when hiring if it means they can hire the best talent.”

More Important than Pay is the Software Engineer’s Fear of Irrelevance

As a software engineer working for a company that uses software (finance, energy, entertainment, you name it) there is nothing worse than seeing version 15 arrive on the scene when your firm remains committed to version 12.

Your fear is not that version 12 technology will phase out tech support, as these support windows are often a good decade in length. You fear that this release means that your expertise continues to become outdated and the longer you stay put, the harder it will be to get an interview, let alone snag a job. You feel a sinking dread that your primary skill-set is suddenly becoming irrelevant.  

Your dated skill-set has real financial implications and will eventually negatively impact your employability.

A Balancing Act

For companies, the incentive is to develop software cheaply, and cheap means that it is easy to use, quick to develop, and, let’s be realistic here, that you can Google the error message and copy your code from Stack Exchange.

A problem in software can often gobble up a few days when you are on the bleeding edge. All too often I stumble upon posts on Stack Exchange where people answer their own question, often days later; or, even worse, I see questions answered months after the request for help. It makes sense that companies want to avoid the costs of implementing new releases.

Why would companies jump on the latest and greatest when the risk of these research problems is amplified in the latest version?

Companies Are Motivated to Maintain Old Software, While Employees Are Motivated to Remain Competitive

This balancing act is a cost-transfer problem: the latest framework is a cost to companies due to the research aspect, whereas an older framework is a cost to developers by reducing their marketability. At a moment when it is hard to hire good people, it will be hard to convince developers to bear the costs of letting their skills fall out of date.

New language and framework features can add value, but they are often minor; they are frequently just new ways to do things people can already do (and the benefits only arrive after the learning curve, and even then they rarely live up to expectations; see No Silver Bullet). Chances are that the costs of learning a new version of a framework will often outweigh its benefits, especially for existing code bases.

It seems like there should be some room for the middle ground; in the past, there was a middle ground. This was called the training budget.

Corporate Costs

With software developers jumping ship every few years to maintain their competitive edge, it is understandable that some management might find it difficult to justify or even expect a return on investment on training staff. In many cases, you’d need to break even on your training investment in less than a year.

At the same time, the need for developers to keep learning will never go away. Developers are acutely aware that having out of date skills is a direct threat to their economic viability.

For the near future, developers will remain in high demand, and refusing to provide on-the-job continuing education will only backfire. Developers are in demand, and they want to learn on the job. Today we do our learning on the production code, and companies pay the price (quite likely with interest). Whereas before, developers were shipped off to conferences once a year; now they Google and read through the source code of a new framework on Stack Overflow for months as they try to solve a performance issue.

In Conclusion: Investing in Continuing Education Pays Off

The industry has gone through a lot of changes. In the dot-com boom, developers were hopping jobs at an incredible speed, and companies reacted by changing how they treated developers, cutting back on training as they saw tenures drop. This all makes perfect sense. Unfortunately, it has led developers to promote aggressive and early adoption of frameworks so that they keep their skills up to date with the market. And as more and more companies adapt to frequent updates, the pressure to do so will only increase.

Training provides a way to break the cycle and establish an unspoken agreement that companies will leave developers as competitive as they were when they were hired by regular maintenance through training. So how to support continuing education and maintain a stable and loyal development pool? Send your developers to conferences, host in-house training, lunch and learns, and so on to ensure that they feel both technically competitive and financially secure.  

Despite their reluctance, in the end there is a real opportunity and a financial incentive for companies to go back to the training-budget approach. Companies want efficient development; developers want to feel economically secure. If developers are learning, they feel they are improving their economic prospects and remaining competitive. Certainly, some will still jump ship when it suits their professional goals, but many will choose to stay put if they feel they remain competitive.

“Advancement occurs through the education of practitioners at least as much as it does by the advancement of new technologies.” Improving Software Practice Through Education



How To Play with LEGO® When You Should Be Testing

There is a gap between automating your testing and implementing automated testing. The first is taking what you do and automating it; the second is writing your tests in an executable manner. You might wonder about the distinction that makes these activities different. And indeed, some testers will view these two events as parts of a single process.

However, for someone who reads test cases, for instance your customer support, there is a big difference; what is a transparent process to one party is often an opaque process to another. In fact, if you are a nontechnical person reading this explanation, you may very well be scratching your head trying to see the difference and understand why it is crucial.

Writing Tests versus Reading Tests

Automating involves looking at your interactions and your processes, both upstream and downstream, and then automating the part in between while maintaining those interactions. This automation process should mean that nontechnical people remain able to understand what your tests do, while also providing your developers the necessary detail of how to reproduce bugs.

The second piece, communicating the detailed information to your developers, is relatively easy, as you can log any operation with the service or browser; however, explaining these results to nontechnical people is often significantly more difficult.

When it comes to communicating the information, you have a few options; the first is to write detailed descriptions for all tests. Writing descriptions works, at least initially.

Regrettably, what tends to happen is that these descriptions can end up inaccurate without any complaints from the people who read your tests. If testing goes smoothly, then nothing will happen, and no one will notice the discrepancy. The problems only arise when something goes wrong.

Meaningful Tests Deliver Confidence

The worst problem an automated test implementation can have is one that erodes confidence in the results. When a test fails and the description does not match the implementation, you have suddenly undermined confidence in the entire black box that certifies how the software works.

And now, the business analyst cannot sleep at night after release testing, because there is a nagging suspicion something might not be right.

You might respond that keeping the descriptions updated accurately should be easy, and technically that should be true. However, the reality is that description writing (and updating) is often only one aspect of someone’s job.

Breaking Down the Problem

As a project progresses and accumulates tasks, particularly as it falls behind schedule and/or over budget, the individuals writing the descriptions are often too worried about many other details to give proper attention to their descriptions.

And whether we like it or not when we move to automated testing we become prone to hacks and shortcuts just like everyone else. There is the simple reality that testing a new feature appears to be a more pressing priority than updating the comments on some old test cases.

Moreover, above and beyond your ability to remember or accurately update the comments, there is another point to consider. There is an underlying problem with these detailed descriptions as they violate one of the more useful rules of programming: DRY (Don’t Repeat Yourself).

One of the most important reasons to practice DRY is that duplicates all too easily get out of sync and cause systems to behave inconsistently. In other words, two bits that should do the same thing now do slightly different things. Oops. Or, in the case of documentation and automated tests, two bits that should mean the same thing are now out of sync. Double oops.

How do we avoid duplication and implement DRY?

We can use technologies that address this problem, such as a natural-language API, so that your tests read like English.

For Example:
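A minimal sketch of what such a natural-language (fluent) API can look like; every name here is hypothetical, invented for illustration:

```python
class Account:
    """Hypothetical fluent test helper; all names are illustrative."""

    def __init__(self, owner):
        self.owner = owner
        self.balance = 0

    def deposits(self, amount):
        self.balance += amount
        return self  # returning self lets the calls chain like a sentence

    def withdraws(self, amount):
        self.balance -= amount
        return self

    def should_have_balance(self, expected):
        assert self.balance == expected, (
            f"{self.owner} has {self.balance}, expected {expected}")
        return self

# Reads almost like English:
Account("Alice").deposits(100).withdraws(30).should_have_balance(70)
```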

This example is readable and executable in just about every language, but regrettably, to get to this form you will have to write a lot of wrappers and helpers so that the syntax is easy to follow.

And this means that you will likely need to rely on writers with significant experience in your specific software language, and you may not have someone with the right expertise on hand.

An alternative is to create a domain-specific language in which you write your tests. A domain-specific language means that you create tests in something like Gherkin/Cucumber, or you write a proprietary parser and lexer. Of course, this path again relies on someone who has a lot of experience with API and language design.
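To make the parser route concrete, here is a toy sketch of the step-matching machinery at the heart of such a DSL (all step names are hypothetical; real tools like Cucumber do far more):

```python
import re

# Registry mapping step patterns to handler functions.
STEPS = []

def step(pattern):
    """Decorator that registers a step pattern, Cucumber-style."""
    def register(fn):
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register

@step(r"I open a loan for \$(\d+)")
def open_loan(amount):
    return f"loan opened for ${amount}"

def run(line):
    """Dispatch one line of a test script to its matching step."""
    for pattern, fn in STEPS:
        match = pattern.fullmatch(line)
        if match:
            return fn(*match.groups())
    raise ValueError(f"no step matches: {line!r}")

print(run("I open a loan for $5000"))  # loan opened for $5000
```

Even this toy shows why the proprietary route is expensive: pattern design, dispatch, and error reporting all become your team's maintenance burden.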

The Simplest Solution

To me, the preferred method is to use Gherkin, mostly because it is easier to maintain after your architect wins the lottery and moves to Bermuda. With Gherkin, when you run into a problem you can Google the answer, or hire a specialist. There is a sense of mystery and adventure about working on issues where you can’t Google the answer; at the same time, it’s not necessarily a reliable business practice.

The most significant benefit that I’ve discovered is that for many people, this method no longer feels like programming. This statement undoubtedly seems odd coming from a programmer, but hear me out as there is a method to my madness.

Solutions for the People You Have

To begin, let’s acknowledge that there is a shortage of programmers, especially programmers who are experts in your particular business. Imagine if you had a tool that you could hand to anyone who knows your field, and that would allow this individual to understand and define tests. Wouldn’t that be grand?

How would this tool look? What would it require? To accomplish readability (understanding) and functionality (definition) you’d need to be able to hand off something that is concrete (versus abstract) and applicable to the business that you are in, but most importantly it needs to have a physical appearance.


Imagine if you could design tests with LEGO®. There is nothing you can build out of Lego bricks that you couldn’t create in a wood shop. Unfortunately, most people are too intimidated to make anything in a wood shop. Woodworking is the domain of woodworkers. On the flip side, give anyone a box of Lego bricks, and they will get to work building out your request.

Software development runs into the woodworker conundrum: programming is the domain of developers. Give a layperson C#, Java or JavaScript and assign them to a project to build and they’ll get so flustered they won’t even try. Give them Lego bricks, and they will at the least try to build out their assignment.

Reducing the barrier to accomplishing something new is extremely important for adoption; we know this is a barrier to getting customers to try our software, but we often forget the same rule applies to adopting new things for our teams.

This desire for something concrete to visualize is why we come across people who would “like to learn to program” while building complicated Excel sheets filled with basic visual formulas. These folks can see Excel so they naturally can use this formula thing, and don’t even realize that they are programming. My method is similar in concept.

To successfully automate our testing, we need to reduce the barriers to trying the technology we plan to use tomorrow. As an organization adopts change, we need to find ways to make changes transparent and doable; we need to convince our people that they will succeed, or they might not even try.

Gherkin Lego bricks

As I wrote in my post Variables in Gherkin:

“The purpose of Cucumber is to provide a clear and readable language for people (the humans) who need to understand a test’s function. Cucumber is designed to simplify and clarify testing.

For me, Cucumber is an efficient and pragmatic language to use in testing, because my entire development team, including managers and business analysts, can read and understand the tests.

Gherkin is the domain-specific language of Cucumber. Gherkin is significant because as a language, it is comprehensible to the nontechnical humans that need to make meaning out of tests.

In simpler terms: Gherkin is business readable.”

This explanation shows why Gherkin is the perfect language for our Lego bricks, our testing “building blocks.” To create an infrastructure that is readable by and functional for the people you have on hand, I like to develop components that I then provide to the testers.

Instead of needing a specific technical understanding, such as a particular programming language, the testers now just need to know which processes they need to test.

For example, I would like to:

  • Open an order = a blue 2×4 brick
  • Open a loan = a blue 2×6 brick
  • Close a loan = a red 2×4 brick
  • Assign a ticket = a green 2×4 brick
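Assembled into a scenario, those bricks might read something like this (a hypothetical sketch; the step names simply mirror the bricks above):

```gherkin
Feature: Loan processing

  Scenario: Closing a loan resolves its ticket
    Given I open an order        # the blue 2x4 brick
    And I open a loan            # the blue 2x6 brick
    When I close the loan        # the red 2x4 brick
    Then I assign a ticket       # the green 2x4 brick
```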

This method addresses the issue of DRY because each “brick” owns its process; if you need to change the process, you change out the brick. If a process is broken, you pass it back to the development team, but the report is very precise. It makes the work concrete and removes a lot of the abstract parts inherent in software development.

⇨ This method addresses the issue of readability because each brick is a concrete process. Your testers can be less technical while producing meaningful automated tests.

⇨ This method solves the problem of confidence because problems are isolated to bricks. If one brick is broken, it doesn’t hint at the possibility that all bricks are broken.

⇨ This method also solves the problem of people, because it’s much easier to find testers who understand your business process and goals, such as selling and closing mortgages, without also having to understand the abstract nature of the underlying software that makes it all work.

The reality is that in this age every company has software while few are software companies.  Companies rely on software to deliver something concrete to their customers. My job as an automation and process improvement specialist is to make the testing of the software piece as transparent and as painless as possible so that your people can focus on your overarching mission.

“LEGO®is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this site”.

The Cost of Software Bugs: 5 Powerful Reasons to Get Upset

If you read the PossumLabs blog regularly, you know already that I am focused on software quality assurance measures and why we should care about implementing better and consistent standards. I look at how the software quality assurance process affects outcomes and where negligence or the effects of big data might come into play from a liability standpoint. I also consider how software testing methodologies may or may not work for different companies and situations.

If you are new here, I invite you to join us on my quest to improve software quality assurance standards.

External Costs of Software Bugs

As an automation and process improvement specialist, I am somewhat rare in my infatuation with software defects, but I shouldn’t be. The potential repercussion of said bugs is enormous.

And yet you ask, why should YOU care?

Traditional testing focuses on where in the development lifecycle a bug is found and how to reduce costs. This is the debate of Correction vs. Prevention and experience demonstrates that prevention tends to be significantly more budget-friendly than correction.

Most development teams and their management have a singular focus when it comes to testing: they want to deliver a product that pleases their customer as efficiently as possible. This self-interest, of course, focuses on internal costs. In the private sector profit is king, so this is not surprising.

A few people, but not many, think about the external costs of software defects. Most of these studies and interested parties tend to be government entities or academic researchers.

In this article, I discuss five different reasons that you as a consumer, a software developer or whomever you might be, should be concerned with the costs of software bugs to society.

#1 No Upper Limit to Financial Cost

The number one reason that we should all be concerned is that, in reality, the costs of software defects, misuse, or crime likely have no upper limit.

In 2002 NIST compiled a detailed study looking at the costs of software bugs and what we could do to both prevent and reduce costs, not only within our own companies but also external societal costs. The authors attempted to estimate how much software defects cost different industries. Based on these estimates they then proposed some general guidelines.

Although an interesting and useful paper, the most notable black swan events over the last 15 years demonstrate that these estimates provide a false sense of security.

For example, when a bug caused $500 million US in damage in the Ariane 5 rocket launch failure, observers treated it like a freak incident. At the time, little did we know that the financial cost defining a “freak incident” would grow by a few orders of magnitude just a few years later.

This behavior goes by many names: Black Swans, long tails, etc. What it means is that there will be extreme outliers. These outliers will defy any bell curve model; they will be rare, they will be unpredictable, and they will happen.


A Black Swan is an unpredictable event, so named by Nassim Nicholas Taleb in his book The Black Swan: The Impact of the Highly Improbable. It is predicted that the next Black Swan will come from cyberspace.

A long tail refers to a statistical distribution in which most events happen within a specific range, while a few rare events occur far out at the end of the tail. https://en.wikipedia.org/wiki/Long_tail

Of course, it is human nature always to try and assemble the clues that might lead to predicting a rare event.

Let’s Discuss Some Examples:

4 June 1996

A 64-bit floating-point value is converted to a 16-bit signed integer, and 500 million dollars goes up in flames. As you see in Table 6-14 (page 135) of the previously mentioned NIST study, the estimated annual cost of software defects for an aerospace company of this size was only $1,289,167. Five hundred million blows that estimate right out of the water.

This single bug cost 200 times the expected annual cost of defects for a company.
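The failure mode is easy to reproduce in miniature: a value that fits comfortably in 64 bits silently wraps around when forced into a signed 16-bit integer (a sketch of this class of bug, not the actual Ariane code):

```python
import ctypes

# Signed 16-bit integers hold -32768..32767; anything larger wraps around.
flight_value = 65535                          # fine as a 64-bit quantity
converted = ctypes.c_int16(flight_value).value
print(converted)  # -1: the value has silently wrapped
```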

May 2007

A startup routine for engine warm-up is released with some new conditions. The 2002 estimate for the automotive industry’s cost of software bugs, per company per year, as seen in Table 6-14 (page 135, “Company Costs Associated with Software Errors and Bugs”), was only $2,777,868 for a company bigger than 10,000 employees. That is not even a dent in the cost to Volkswagen: this code cost Volkswagen 22 billion dollars.

That equates to about 10,000 times the expected costs of defects per company per year.


Unfortunately, when it comes to liability, it seems only academics are interested in this type of prediction. But given the possibility of exponential costs to a single company, shouldn’t we all be concerned?

#2 Data Leaks: Individual Costs of Data Loss?

Data leaks of 10-100 million customer records are becoming routine. These leaks are limited by the size of the datasets and are thus unlikely to grow much more, in large part because not many companies have enough data to suffer breaches of billions of records.

Facebook has roughly 2 billion users, so the theoretical ceiling for a single breach is set by Facebook, or perhaps by a Chinese or Indian government system. There are only 7.5 billion people on earth; to have a breach of 10 billion users, we would first need more people.

Security Breaches are limited by the Human Population Factor

That is what makes security breaches different: the population cap tells us only that breaches will keep approaching the theoretical limit of how bad they can be. The Equifax breach alone affected 143 million users.

When it comes to the monetary damages of a data breach, however, there is no such limiting factor as population size.

As we saw with Yahoo and more recently Equifax, cyber-security incidents show a similar pattern of exponential growth in costs. Direct financial costs are trackable, but the potential for external costs and risks should concern everyone.

#3 Bankrupt Companies and External Social Costs

From its inception, no one would have predicted that the simple code pasted below might cost VW $22 billion US:

```c
if (-20 /* deg */ < steeringWheelAngle && steeringWheelAngle < 20 /* deg */)
{
    lastCheckTime = 0;
    cancelCondition = false;

    if (lastCheckTime < 1000000 /* microsec */)
    {
        lastCheckTime = lastCheckTime + dT;
        cancelCondition = false;
    }
    else cancelCondition = true;
}
else cancelCondition = true;
```


Even if you argue that this is not an example of a software defect but rather deliberate fraud, it is unlikely you would have predicted the real cost. Certainly, this one was different: unexpected, not conforming to our expectations of a software defect. But that is the definition of a Black Swan. They do not conform to expectations, and as happened here, the software did not act according to expectations. The result cost billions.

How many companies can survive a 22 billion dollar hit? Not many. What happens when a company we rely on heavily suddenly folds? Say, the company that manages medical records in a fifth of US states? Or a web-based company that provides accounting systems to clients in 120 countries that simply turns off?

#4 Our National Defense is at Risk

It doesn't take much to understand the significance of this one, and it is currently in the limelight. Software defects, faults, and errors have the potential to produce extreme costs despite their infrequent occurrence. Furthermore, the origins of the costs of long-tail events may not always be predictable.

After all, what possible liability would Facebook have for real-world damages regarding international tampering in an election? It is all virtual, just information; until that information channel is misused.

There is very little chance that when Facebook's actuaries thought about risk, they looked for election interference. Sure, they considered liability, such as people live-broadcasting horrible and inhumane things, but did they contemplate foreign election interference? And even if they did consider the possibility, how could they have predicted or monitored the entry point?

And that is the long-tail effect; it is not what we know or can imagine, it is the unexpected. It is the bug that can't be patched because the rocket has already exploded; it is the criminal misuse of engine-optimization routines, or the idea that an election could be swayed by misinformation. These events are so costly that we cannot assume we know how bad it could be, because the nature of software means things will get as bad as they possibly can.

#5 Your Death or Mine

Think of the movie It, based on Stephen King's book of the same name: a clown that deceives children and leads them to death and destruction. What happens when a piece of equipment runs haywire, masquerading as one thing while doing another? Software touches enough aspects of our lives, from the hospital setting to self-driving cars, that a software defect could undoubtedly lead to death.

We’ve already had a case, presumably settled out of court, where a Therac-25 radiation therapy machine irradiated people to death. What happens when a cloud update to a control system removes fail-safes on hundreds or thousands of devices in hospitals or nursing homes? Who will be held liable for those deaths?

Mitigation is often an attempt at Prediction

A large part of software quality assurance is risk mitigation as an overlapping safety net to look for unexpected behaviors. Mitigation is an attempt to make it less likely that your company unintentionally finds the next “unexpected event.”

A lot has been written about the optimal way to get test coverage on your application. Most of it comes down to testing the system at the lowest feasible level (unit tests), which has given us the testing pyramid. Mathematically, that is the cheapest approach. Unfortunately, the pyramid assumes there are no gaps in coverage, and less overlap means a gap at a lower level is less likely to be caught at a higher level.

The decision about test coverage and overlapping coverage can be approximated using Bernoulli trials, each of which delivers one of two results: success or failure.
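Treating each test layer as an independent Bernoulli trial (an idealizing assumption; the catch rates below are invented for illustration), the chance a defect escapes every layer is the product of the individual miss rates:

```python
def escape_probability(catch_rates):
    """Chance a defect slips past every test layer, modeling each
    layer as an independent Bernoulli trial (caught / not caught)."""
    missed = 1.0
    for rate in catch_rates:
        missed *= (1.0 - rate)
    return missed

# Illustrative catch rates for unit, integration, and end-to-end layers.
print(escape_probability([0.90]))              # about 0.10
print(escape_probability([0.90, 0.50, 0.30]))  # about 0.035
```

Overlap helps only where the layers genuinely cover the same behavior; a gap shared by all layers multiplies nothing away.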

Prioritizing the Magnitude Of Errors and their Effects

When we take the expected chance of a defect and multiply it by the cost of a defect, we can compare that to the chance of a defect with overlapping coverage, multiplied by the same cost.

We are usually looking at the cost of reducing the chance of a defect slipping through and comparing that to our estimated cost of a defect.

Unfortunately, the likelihood that we underestimate the cost of a defect due to long-tail effects is very high. Yes, it is improbable that your industry will have a billion-dollar defect discovered this year; but how about in the next ten years? Now the answer becomes a maybe. Call it a 10% chance over the decade, roughly 1% per year, and say there are 100 companies in your industry. What is the cost of one of those outlier defects per year?

1,000,000,000 * 0.01 (1% chance per year) * 0.01 (1% chance of it hitting your company) = 100,000 per year as the expected cost of outlier defects.

The problem with outlier events is that despite their rarity, and despite the small probability that your company will be the victim, the real outliers have the potential to be so big and so expensive that it may, in fact, be worth your time to consider the possibility.
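The arithmetic above generalizes to a one-line expected-value sketch (the figures are the hypothetical ones from the text, not industry data):

```python
def expected_outlier_cost(defect_cost, annual_probability, companies):
    """Expected annual cost to one company of an industry-wide outlier
    defect, assuming the hit lands on companies uniformly at random."""
    return defect_cost * annual_probability / companies

# A $1B defect, ~1% chance per year, spread across 100 companies.
print(expected_outlier_cost(1_000_000_000, 0.01, 100))  # about $100,000 per year
```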

Enduring the Effect of a Black Swan

In reality, companies might use bankruptcy law to shield themselves from the full cost of one of these defects. VW bore the financial burden of its expensive defect only because it could afford to do so without going bankrupt. Most companies could not pay the costs of this type of event and would ultimately be forced to dissolve.

We cannot continue to ignore that software defects, faults, errors, etc. have the potential to produce extreme costs, despite infrequent occurrences. Furthermore, the origins of the costs of long tail events may not always be predictable.

The problem with the "rarity of an event" as an insurance policy is that the risk of significant black-swan bug events goes beyond the simple financial costs borne by individual companies. The weight of these long-tail events is borne by society.

And so the question is, for how long and to what extent will society continue to naively or begrudgingly bear the cost of software defects? Sooner or later the law will catch up with software development. And software development will need to respond with improved quality assurance standards and improved software testing methodologies.

What do you think about these risks? How do you think we should address the potential costs?


Tassey, G., Ph.D. (2002, May). Report 02-3: The Economic Impacts of Inadequate Infrastructure for Software Testing [PDF]. Gaithersburg, MD: RTI for the National Institute of Standards and Technology.

A Crucial Look at the Unethical Risks of Artificial Intelligence

Artificial Intelligence Pros and Cons:

As much as we marvel at the discoveries of AI and prediction engines and their benefits to society, we also recoil at some of their findings. We can't make the correlations this software discovers go away, and we can't stop the software from re-discovering the associations in the future. As decent human beings, we certainly wish to avoid our software making decisions based on unethical correlations.

Ultimately, what we need is to teach our AI software lessons to distinguish good from bad…

Unintended results of AI: an example of the disadvantage of artificial intelligence.

A steady stream of findings already makes it clear that AI efficiently uses data to determine characteristics of people. Simplistically speaking, all we need to do is feed a bunch of data into a system, and the system figures out formulas from that data to determine an outcome.

For example, more than a decade ago, in university classes, we ran some tests on medical records, trying to find people who had cancer. We coded the presence of disease into our training data, which we then scanned for correlations with the other medical codes present.

The algorithm ran for about 26 hours. In the end, we checked the results for accuracy, and needless to say, the system returned fantastic results. It reliably homed in on a medical code that predicted cancer; more specifically, the presence of tumors.

Of course, at the outset, we'd like to assume this data will go to productive, altruistic uses. At the same time, I'd emphasize that the algorithm's finding was an obvious "well, of course that is the case" result, which demonstrates precisely that such a program can discover correlations without being explicitly told what to look for…
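A minimal sketch of that kind of scan (the toy data and code names here are hypothetical, not from the original class project) is just co-occurrence counting: for each code, what fraction of records carrying it also carry the target label?

```python
from collections import Counter

def code_label_rates(records):
    """For each code, the fraction of records carrying that code
    which also carry the target label."""
    seen = Counter()
    labeled = Counter()
    for codes, has_label in records:
        for code in codes:
            seen[code] += 1
            if has_label:
                labeled[code] += 1
    return {code: labeled[code] / seen[code] for code in seen}

# Toy records: (codes present, cancer diagnosis present?)
records = [({"tumor", "fever"}, True),
           ({"tumor"}, True),
           ({"fever"}, False)]
print(code_label_rates(records))  # rates: tumor 1.0, fever 0.5
```

A real system would weigh sample sizes and statistical significance, but the principle is the same: the data alone surfaces the discriminating codes.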

Researchers might develop such a program with the intention to cure cancer, but what happens if it gets into the wrong hands? As we well know, not everyone, especially when driven by financial gain, is altruistically motivated. Realistically speaking, if we use a program looking for correlations to guide research leading to scientific discoveries for good intent, it can also be used for bad.

The Negative Risk: Unethical Businesses

By function and design, algorithms naturally discriminate. They distinguish one population from another. The basic principles can be used to determine a multitude of characteristics: sick from healthy, gay from straight, black from white.

Periodically the news picks up an article that illustrates this facility. Lately, it’s been a discussion of facial recognition. A few years ago the big issue revolved around Netflix recommendations.

The risk is that this kind of software can likely figure out, for example, whether you are gay, with varying levels of certainty. Depending on the data available, AI software can figure out all sorts of other information that we may not want it to know or may not intend for it to understand.

When it comes to the ethics and adverse effects of artificial intelligence, it's all too easy to toss our hands in the air and have excited discussions around the water cooler or over the dinner table. What we can't do is simply make it go away. This is a problem that we must address.

Breakthrough: The Problem is its own Solution

Up to this point, my arguments may sound depressing. The good news is that the source of the problem is also the source of the solution.

If this kind of software can determine from data sets the factors (such as the presence of tumors) that we associate with a discriminating outcome (such as the presence of cancer), then we can take these same algorithms and tell our software to ignore those factors.

If we don't want to know this kind of information, we simply ignore this type of result. We can then test to verify that our directives are working and that our other algorithms are not relying on the specified factors.

For instance, say we determine that, as part of assessing the risk of delinquent payment on a mortgage, our algorithm can also determine gender, race, or sexual orientation. Rather than using this data, which would make the calculation more than a wee bit racist, sexist, and bigoted, we could ask the algorithm to ignore it when calculating a mortgage rate recommendation.
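One way to enforce "ignore said data" is a deny-list gate in front of the decision algorithm. A sketch, with hypothetical feature names:

```python
PROTECTED = {"gender", "race", "sexual_orientation"}

def scrub_features(features: dict, strict: bool = False) -> dict:
    """Drop protected attributes before a record reaches a decision
    algorithm; in strict mode, reject the record instead."""
    blocked = PROTECTED & features.keys()
    if blocked and strict:
        raise ValueError(f"protected attributes present: {sorted(blocked)}")
    return {k: v for k, v in features.items() if k not in PROTECTED}

applicant = {"income": 72000, "debt_ratio": 0.31, "gender": "F"}
print(scrub_features(applicant))  # {'income': 72000, 'debt_ratio': 0.31}
```

Dropping the columns is only a first step: proxies for protected attributes (zip code, name, purchase history) can leak the same information, which is why the verification testing described above matters.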

In fact, we could go even further. Just as we have equal housing and equal employment legislation, we could legislate that if a set of factors can be used to discriminate, then software should be instructed to disallow combining those elements in a single algorithm.

Discussion: Let’s look at an analogy.

Generally speaking, US society legislates that methamphetamine is bad and that people should not make it; yet at the same time, the recipe is known, and we can't uninvent meth.

An unusual tactic is to publicize the formula and tell people not to "accidentally" mix the ingredients in their bathtub. If we find people preparing to combine the known ingredients, we can then, of course, take legal action.

For software, I'd recommend that we take similar steps and implement a set of rules. If and when we determine the possible adverse outcomes of our algorithms, we can require that users (business entities) not combine said pieces of data into a decision algorithm, making an exception, of course, for those doing actual constructive research into data-ethics issues.

The Result: Constructing and/or Legislating a Solution

Over time our result would be the construction of a dataset of ethically sound and ethically valid correlations that could be used to teach software what it is allowed to do. This learning would not happen overnight, but it also might not be as far down the line as we first assume.

The first step would be to create a standard data dictionary where people and companies would be able to share what data they use, similar to elements on the chemical periodic table. From there we would be ready to look for the good and the bad kinds of discrimination. We can take the benefits of the good while removing the penalties from the bad.

This process might mean that some recommendation engines would have to ask whether they are allowed to utilize data that could discriminate on an undesirable metric (like race). And it might mean that in some cases it would be illegal to combine specific pieces of data, as in a mortgage rate calculation.

No matter what we choose to do, we can't close Pandora's box. It is open; the data exists, the algorithms exist, and we can't make that go away. Our best bet is to put in the effort to teach software ethics, first by hard rules, and then hopefully let it figure some things out on its own. If Avinash Kaushik's predictions are anywhere near accurate, maybe we can actually teach software to be better than humans at making ethical decisions; only the future will tell!

If you’re curious about the subject of AI and Big Data read more in my piece Predicting the Future.

Why Negligence in Software should be of Urgent Concern to You

The future of Liability in Software:

Many things set software companies apart from other businesses. A significant but often overlooked difference is that the manufacturers of software exhibit little fear of being sued for negligence, including for defects or misuse of their software. For the moment, the legal consequences of defective software remain marginal.

After more than a decade, even efforts to reform the Uniform Commercial Code (UCC) to address particularities of software licensing and software liabilities remain frozen in time. As Jane Chong discusses in We Need Strict Laws, the courts consistently rule in favor of software corporations over consumers, due to the nature of contract law, tort law, and economic loss doctrine. In general, there is little consensus regarding the question: should software developers be liable for their code?

Slippery When Wet

If you go to your local hardware store, you'll find warning signs on the exterior of an empty bucket. Look around your house or office and you'll see warning labels on everything from wet-floor signs to microwaves and mattresses.

In software, if you are lucky, you might find an EULA buried behind a link on some other page. Most products have a user manual; why is it not enough to print the "warning" inside? It would be easy to find and easy to implement as a standard.

Legal issues in software development: why is there no fear?

Fear is socially moderated and generated; hence the term "mass hysteria." We fear things that we've already experienced or that have happened to personal connections, and we all too easily imagine horrible events that have befallen others. In an age of multimedia, this rather flawed system has gone berserk, or as they say, "viral." What we see has left us with some pretty weak estimates of the odds of events like shark attacks or kidnappings.

One reason we don't fear lawsuits around software is that we don't see them in the public sphere. They do happen, but all too often the cases never make it very far, or the judge rules in favor of the software developer. Interpretation of the laws makes it difficult to prove or attribute harm to a customer.

To date, we’ve yet to see a Twittergate on negligence in software development. This lack of noise doesn’t mean that no one has reasons to sue. Instead, it is more of an indicator that for the moment, the law and the precedent are not written to guide software litigation. And with little news or discussion in the public sphere, no one has the “fear.”


A Matter of Time

Frankly, it is not a matter of will it happen, but when? What is the likelihood that in the next five years there will be successful suits brought against software firms? Should we make a bet?

Software development is an odd industry. We leave an incredible electronic trail of searchable data for everything that we do. We track defects, check-ins, test reports, audit trails, and logs, and then we back up all of them. Quite often these records last forever, or at least as long as, if not longer than, the company that created them.

Even when we change from one source control system to another, we try to make sure that we keep a detailed record intact just in case we want to go back in time.

This level of record keeping is impressive. The safety it provides and the associated forensic capabilities can be incredibly useful. So what is the catch? There is a potential for this unofficial policy of infinite data retention to backfire.

Setting the Standard

Most companies follow standard document-retention policies that ensure the business saves both communications and artifacts to meet fiscal or regulatory requirements for a required period, after which they are eventually purged.

Ironically, even in companies that follow conservative document-retention policies, the source control and bug tracking systems are often completely overlooked, if not flat-out ignored. From a development perspective, this makes sense: data storage isn't expensive, so why not keep it all?

The Catch & The Cause

The reason that document retention policies exist is not merely to keep companies organized and the IRS happy; it's the potential for expensive lawsuits. Companies want to be able to answer categorically why certain documents do or do not exist.

For instance, say your company makes widgets and tests them before shipping; you don't want to have to say, "we don't know why we don't have those test results." By creating a documented process around the destruction of data (and following it), you can instead point to the document and say the data does not exist: it was destroyed according to "the policy."

Policy? What Policy?

This example takes us back to the story of liability in software. In the software business, we often keep data forever, but we also delete data in inconsistent bursts. Maybe we need to change servers, or we are running out of space, or we find random backups floating around on tapes in vaults. So we delete it, or shred it, or decide to move to the Cloud to reduce the cost of storage.

This type of data doesn’t tend to carry any specific policy or instructions for what we should and shouldn’t include, or how best to store the data, and so on.

What's more, when we document our code and record check-ins, we don't really think about our audience. Our default audience is ourselves or the person in the cube next to us. And our communication skills on a Friday night after a 60-hour week won't produce the finest or most coherent check-in comments, especially if the audience ends up being someone besides our cubemate.

The reason this type of mediocre record keeping persists is that it remains difficult to sue over software. There really are no clear-cut ways to bring a suit over defective services. If you live in the US, you know the fear of lawsuits over slips and falls; no such fear exists for creating a website and accepting user data.

Walking a Thin Line

My guess is that this indifference to record keeping and data retention will persist as long as potential plaintiffs must do all the groundwork before seeing any money, and as long as judges continue to side with corporations and leave plaintiffs in the lurch. However, as soon as the first case sneaks through the legal system and sets a precedent, anyone and everyone may reference that case.

Ironically, the legal theories presented in a trial carry no patent or copyright protection, which means that once a case makes it through the system, it only needs to be referenced. If one lawyer learns how to sue us, they all do. Think of it as an open-source library: once it exists, anyone gets to use it.

I expect there will be a gold rush; we are just waiting for the first prospector to come to town with a bag of nuggets.

As for what companies can do: for now, create an inventory of what data you keep and how it compares to any existing policies. This may involve sitting down in a meeting that is an awkward mix of suits and shorts, where there will likely be a slide titled "What is source control?" There is no right answer; this is something every company must decide for itself.

Where does your development process leave a data trail? Has your company had discussions about document retention and source control?

How to Effortlessly Take Back Control of Third Party Services Testing

Tools of the Trade: Testing with SaaS subsystems.

For the last few years, the idea has been floating around that every company is a software company, regardless of its actual business mission. Concurrently, ever more companies are dependent upon 3rd party services and applications. From here it is easy to extrapolate that the more we integrate, the more likely it is that at some point every company will experience problems with updates, downtime, and so on.

Predicting when downtime will happen, or forcing 3rd party services to comply with our needs and wishes, is difficult if not impossible. One solution to these challenges is to build a proxy. The proxy allows us to regain a semblance of control and to test 3rd party failures. It won't keep the 3rd parties up, but it lets us simulate failures whenever we want.

As an actual solution, this is a bit like building a chisel with a hammer and an anvil. And yet, despite the rough nature of the job, it remains a highly useful tool that facilitates your work as a Quality Assurance Professional.

The Problem

Applications increasingly use and depend upon a slew of 3rd party integrations. In general, these services tend to maintain decent uptime and encounter only rare outages.

The level of reliability of these services leads us to continue to add more services to applications with little thought to unintended consequences or complications. We do this because it works: it is cheap, and it is reliable.

The problem (or problems) that arise stem from the simple nature of combining all of these systems. Even if each service maintains good uptime, errors and discordant downtime may result in conditions where the fraction of time that all your services are concurrently up is not good enough.

The compounding uptimes

Let’s look at a set of services that individually boast at least 95% uptime. Let’s say we have a service for analytics, another for billing, another for logging, another for maps, another for reverse IP, and yet another for user feedback. Individually they may be up 95% of the time, but let’s say that collectively the odds of all of them being up at the same time is less than 75%.

As we design our service, working with an assumption of around 95% uptime feels a lot better than working with a chance of only 75%. To exacerbate the issue, what happens when you need to test how these failures interact with your system?
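Assuming independent outages, joint availability is the product of the individual uptimes, which is how six individually "good" services add up to a mediocre system:

```python
def joint_uptime(uptimes):
    """Probability that every service is up at the same moment,
    assuming outages are independent."""
    p = 1.0
    for u in uptimes:
        p *= u
    return p

# Six services (analytics, billing, logging, maps, reverse IP,
# user feedback), each at 95% uptime.
print(round(joint_uptime([0.95] * 6), 3))  # 0.735
```

In practice outages are often correlated (shared regions, shared DNS), so the real figure can be better or worse, but the multiplicative decay is the point.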

Automated Testing is Rough with SaaS

Creating automated tests around services being down is not ideal. Even if the services consumed are located on site, it is likely difficult to stop and start them from within tests. Perhaps you can write some PowerShell and make the magic work. Maybe not.

But what happens when your services are not even on site? The reality is that a significant part of the appeal of third-party services is that businesses don't really want onsite services anymore. The demand is for SaaS offerings that remove us from the maintenance and upgrade loop. The downside of SaaS is that suddenly turning a service off becomes much more difficult.

The Solution: Proxy to the Rescue

What we can do is use proxies. Add an internal proxy in front of every 3rd party service, and now there is an "on/off switch" for all the services and a way to do 3rd party integration testing efficiently. This proxy setup can also simulate responses under certain conditions (like a customer ID that returns a 412 error code).

Build or buy a proxy with a simple REST API for administration, and it becomes easy to run tests that simulate errors from 3rd party providers. We can even simulate a full outage in which the entire provider is down.
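A sketch of the idea (class and method names are hypothetical; a real deployment would expose the switch over a small REST admin API rather than a direct method call):

```python
import urllib.request

class FailureProxy:
    """Stand-in proxy for one 3rd party service: pass requests through
    to the upstream unless a simulated failure is switched on."""

    def __init__(self, upstream_url: str):
        self.upstream_url = upstream_url
        self.forced_status = None  # e.g. 412 or 503; None means passthrough

    def simulate_failure(self, status_code: int) -> None:
        self.forced_status = status_code

    def reset(self) -> None:
        self.forced_status = None

    def get(self, path: str):
        if self.forced_status is not None:
            return self.forced_status, b""  # simulated error, no network call
        with urllib.request.urlopen(self.upstream_url + path) as resp:
            return resp.status, resp.read()

# In a test: flip billing into a 412 state, run the suite, then reset.
billing = FailureProxy("https://billing.example.com")
billing.simulate_failure(412)
print(billing.get("/accounts/bob"))  # (412, b'') without touching the network
```

Off-the-shelf tools such as WireMock or Toxiproxy provide the same fault-injection switch with far more options; the sketch just shows how little machinery the core idea needs.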

By creating an environment isolated by proxies, test suites can be confidently run under different conditions, providing us with valuable information as to how various problems with an application might impact our overall business service.

Proxy Simulations

Once we have a proxy between our service and the 3rd party service, we can also add the service-oriented analog of a mock: a proxy that generates response messages for specific conditions.

We would not want to do this for all calls, but for some. For example, say we would like to tell it that user “Bob” has an expired account for the next test. This specification would allow us to simulate conditions that our 3rd party app may not readily support.

Customer specific means better UX

By creating our own proxy, we can return custom responses for specific conditions. Most potential conditions can be simulated and tested before they happen in production. And we can see how various situations might affect the entire system. From errors to speed, we can simulate what would happen if a provider slows down, a provider goes down, or even a specific error, such as a problem that arises when closing out the billing service every Friday evening.

Beyond Flexibility

Initially, the focus might be on scenarios that simulate previous failures of your 3rd party provider; but you can also test conditions that your 3rd party may not offer you in a test environment, for instance, expiring accounts in mobile app stores. With a proxy, you can do all of this merely by keying off specific data that you know will come in the request.

In Conclusion: Practical Programming for the Cloud

This proxy solution is likely not listed in your requirements. At the same time, it is a solution to a problem that in reality is all too likely to arise once you head to production.

In an ideal world, we wouldn’t need to worry about 3rd party services.

In an ideal world, our applications would know what failures to ignore.

In the real world, the best course is preparation: this allows us to test and prevent outages.

In reality, we rarely test for the various conditions that might cause outages or slowdowns. Working with or through a 3rd party to simulate an outage is very difficult, if not impossible. You can try calling Apple Support to request that they turn off their test environment for App Store services, but they most likely won't.

This is essentially a side effect of the Cloud. The Cloud makes it easy to add new and generally reliable services, which also happen to be cheap, making the Cloud an all-around good business decision. It should not be surprising, then, that when you run into testing problems, an effective solution also comes from the Cloud: spinning up small, lightweight proxies in a test environment is a practical fix for a problem of the Cloud.