To Test The Untestable Bug
I found a bug that couldn’t be tested yesterday.
There is nothing quite so satisfying as isolating a bug, writing a test that fails because of that bug, and then making the test pass by fixing the bug. The fact that the test is passing is proof that the bug is fixed. The existence of that test in the test suite ensures that the bug will never come back again.
For ten years I have worked this way. For ten years I have taught, and coached, and otherwise cajoled software developers to work this way too. Sometimes those developers would ask me about bugs that couldn’t be tested; and I would reject the whole concept. “If you can write a line of code, you can test that line of code,” I would reply.
But yesterday I found a bug that I couldn’t write a test for.
Actually, that’s not quite fair. I was able to write a test for the bug. The problem was that the test passed, and I couldn’t make it fail.
But let me set the stage first…
My son Justin is apprenticing with me this summer. He’s been helping me write new features for FitNesse. We have had many pleasant weeks working together.
About a week ago I asked him to rework an area of the FitNesse code that includes SuiteSetUp and SuiteTearDown pages into test suites. These pages are executed once per suite run. They are similar to @BeforeClass and @AfterClass functions in JUnit.
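For anyone who hasn’t used those JUnit annotations, a minimal sketch of the analogy might look like this (the class and method names are hypothetical, not FitNesse code):

    import org.junit.AfterClass;
    import org.junit.BeforeClass;
    import org.junit.Test;

    // Hypothetical JUnit 4 class: @BeforeClass/@AfterClass run once for the whole
    // class, much as SuiteSetUp/SuiteTearDown run once for a whole FitNesse suite.
    public class SuiteLifecycleSketch {
        @BeforeClass
        public static void suiteSetUp() { /* runs once, before any test */ }

        @Test
        public void firstTestPage() { /* analogous to one test page */ }

        @Test
        public void secondTestPage() { /* analogous to another test page */ }

        @AfterClass
        public static void suiteTearDown() { /* runs once, after all tests */ }
    }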
The old mechanism started with the suite page and found the nearest inherited SuiteSetUp and SuiteTearDown for that page. Then it would include the SuiteSetUp, all the tests, and finally the SuiteTearDown. The new mechanism needed to be smarter. It would look at every test page and determine the most appropriate SuiteSetUp and SuiteTearDown for that page. Then it would group the test pages by common SuiteSetUp and SuiteTearDown pairs and include them in the suite.
OK, I know this is a bit detailed, but I’m almost done. The point is that the old mechanism created a list of tests that looked like this:

Utttt...ttttD

Whereas the new mechanism created a list of tests that looked like this:

UtttDUttDUtttttDUtD...

where U and D are SuiteSetUp and SuiteTearDown pages, and the t’s are test pages.
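A rough sketch of that grouping idea in code, under hypothetical names (this is not the actual FitNesse implementation): map each test page to its most appropriate SuiteSetUp/SuiteTearDown pair, group the pages by that pair, and emit U, the group’s tests, then D for each group.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of grouping test pages by their SuiteSetUp/SuiteTearDown pair.
    class SuiteGroupingSketch {
        // pairForTest maps a test page name to [setUpPage, tearDownPage].
        static List<String> buildRunOrder(Map<String, List<String>> pairForTest) {
            Map<List<String>, List<String>> groups = new LinkedHashMap<>();
            pairForTest.forEach((test, pair) ->
                    groups.computeIfAbsent(pair, p -> new ArrayList<>()).add(test));

            List<String> runOrder = new ArrayList<>();
            groups.forEach((pair, tests) -> {
                runOrder.add(pair.get(0));   // U: the group's SuiteSetUp
                runOrder.addAll(tests);      // t...t: the group's test pages
                runOrder.add(pair.get(1));   // D: the group's SuiteTearDown
            });
            return runOrder;
        }
    }

Fed the pages of a suite, this yields the UtttD UttD ... shape rather than one long Utttt...ttttD.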
Justin got all his unit tests passing, cleaned up the code, and then showed it to me. We spent a little more time cleaning it up, until we were both satisfied with it.
Then Justin said to me: “We should write an acceptance test for this.”
We write a lot of acceptance tests for FitNesse; and, of course, we use FitNesse to write those tests. There are over 200 FitNesse tests for FitNesse. But until Justin said it, it didn’t occur to me that this feature should have an acceptance test. But after some thought, it was clear that it should.
So Justin and I set about to write the acceptance test for this feature. It was actually pretty straightforward. We simply used a shadow instance of FitNesse to create a test suite with several test pages and appropriate SuiteSetUp and SuiteTearDown pages. Then we executed the suite in the shadow instance, and dissected the HTML to ensure that the setups and teardowns were included in the right order. Easy.
We used a Slim OrderedQuery table to ensure that the list of tests, setups, and teardowns was in the right order. One of the Ordered Query tables looked like this:

!|ordered query:pages run in suite|SuiteChildOne|
|page name|
|SuiteChildOne.SuiteSetUp|
|SuiteChildOne.TestOneOne|
|SuiteChildOne.TestOneTwo|
|SuiteChildOne.SuiteTearDown|
|SuiteChildOne.SuiteSetUp|
|SuiteChildOne.TestOneThree|
|SuiteChildOne.SuiteTearDown|
The TestOneOne and TestOneTwo pages were Fit tests. The TestOneThree page was a Slim test. These kinds of tests execute in different JVMs; therefore they needed to be run separately, with their own execution of SuiteSetUp and SuiteTearDown.
Oddly, this table failed, and the failure was inexplicable. It kept complaining that the rows were out of order, when they clearly were correct.
After a little inspection and fiddling, Justin said: “I think the Ordered Query table doesn’t like duplicates.”
Having written the Ordered Query table about 6 months ago, I reviewed the code in my mind and realized he probably had a point. The Ordered Query table was derived from the Query table. Query tables check that all specified rows exist in a query, but don’t care about order. Ordered Query tables ensure that the order is correct too. Having based the Ordered Query algorithm on the Query algorithm, I realized that there might indeed be a duplication issue.
So, we modified the test to eliminate the need for duplicates; and I wrote a lighthouse task that described the bug.
Yesterday I went to lighthouse and saw that bug there. I decided to fix it. So I wrote an acceptance test especially for it. It looked like this:

!|Ordered Query:duplicate rows|A|
|x|
|SuiteChildOne.SuiteSetUp|
|SuiteChildOne.TestOneOne|
|SuiteChildOne.TestOneTwo|
|SuiteChildOne.SuiteTearDown|
|SuiteChildOne.SuiteSetUp|
|SuiteChildOne.TestOneThree|
|SuiteChildOne.SuiteTearDown|

Note the similarity?
Oddly, the test passed. I scratched my head about this, and tried several other different combinations. They all worked just fine. It was as though the bug had disappeared.
I began to doubt that the bug had ever existed. I started rationalizing that perhaps Justin and I had been so focused on ordering the suites that we didn’t really diagnose the problem properly, and that we had made some silly cockpit error and mistaken it for a bug in OrderedQuery.
So I went to lighthouse and invalidated the bug. Fortunately for me, I left the new acceptance test in place!...
I completed a few more lighthouse tasks, did a git commit, and prepared to push the changes to github. It is our rule that you never push to github without running an ant release. This is the same script that our Hudson server runs whenever it detects a change in github. So, in theory, the Hudson build should never fail, because nobody pushes to github without running the build.
The build takes about three minutes to run, and it executes every unit and acceptance test in the system. Indeed, this same script is the sole criterion we use for deciding whether FitNesse is ready for release. If the build succeeds, we ship!
Anyway, the ant release failed; and it was the new acceptance test that was the cause!
The new test history feature of FitNesse is really useful. I was able to look at the test run that failed and see exactly what the problem was. And the problem was precisely the same problem that Justin and I saw a week ago. It looked like Ordered Query really did have a problem with duplicates. Sometimes.
So I quickly re-validated the lighthouse task and set about to diagnose the problem with a bit more care. I started up FitNesse again and ran the failing test. It passed!
Oh fratz! How was I going to debug this thing if I couldn’t get it to reliably fail? I tried several different configurations of rows and queries, but they all passed just fine. What’s more, a subsequent run of ant release passed too! I only saw it fail twice, and each time a simple recompile made the problem go away.
So I fell back on a very old skill that I haven’t had to use in over ten years. I started to desk check the code! OK, it’s a bit of an exaggeration to say that I haven’t desk checked in over ten years. Actually I desk check all the time. The point is I always have failing tests to support me. This time was different.
Actually the problem (well, I think it was the problem) was not too hard to find. Deep in the algorithm that matched query rows with table rows was a HashSet of the row numbers that had not yet been matched – creatively called unmatchedRows. Every time I matched a row, I removed the number of that row from this set.
This algorithm was written for Query tables and was inherited into Ordered Query. However, in Ordered Query I copied that set into a List for one reason or another.
All this works fine until there is a duplicate. As you might expect, in the Ordered Query the order in which the duplicates are processed is important. Unfortunately, the order of the rows in the list that was copied from the set is indeterminate.
Now, maybe you missed the importance of that last paragraph. It described the bug. So let me re-emphasize. In the case of duplicate rows I was depending on the order of elements in a list; but I built the list from a HashSet. HashSets don’t order their elements. So the order of the list was indeterminate.
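To see the hazard in isolation, here is a minimal sketch (my own illustration, not the FitNesse code): when elements carry the default identity hashCode, a List copied from a HashSet comes out in bucket order, which depends on where the objects happen to sit in memory.

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Minimal sketch of depending on the order of a List copied from a HashSet.
    public class IndeterminateOrderSketch {
        public static void main(String[] args) {
            // Plain Objects use the default identity hashCode, so their placement
            // in the HashSet's buckets depends on run-time addresses.
            Object first = new Object();
            Object second = new Object();
            Object third = new Object();

            Set<Object> unmatched = new HashSet<>();
            unmatched.add(first);
            unmatched.add(second);
            unmatched.add(third);

            // The copy reflects bucket order, not insertion order, so this index
            // is not guaranteed to be 0 and may differ between JVM runs.
            List<Object> copy = new ArrayList<>(unmatched);
            System.out.println(copy.indexOf(first));
        }
    }

An ArrayList, by contrast, preserves the order in which elements were added.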
The fix was simple. I changed the HashSet into an ArrayList. That fixed it. I think…
The problem is that I had no way to reliably create the symptom. I could not write a test that failed because I was using a HashSet instead of an ArrayList. I could not mock out the HashSet because the bug was in the use of the HashSet. I could not run the application 1000 times to statistically see the fault, because the behavior did not change from run to run. The only thing that seemed able to sometimes expose the fault was a recompile! And I wasn’t about to put my system into a recompile-til-fail loop.
Why a recompile? I suppose it has something to do with the values of internal addresses and their effect upon the hash values within the HashSet.
So, here, finally, after 10 years of being a die-hard TDDer, I have encountered a bug that I could not TDD my way out of. To solve it, I had to guess. To solve it, I had to go open-loop. To solve it, I had to assume that I had found the cause, without being sure. As a result, I am not certain that I have actually solved the problem. It may be that a day or a year from now that acceptance test will fail again; and I’ll be back scratching my head and debugging.
(sigh).
Could you have written your own memory allocator to cause collisions based on the addresses of the elements being added to the HashSet? Possibly basing the allocated locations on the memory locations of the objects put into the HashSet in the failing compiles?
I don’t think I would even have been able to get as close to testing it as you did, so I was just wondering if that could work.
I remember writing my first piece of untested code after using tests exclusively for so long. It was actually not that the code was untestable so much as that it was for an organization that didn’t use tests and … never mind. It felt so completely risky to accept that piece of code, even though I was able to confirm that it worked manually. Of course, when I first graduated, I used to write ALL my code without tests and never felt weird about it.
What I still don’t understand is how a recompile would make two different binaries from the same source. Surely compiling is deterministic.
You have hit the Gödel incompleteness point of TDD. Sooner or later, you drive down to a point where all you have left is an assumption.
I had a similar problem figuring out how to test a genetic algorithm engine I was writing. You see, genetic algorithms exhibit some inherent randomness that makes them difficult to test. I decided on a statistical approach. My tests read like this:
“Given some state I expect result N to occur 20% of the time”
or
“I always expect to solve problem X within 1000 permutations”
The latter type might work for you – run as many permutations of random sets as you need to reliably expose the bug.
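The latter kind of check might look roughly like this (a hypothetical sketch with a stubbed-out solver, not anyone’s real code): run up to a fixed budget of randomized attempts and fail if none of them succeeds.

    import java.util.Random;

    // Hypothetical statistical-style check: the problem must be solved within
    // a fixed number of randomized attempts, or the check fails.
    public class PermutationBudgetCheck {
        public static void main(String[] args) {
            Random random = new Random();
            boolean solved = false;
            for (int attempt = 0; attempt < 1000 && !solved; attempt++) {
                solved = tryToSolve(random);
            }
            if (!solved)
                throw new AssertionError("not solved within 1000 randomized attempts");
        }

        // Stand-in for one randomized attempt at the real problem.
        private static boolean tryToSolve(Random random) {
            return random.nextInt(100) < 5; // pretend each attempt succeeds 5% of the time
        }
    }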
Depending on third party code can be perilous. ;)
So, basically the problem is, how do you test classes or functions with undefined behavior? I’m not sure you can, other than avoiding such classes or functions.
And thus we see a clear-cut example of where formal verification of code and design-by-contract would have saved you a lot of effort.
First of all, the fact that *a* test passes does not imply that *the* bug is fixed. It only confirms that you’ve fixed the bug to the extent necessary to pass that test.
You will need to rely on formal proofs of code to verify your code genuinely has fixed the bug. Formal proofs would have discovered the set/list mismatch immediately. Whether your array fixes the bug will, again, depend on your pre- and post-conditions, and your class (and array) invariants.
TDD is dang useful. Don’t get me wrong. But it’s no substitute for formal proof.
Oh, and your blog’s handling of mark-up is utterly broken. Please fix it?
Isn’t this the kind of thing where you could have discovered it by running 1000 runs with randomly generated data?
Also, @Samuel, is there any way to automate proof checks these days? Adding a fixed amount of human work to each build is awfully expensive if you build and release several times a day.
I think perhaps the reason that the order of the objects in the HashSet is indeterminate is that the hash code is probably based on the memory location of the object, which would be indeterminate. In .NET you can/must override GetHashCode whenever you override Equals, because failure to do so may cause issues with hashtables. I assume the same is true in Java, so although you may not be able to reliably control the memory address, I assume you can control hashCode and thus the order in the HashSet. For testing purposes, you just need to subclass whatever it is that you are putting in the HashSet to return a specified hash code, ensuring the order in the HashSet is deterministic and making it possible to reliably reproduce the failure.
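A minimal sketch of that idea, here using a small wrapper class with a test-chosen hash code rather than a subclass, but the principle is the same (hypothetical names, not from the FitNesse codebase):

    // Hypothetical test wrapper: the test picks the hash codes, so it can force
    // the HashSet to hand elements back in whatever bucket order it wants.
    class FixedHashRow {
        private final String name;
        private final int hash;

        FixedHashRow(String name, int hash) {
            this.name = name;
            this.hash = hash;
        }

        @Override
        public boolean equals(Object other) {
            return other instanceof FixedHashRow
                    && ((FixedHashRow) other).name.equals(name);
        }

        @Override
        public int hashCode() {
            return hash; // chosen by the test, not derived from the object's identity
        }
    }

A test can then load a HashSet with wrappers whose hash codes are chosen to force a particular bucket order, and assert that the code under test still produces the right row order.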
Dear Uncle!
The statement “if I, Uncle Bob, cannot find a test that proves it, this bug is untestable” does not seem philosophically sound. It may be true, but it may not be. There is not much to lose if you let your faithful pupils try. Ten years is too large a piece of life to give up. May I ask you to disclose the minimal unit test you managed to nail it down to, the one that fails only occasionally, or any other specific technical information isolated as much as possible from the FitNesse context? This is a very naive unit test I tried, which does fail as expected:

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import junit.framework.TestCase;

    public class TestDuplicates extends TestCase {
        private Set<Integer> unmatchedRows = null;
        final private Integer rowNumbers[] = new Integer[] { 0, 2, 3, 1, 0, 4, 1 };
        final private Integer expected[] = new Integer[] { 0, 2, 3, 1, 0, 4, 1 };

        protected void setUp() throws Exception {
            super.setUp();
            this.unmatchedRows = new HashSet<Integer>();
            for (Integer n : rowNumbers)
                this.unmatchedRows.add(n);
        }
    }
It’s obviously wrong somewhere, but there’s too much guessing involved to proceed further in any direction. At the moment I’m more inclined to think that if the facts do not fit the theory, too bad for the facts.
Sincerely and with full respect, Asher
P.S. Sorry for the code formatting; I do not know how to make it better.
Hmm… either I misunderstand something, or you could have at least come quite a bit closer by making the particular Set implementation more pluggable and replacing it with a mock set that gives out duplicates in a controlled order at testing time, no?
Cees de Groot: That was my first thought. Well, actually my first thought was to use a SortedSet, but a mock set works better.
I’m with Cees. You could create a mock HashSet that deliberately gave you back items in the wrong order. If that generates a failed test, then your confidence that you’ve found the problem will increase. Of course, there may be other reasons why it would fail that you aren’t testing for, but at least you’ll know that you’ve eliminated one plausible failure scenario.
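For example, a test double along these lines (my own hypothetical sketch) could be substituted for the real set:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.List;

    // Hypothetical test double: a HashSet whose iteration order is deliberately
    // reversed, so a test can force the "wrong" duplicate order on demand.
    class ReversedOrderHashSet<E> extends HashSet<E> {
        @Override
        public Iterator<E> iterator() {
            List<E> elements = new ArrayList<>();
            super.iterator().forEachRemaining(elements::add);
            Collections.reverse(elements);
            return elements.iterator();
        }
    }

Copying such a set into a List, for example with new ArrayList<>(set), then yields the reversed order, which is exactly the kind of “wrong” order the production code has to handle.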
I am surprised that it has taken this long to find untestable code. How do you test for a bug caused by a race condition? We have a function that adds random noise. The only way we have found to test it is to plot the noise distribution and examine the shape of the distribution curve. Sometimes you just have to put a human in the loop.
Whoa, testing the untestable. Crazy. lol.
Several people have already noted that you could switch out the implementation for HashSet by using an interface instead. Of course, if you really want to check this behavior without doing that (which I assume), the way I would do it is to make sure the objects I send in to the HashSet have a hashcode that I control, which will enable me to change the ordering deterministically.
As far as I can tell, a recompile would not have changed any of the properties that give different hash codes – it’s far more likely that the problem was caused by a statistically low chance of getting the right memory addresses, that the objects in the hashset didn’t have custom hashCode implementations, and that the recompile rearranged memory to a larger degree than the rerunning of tests. All in all an insidious heisenbug, but in no way untestable by standard unit testing approaches.
But you have no guarantees of that. The HashSet implementation can change in the future, or someone else can change the HashSet implementation, and the test will start to fail randomly again.
I like to learn from experience and I like theory too.
An example of bugs that cannot be unit tested comes from the Halting problem:
- detecting the possibility of a deadlock (search for Deadlock on Wikipedia and read section 2.4, Detection)
- detecting the possibility of a race condition
I seem to remember that from the Halting problem it is possible to demonstrate that algorithms that use random numbers (like the order of rows in the HashSet you mentioned) could potentially be untestable too.