Generated Tests and TDD

Posted by Uncle Bob Thu, 10 Jan 2008 19:59:30 GMT

TDD has become quite popular, and many companies are attempting to adopt it. However, some folks worry that it takes a long time to write all those unit tests and are looking to test-generation tools as a way to decrease that burden.

The burden is not insignificant. FitNesse, an application created using TDD, comprises 45,000 lines of Java code, 15,000 of which are unit tests. Simple math says that a full third of that code is tests!

Of course this is a naive analysis. The benefits of using TDD are significant, and far outweigh the burden of writing the extra code. But that 33% still feels “extra” and tempts people to find ways to shrink it without losing any of the benefits.

Test Generators

Some folks have put their hope in tools that automatically generate tests by inspecting code. These tools are very clever. They generate random calls to methods and remember the results. They can automatically build mocks and stubs to break the dependencies between modules. They use remarkably clever algorithms to choose their random test data. They even provide ways for programmers to write plugins that adjust those algorithms to be a better fit for their applications.

The end result of running such a tool is a set of observations. The tool observes how the instance variables of a class change when calls are made to its methods with certain arguments. It notes the return values, changes to instance variables, and outgoing calls to stubs and mocks. And it presents these observations to the user.

The user must look through these observations and determine which are correct, which are irrelevant, and which are bugs. Once the bugs are fixed, these observations can be checked over and over again by re-running the tests. This is very similar to the record-playback model used by GUI testers. Once you have registered all the correct observations, you can play the tests back and make sure those observations are still being observed.

Some of the tools will even write the observations as JUnit tests, so that you can run them as a standard test suite. Just like TDD, right? Well, not so fast…

Make no mistake, tools like this can be very useful. If you have a wad of untested legacy code, then generating a suite of JUnit tests that verifies some portion of the behavior of that code can be a great boon!
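
To make the “observations” concrete, the kind of test such a tool emits might look something like this (purely illustrative; the Account class and the recorded values are hypothetical, and real tools format their output differently):

    // Hypothetical production class, shown only so the example is complete.
    class Account {
        private int balance;
        Account(int balance) { this.balance = balance; }
        int withdraw(int amount) { balance -= amount; return balance; }
    }

    // A generated "observation": it asserts what the code currently does,
    // not what anyone intended it to do.
    @Test
    public void withdrawObservation() {
        Account account = new Account(100);
        assertEquals(-50, account.withdraw(150)); // overdraft silently allowed
    }

It is up to the human reviewing the observation to decide whether the silent overdraft is correct behavior or a bug.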

The Periphery Problem

On the other hand, no matter how clever the test generator is, the tests it generates will always be more naive than the tests that a human can write. As a simple example of this, I have tried to generate tests for the bowling game program using two of the better known test generation tools. The interface to the Bowling Game looks like this:

  public class BowlingGame {
    public void roll(int pins) {...}
    public int score() {...}
  }
The idea is that you call roll each time the ball gets rolled, and you call score at the end of the game to get the score for that game.
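
For example, a game of nothing but strikes takes exactly twelve rolls and scores 300. A hand-written, JUnit-style test for that case (imports elided; the roll/score interface is the one above) might look something like this:

    @Test
    public void perfectGameScores300() {
        BowlingGame game = new BowlingGame();
        for (int i = 0; i < 12; i++)
            game.roll(10);               // twelve strikes in a row
        assertEquals(300, game.score()); // a perfect game scores 300
    }

A human writing this test knows the rules of bowling; that knowledge is exactly what a random generator lacks.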

The test generators could not randomly generate valid games. It’s not hard to see why. A valid game is a sequence of between 12 and 21 rolls, all of which must be integers between 0 and 10. What’s more, within a given frame, the sum of rolls cannot exceed 10. These constraints are just too tight for a random generator to achieve within the current age of the universe.

I could have written a plugin that guided the generator to create valid games; but such an algorithm would embody much of the logic of the BowlingGame itself, so it’s not clear that the economics are advantageous.

To generalize this, the test generators have trouble getting inside algorithms that have any kind of protocol, calling sequence, or state semantics. They can generate tests around the periphery of the classes; but can’t get into the guts without help.

TDD?

The real question is whether or not such generated tests help you with Test Driven Development. TDD is the act of using tests as a way to drive the development of the system. You write unit test code first, and then you write the application code that makes that test pass. Clearly, generating tests from existing code violates that simple rule. So in some philosophical sense, using test generators is not TDD. But who cares so long as the tests get written, right? Well, hang on…

One of the reasons that TDD works so well is that it is similar to the accounting practice of dual entry bookkeeping. Accountants make every entry twice; once on the credit side, and once on the debit side. These two entries follow separate mathematical pathways. In the end a magical subtraction yields a zero if all the entries were made correctly.

In TDD, programmers state their intent twice; once in the test code, and again in the production code. These two statements of intent verify each other. The tests test the intent of the code, and the code tests the intent of the tests. This works because it is a human that makes both entries! The human must state the intent twice, but in two complementary forms. This vastly reduces many kinds of errors, as well as providing significant insight into improved design.

Using a test generator breaks this concept because the generator writes the test using the production code as input. The generated test is not a human restatement, it is an automatic translation. The human states intent only once, and therefore does not gain insights from restatement, nor does the generated test check that the intent of the code was achieved. It is true that the human must verify the observations, but compared to TDD that is a far more passive action, providing far less insight into defects, design and intent.

I conclude from this that automated test generation is neither equivalent to TDD, nor is it a way to make TDD more efficient. What you gain by trying to generate the 33% test code, you lose in defect elimination, restatement of intent, and design insight. You also sacrifice depth of test coverage, because of the periphery problem.

This does not mean that test generators aren’t useful. As I said earlier, I think they can help to partially characterize a large base of legacy code. But these tools are not TDD tools. The tests they generate are not equivalent to tests written using TDD. And many of the benefits of TDD are not achieved through test generation.

So just what does synchronized do?

Posted by Brett Schuchert Fri, 04 Jan 2008 03:27:00 GMT

Synopsis

Using synchronized turns a huge state space into a comparatively small one.

Normally, the light from a star radiates out in all directions. But what happens when a star collapses? There are several possibilities depending on the mass of the star. Our sun will turn into a red giant and then later turn into a white dwarf, giving out light from its accumulated heat for many years after living on Earth has become unbearable; mostly because of all the traffic.

If the mass of the star is around 10x that of our sun, its destiny is just a bit different. It will begin to collapse. Along the way, it will probably have some temporary reprieves from its ultimate fate of becoming a neutron star, by fusing heavier elements such as carbon and helium. However, the writing is on the wall. The center of the star, already an extreme environment, becomes even more so; eventually, the pressure from the material collapsing into the center of the star results in a massive explosion known as a nova or supernova, depending on the mass of the star. What is left is a neutron star. A neutron star is the heart of a pulsar, able to spin at amazing rates without pulling itself apart.

Another option is a black hole. A black hole is an often-used metaphor for either the effort expended on supporting, fixing, and updating a big ball of mud, or for what the development effort seems to be disappearing into most of the time. Either way, it is not considered a good thing if you’re anywhere in the vicinity.

I’m a pessimist by nature. I see a project in decline, sucking in other resources, like a black hole eating neighboring galaxies and snuffing the light out of stars, just as a project can suck the life force out of individuals.

However, energy can escape from a black hole, eventually leading to its demise. Energy can escape along the axis of rotation in either direction. In a sense, if we see the growth of a black hole as a bad thing, anything that diminishes a black hole is a good thing. Stephen Hawking theorized that at the event horizon of a black hole, fluctuations in space can cause the spontaneous creation of pairs of particles and antiparticles. This is another way a black hole can appear to lose energy, since it gives off one particle while the opposite particle falls into the black hole.

Now I’m off track. I wanted to talk about how synchronized gives an effect similar to the massive gravity of a black hole. But instead of curving space-time, it collapses the possible state space of code in the presence of threads. Let’s take a look.

Here’s a simple class:

public class X {
    int value;

    public synchronized void incrementValue() {
        ++value;
    }
}
Let’s say, for the sake of argument, two threads are trying to execute the single method in this class. If the method is written like this:

    public synchronized void incrementValue() {
        ++value;
    }

Then there are only two orderings: Thread 1 then Thread 2, or Thread 2 then Thread 1. All of those paragraphs at the top were building up to this analogy.

OK, so if we use synchronized, that’s the black hole. The number of possible paths through the code for N threads is N!. What’s the opposite? What happens if we take off the keyword and then run two threads through it? In how many ways can that single line of code get executed?

Are you ready for it?

3,432 different ways

Really? Well at least conceptually. In practice, the Just In Time compiler will probably turn the JVM instructions into native code, so the actual number of individual steps will probably be less. But there really are that many paths through the system. You might ask how?

That single line of code is represented in byte-code as 7 instructions. I could show you the disassembly, but really, it’s 7. You can convince yourself by having a look at the byte-code using a nice byte-code viewer plug-in for Eclipse.
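
If you don’t have such a plug-in handy, here is roughly what the disassembly looks like (a sketch of javap -c output; the exact constant-pool indices depend on your compiler, but the count of seven instructions is the point):

    public void incrementValue();
      Code:
         0: aload_0       // push 'this'
         1: dup           // duplicate 'this' for the write-back below
         2: getfield #2   // read value          (Field value:I)
         5: iconst_1      // push the constant 1
         6: iadd          // add
         7: putfield #2   // write value back    (Field value:I)
        10: return

Two threads weaving their way through those seven instructions, in every order that keeps each thread’s own steps in sequence, give the 3,432 paths.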

What happens if we change the int to a long? We still have 7 lines of byte-code, but with different instructions, and a total of 4 long reads/writes. Each long read/write may be performed as two individual 32-bit operations; at least, the JVM is allowed to implement it that way. The actual number of atomic operations after the Just In Time compiler has had its way with your byte-code will probably be smaller, but if we stick to byte-codes and the Java memory model, the number 3,432 becomes a whopping 705,432!

Let’s extend that just a bit more. What if you have several lines of code? Each line of Java results in many more lines of byte-code. What if we have 3 lines of Java? We probably have something like 21 lines of byte-code. How many paths of execution would 21 lines of byte-code executed by two threads have? 538,257,874,440!

Where the first program ended up being a white dwarf, the version using longs was a nova. I’m not sure what to call the three-line Java method; maybe a hyper-nova?

In practice, the actual number of paths through a method will end up being smaller. And most of the paths do not have any negative side-effects. The problem, however, is that there are a small number of hard-to-find paths of execution that do cause problems, and finding them is like looking for a needle in a haystack.

Remember what adding the keyword synchronized did? If we use it on a three-line method, it turns 538,257,874,440 into 2. It collapses the number of paths down to 2 for two threads. And 2 is less than 538,257,874,440, even for very large values of 2.

How did I arrive at these magic numbers? Two ways.

I wanted to know how many possible orderings there were for 7 instructions and two threads. I knew it had to do with combinations and permutations but I just wasn’t smart enough to figure it out. Searching on Google didn’t do much for me either. So I decided to write a program to numerically derive the result. I tested my way into a program that would calculate the result.

The basic algorithm is (a rough sketch follows the list):
  1. Identify each of the unique “nodes” in the system. For example, if I have 2 threads and 2 instructions, I can generate a natural key (ugh!) to represent each of the different combinations of threads and instructions: T1S1 T1S2 T2S1 T2S2.
  2. Produce all valid permutations of these four “nodes”.
  3. Remove any that have invalid orderings.
  4. Count the results.
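
Generating every permutation and then filtering is exactly as expensive as it sounds. Just as an illustration (this is not the PathCalculator shown below), an equivalent way to count the same orderings is to recursively pick which thread executes its next step; since steps within a thread can never be reordered, every path the recursion takes is a valid ordering:

    // Not the PathCalculator from my tests; just a tiny, equivalent counting sketch.
    // It counts the orderings of several threads, each executing a fixed number of
    // sequential steps, by recursively choosing which thread runs its next step.
    public class OrderingCounter {

        public static long count(int[] remainingSteps) {
            boolean finished = true;
            long total = 0;
            for (int t = 0; t < remainingSteps.length; t++) {
                if (remainingSteps[t] > 0) {
                    finished = false;
                    remainingSteps[t]--;             // thread t executes its next step
                    total += count(remainingSteps);
                    remainingSteps[t]++;             // backtrack
                }
            }
            return finished ? 1 : total;             // one complete ordering counted
        }

        public static void main(String[] args) {
            System.out.println(count(new int[] {7, 7}));   // 2 threads, 7 steps: 3432
            System.out.println(count(new int[] {11, 11})); // 2 threads, 11 steps: 705432
        }
    }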

Many of my first implementations for some of these steps were exercises in writing really inefficient code! But having those tests made it really easy to swap out better algorithms when my brain caught up with what I was doing.

What is an invalid ordering? The code generated by the single line using the pre-increment operator has 7 steps. Those 7 steps are in a single sequence with no looping and no conditionals. That’s like saying A always comes before B. So if we generate an ordering that puts B before A, we’ve generated an ordering that cannot happen.

So you can describe your possible state space in a few steps:
  1. Register a node for each thread/step combination. 2 threads, 7 steps = 14 nodes.
  2. Register dependencies between steps. For example, if I have seven steps and two threads executing the same sequence, I could describe this as: T1S1 -> T1S2 -> T1S3, etc. And for the second thread: T2S1 -> T2S2 -> T2S3. Since the name is arbitrary, I just used letters, so in fact I had: a -> b -> c -> d -> e -> f -> g for one set of dependencies and h -> i -> j -> k -> l -> m -> n

In fact, here’s the test I used to describe that exact space:


    @Test
    public void twoThreadsAndSevenSteps() {
        calculator = new PathCalculator(14);
        createLine('a', 'b', 'c', 'd', 'e', 'f', 'g');
        createLine('h', 'i', 'j', 'k', 'l', 'm', 'n');
        start();
        int totalPaths = calculator.totalPaths();
        stop(2, 4, totalPaths);
    }

By the time I got to this point, I wrote some “tests” using JUnit as a harness to run my code and display timing information and a summary of the results:


Time:        4ms -- Threads:  3, Opcodes:  2, Total paths:      90
Time:        1ms -- Threads:  2, Opcodes:  3, Total paths:      20
Time:       54ms -- Threads:  4, Opcodes:  2, Total paths:    2520
Time:        3ms -- Threads:  2, Opcodes:  4, Total paths:      70
Time:       61ms -- Threads:  2, Opcodes:  5, Total paths:     252
Time:     1248ms -- Threads:  2, Opcodes:  6, Total paths:     924
Time:    12819ms -- Threads: 10, Opcodes:  1, Total paths: 3628800

I then used this information to fit a curve to my data and derive a formula for various combinations. For 2 threads, I came up with the following formula. If you let N be the number of steps:
Total Paths = 252 * 3.6667^(N-5)

The morning after, Uncle Bob sent an email showing me his work on how he calculated it.

If you let:
  1. T be the number of threads
  2. N be the number of steps
Total = (N*T)! / (N!)^T

When I plugged numbers into his formula, they fit my formula. His formula generated an exact result, as opposed to my estimate based on fitting a curve. Another thing his solution has going for it is that the algorithm I came up with is, I believe, NP-complete. The net effect was that while I was able to calculate the result for 2 threads and 6 steps, 2 threads and 7 steps ran for a long time before running out of memory with a 2GB heap. Bob’s formula gave me a so-called verifier (as did my formula, only Bob’s was better), and the algorithm seems to grow based on factorials and not polynomials.
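
To see the fit, here are the earlier numbers plugged into Bob’s formula with T = 2:

    N =  7 (the int version):      14! / (7! * 7!)   =           3,432
    N = 11 (the long version):     22! / (11! * 11!) =         705,432
    N = 21 (three lines of Java):  42! / (21! * 21!) = 538,257,874,440

It also matches the timing output above (10 threads and 1 opcode gives 10! = 3,628,800), and it explains the synchronized collapse: with the lock held, each thread’s method is effectively one atomic step, so N = 1 and the total is just T!, which is 2 for two threads.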

What’s the point of all of this? The single one-line program is a demonstration of why going from one thread to two threads makes writing/debugging/testing programs so much harder. It also numerically demonstrates why the single responsibility principle is even more important when we start writing concurrent programs. The more you put in one place, the larger the state space. Larger state spaces:

  1. Make it harder to find the small number of paths that can cause your program to die, under load, in production, while you are on vacation.
  2. Make your efforts to test for concurrency issues (testing is a sampling technique anyway) all the more difficult.
  3. Make it harder to find what might actually have caused a failure once you’ve noticed it.

In a later post, I’ll describe how we can improve our chances of covering this huge space of potential orderings.

If you’ve made it this far, wow! Congratulations!
