Python Subgroup Detection and Optimization

Posted by tottinger Thu, 28 Jun 2007 04:23:00 GMT

I had a moderately interesting customer problem to work on. I got acquainted with a bit of legacy code that is seriously in need of some interface segregation. It’s an entirely concrete class and used from all over the code base. The question is how to segregate, and that depends on what methods are called from which programs. We ran ‘nm’ to extract the link table from our object files, saving me the trouble of parsing C++ (a scary thought) All that remained was for me to compare the method prototypes used by the object files and find the common sets.

Lacking better ideas, I decided to do this exhaustively in a brute-force kind of way. It is only a few hundred files, so it shouldn’t take too long. I was very wrong. It took a long time and eventually failed.

I had TDD-ed the code, so I had tests of correctness, and I relied on these as I added optimizations, but of course the performance problem occurred only under a real load.

I could have run a profiler on it (and probably should have) but instead I simply monitored my computer. I quickly saw that my time was going into memory allocation, which is also the reason it died after many minutes When I have python performance problems, this is usually the reason, and almost never “interpreter drag”.

My friend Norbert (a gentleman of many programming languages, including ruby) suggested that I wasn’t interning my strings, and of course I was not. I switched to interning strings, and noticed a little improvement which meant the program ran longer before failing from memory problems. Well, the tests still passed, so I knew at least that the logic was still good even if the algorithm was primitive and Occam was spinning in his grave.

Next I realized that I am dealing with a lot of small groups of strings, and in a leap of optimism/faith/stupidity I decided to intern groups of strings. That helped very little, but it did help since my program was now running for the better part of an hour (!!!) and then failing. “More of same” without measurement is hardly a good recipe for optimization.

This is when I realized that I shouldn’t be storing sets or frozensets, which are pretty heavyweight data structures. I had chosen them because I was really working with set intersections, but hadn’t counted the in-memory storage cost. I converted the data structure used by the algorithm and added a local function to make tuples out of the sorted sets.

I was very glad to have my tests to catch me when I had some typing mistake or sloppy conversion. My tests had to be edited, but the changes there were very lightweight, and caused me to abstract out some data comparisons that were (admittedly) repeated. It was all good.

When I ran the full program it completed so quickly that I was sure I’d broken it. It was running with sub-second time, including gathering data from various text files (nm output files). I did a few spot-checks, and determined that it was indeed doing the right thing (as far as I know).

The data is moderately interesting, and I will be able to pick out some useful interfaces. Better yet, I have a program that can pick out all the uses of my big, fat class and recommend interfaces to me. This is all good.

Lessons learned:
  • The tests gave me peace of mind as I worked. I would so hate to have done this “naked”.
  • Python’s speed is fine (even startling) if you aren’t doing something wasteful and heavy-handed like storing hundreds or thousands of non-interned strings and heavyweight data structures
  • I could have been done sooner if I’d measured with hotshot instead of guessing. This is very clear to me, and I won’t think I’m too clever or my problem is too simple to do this ever again.
  • Keep some ruby friends on hand. They come in handy.
  • I didn’t really need a cooler algorithm. Brute force is sometimes enough.

I want to build a new wrapper for this code to compare parameter lists, to help find unrecognized classes in these same programs. It shouldn’t be too hard, but it will be a larger data set so I will probably need some more optimization or a cooler algorithm later. I think I see some waste in it now, but I guess I’ll have to fix that after blogging.

Of course, I’m not done learning. I am sure there are a lot of ways to improve the core code. I also believe in other people, that I should make it available to criticism and suggestions.

Here it is:

Tests

(Which, embarassingly, could use more refactoring)
import unittest
import clumps

def keySet(someMap):
    return set(someMap.keys())

class ClumpFinding(unittest.TestCase):
    def testNoGroupsForSingleItem(self):
        input = { "OneGroup": [1,2] }
        actual = clumps.find_groups(input)
        self.assertEquals({}, actual)

    def testNoOverlapMeansNoGroups(self):
        input = {
            "first": [1,3],
            "second": [2,4]
        }
        actual = clumps.find_groups(input)
        self.assertEquals({}, actual)

    def testIgnoresSingleMatches(self):
        input = {
            "first": [1,3],
            "second": [1,4]
        }
        actual = clumps.find_groups(input)
        self.assertEquals({}, actual)

    def testTwoInterfaceMatches(self):because of
        group = (1,2)
        names = ("first","second")
        input = dict( [(name,group) for name in names])

        actual = clumps.find_groups(input)

        self.assertEquals(1, len(actual))
        [key] = actual.keys()
        self.contentMatch(group, key)
        self.contentMatch(input.keys(), actual[key])

    def contentMatch(self, left,right):
        left,right = map( frozenset, [left,right])
        self.assertEquals(left,right)

    def testFindsThreeGroupsMatchingExactly(self):
        group = [1,3,8]
        names = "one","two","three" 
        input = dict( [(name,group) for name in names ] )

        actual = clumps.find_groups(input)

        self.assertEquals(1, len(actual))
        [clump_found] = actual.keys()
        self.contentMatch(group, clump_found)
        self.contentMatch(names, actual[clump_found])

    def testFindsPartialMatchInThreeGroups(self):
        input = {
            "a":[1,2,3,4,5],
            "b":[1,4,5,6,8],
            "c":[0,1,4,5]
        }
        target_group = frozenset([1,4,5])
        names = input.keys()

        actual = clumps.find_groups(input)

        [key] = actual.keys()
        self.contentMatch(target_group, key)
        self.contentMatch(names, actual[key])

    def testFindsMultipleMatches(self):
        input = {
            "a":[1,2,3,4,5],
            "b":[1,4,5,6,8],
            "c":[0,1,4,5],
            "d":[1,2,3],
            "e":[1,3]
        }

        actual = clumps.find_groups(input)
        keys = actual.keys()

        self.assertEqual(3, len(actual))

        grouping = (1,2,3)
        referents = set(["a","d"])
        self.assert_(grouping in keys, "expect %s in %s" % (grouping,keys) )
        self.assertEqual(referents, actual[grouping])

        grouping = (1,4,5)
        referents = set(["a","b","c"])
        self.assert_(grouping in keys)
        self.assertEqual(referents, actual[grouping])

        grouping = (1,3)
        referents = set(["a","d","e"])
        self.assert_(grouping in keys)
        self.assertEqual(referents, actual[grouping])

if __name__ == "__main__":
    unittest.main()

The Code

import sys

def find_groups(named_groups):
    """ 
    Exhaustively searches for grouping of items in a map, such 
    that an input map like this:
          "first":[1, 2, 3, 4],
          "second":[1,2,3,5,6],
          "third":[1,2,5,6]
    will result in:
        [1,2,3]: ["first","second"]
        [1,2]: ["first","second","third"]
        [5,6]: ["second","third"]

    Note that the return value dict is a mapping of frozensets to sets,
    not lists to lists as given above. Also, being a dict, the results
    are effectively unordered.
    """ 
    def tupleize(data):
        "Convert a set or frozenset or list to a tuple with predictable order" 
        return tuple(sorted(list(data)))

    result = {}
    for name, methods_called in named_groups.iteritems():
        methods_group = frozenset(methods_called)
        methods_tuple = tupleize(methods_group)
        for stored_interface in result.keys():
            key_set = frozenset(stored_interface)
            common_methods = tupleize(key_set.intersection(methods_group))
            if common_methods:
                entry_as_list = list(result.get(common_methods,[]))
                entry_as_list.append(name)
                entry_as_list.extend( result[stored_interface] )
                result[common_methods] = tupleize(entry_as_list)

        full_interface_entry = result.setdefault(methods_tuple, [])
        if name not in full_interface_entry:
            full_interface_entry.append(name)
    return filter(result)

def filter(subsets):
    # Apology: I'm betting I can do this in a functional way.
    filtered = {}
    for key,value in subsets.iteritems():
        if (len(key) > 1) and (len(value) > 1):
            filtered[key] = set(value)
    return filtered

def display_groupings(groupings):
    "Silly helper function to print groupings" 
    keys = sorted(groupings.keys(), cmp=lambda x,y: cmp(len(x),len(y)))
    for key in keys:
        print "\n","-"*40
        for item in key:
            print item
        for item in sorted(groupings[key]):
            print "     ",item
        print

Comments

Leave a response

Comments