Python Subgroup Detection and Optimization

Python Subgroup Detection and Optimization 13

Posted by tottinger Thu, 28 Jun 2007 04:23:00 GMT

I had a moderately interesting customer problem to work on. I got acquainted with a bit of legacy code that is seriously in need of some interface segregation. It’s an entirely concrete class and used from all over the code base. The question is how to segregate, and that depends on what methods are called from which programs. We ran ‘nm’ to extract the link table from our object files, saving me the trouble of parsing C++ (a scary thought) All that remained was for me to compare the method prototypes used by the object files and find the common sets.

Lacking better ideas, I decided to do this exhaustively in a brute-force kind of way. It is only a few hundred files, so it shouldn’t take too long. I was very wrong. It took a long time and eventually failed.

I had TDD-ed the code, so I had tests of correctness, and I relied on these as I added optimizations, but of course the performance problem occurred only under a real load.

I could have run a profiler on it (and probably should have) but instead I simply monitored my computer. I quickly saw that my time was going into memory allocation, which is also the reason it died after many minutes When I have python performance problems, this is usually the reason, and almost never “interpreter drag”.

My friend Norbert (a gentleman of many programming languages, including ruby) suggested that I wasn’t interning my strings, and of course I was not. I switched to interning strings, and noticed a little improvement which meant the program ran longer before failing from memory problems. Well, the tests still passed, so I knew at least that the logic was still good even if the algorithm was primitive and Occam was spinning in his grave.

Next I realized that I am dealing with a lot of small groups of strings, and in a leap of optimism/faith/stupidity I decided to intern groups of strings. That helped very little, but it did help since my program was now running for the better part of an hour (!!!) and then failing. “More of same” without measurement is hardly a good recipe for optimization.

This is when I realized that I shouldn’t be storing sets or frozensets, which are pretty heavyweight data structures. I had chosen them because I was really working with set intersections, but hadn’t counted the in-memory storage cost. I converted the data structure used by the algorithm and added a local function to make tuples out of the sorted sets.

I was very glad to have my tests to catch me when I had some typing mistake or sloppy conversion. My tests had to be edited, but the changes there were very lightweight, and caused me to abstract out some data comparisons that were (admittedly) repeated. It was all good.

When I ran the full program it completed so quickly that I was sure I’d broken it. It was running with sub-second time, including gathering data from various text files (nm output files). I did a few spot-checks, and determined that it was indeed doing the right thing (as far as I know).

The data is moderately interesting, and I will be able to pick out some useful interfaces. Better yet, I have a program that can pick out all the uses of my big, fat class and recommend interfaces to me. This is all good.

Lessons learned:

The tests gave me peace of mind as I worked. I would so hate to have done this “naked”.
Python’s speed is fine (even startling) if you aren’t doing something wasteful and heavy-handed like storing hundreds or thousands of non-interned strings and heavyweight data structures
I could have been done sooner if I’d measured with hotshot instead of guessing. This is very clear to me, and I won’t think I’m too clever or my problem is too simple to do this ever again.
Keep some ruby friends on hand. They come in handy.
I didn’t really need a cooler algorithm. Brute force is sometimes enough.

I want to build a new wrapper for this code to compare parameter lists, to help find unrecognized classes in these same programs. It shouldn’t be too hard, but it will be a larger data set so I will probably need some more optimization or a cooler algorithm later. I think I see some waste in it now, but I guess I’ll have to fix that after blogging.

Of course, I’m not done learning. I am sure there are a lot of ways to improve the core code. I also believe in other people, that I should make it available to criticism and suggestions.

Here it is:

Tests

(Which, embarassingly, could use more refactoring)

import unittest
import clumps

def keySet(someMap):
    return set(someMap.keys())

class ClumpFinding(unittest.TestCase):
    def testNoGroupsForSingleItem(self):
        input = { "OneGroup": [1,2] }
        actual = clumps.find_groups(input)
        self.assertEquals({}, actual)

    def testNoOverlapMeansNoGroups(self):
        input = {
            "first": [1,3],
            "second": [2,4]
        }
        actual = clumps.find_groups(input)
        self.assertEquals({}, actual)

    def testIgnoresSingleMatches(self):
        input = {
            "first": [1,3],
            "second": [1,4]
        }
        actual = clumps.find_groups(input)
        self.assertEquals({}, actual)

    def testTwoInterfaceMatches(self):because of
        group = (1,2)
        names = ("first","second")
        input = dict( [(name,group) for name in names])

        actual = clumps.find_groups(input)

        self.assertEquals(1, len(actual))
        [key] = actual.keys()
        self.contentMatch(group, key)
        self.contentMatch(input.keys(), actual[key])

    def contentMatch(self, left,right):
        left,right = map( frozenset, [left,right])
        self.assertEquals(left,right)

    def testFindsThreeGroupsMatchingExactly(self):
        group = [1,3,8]
        names = "one","two","three" 
        input = dict( [(name,group) for name in names ] )

        actual = clumps.find_groups(input)

        self.assertEquals(1, len(actual))
        [clump_found] = actual.keys()
        self.contentMatch(group, clump_found)
        self.contentMatch(names, actual[clump_found])

    def testFindsPartialMatchInThreeGroups(self):
        input = {
            "a":[1,2,3,4,5],
            "b":[1,4,5,6,8],
            "c":[0,1,4,5]
        }
        target_group = frozenset([1,4,5])
        names = input.keys()

        actual = clumps.find_groups(input)

        [key] = actual.keys()
        self.contentMatch(target_group, key)
        self.contentMatch(names, actual[key])

    def testFindsMultipleMatches(self):
        input = {
            "a":[1,2,3,4,5],
            "b":[1,4,5,6,8],
            "c":[0,1,4,5],
            "d":[1,2,3],
            "e":[1,3]
        }

        actual = clumps.find_groups(input)
        keys = actual.keys()

        self.assertEqual(3, len(actual))

        grouping = (1,2,3)
        referents = set(["a","d"])
        self.assert_(grouping in keys, "expect %s in %s" % (grouping,keys) )
        self.assertEqual(referents, actual[grouping])

        grouping = (1,4,5)
        referents = set(["a","b","c"])
        self.assert_(grouping in keys)
        self.assertEqual(referents, actual[grouping])

        grouping = (1,3)
        referents = set(["a","d","e"])
        self.assert_(grouping in keys)
        self.assertEqual(referents, actual[grouping])

if __name__ == "__main__":
    unittest.main()

The Code

import sys

def find_groups(named_groups):
    """ 
    Exhaustively searches for grouping of items in a map, such 
    that an input map like this:
          "first":[1, 2, 3, 4],
          "second":[1,2,3,5,6],
          "third":[1,2,5,6]
    will result in:
        [1,2,3]: ["first","second"]
        [1,2]: ["first","second","third"]
        [5,6]: ["second","third"]

    Note that the return value dict is a mapping of frozensets to sets,
    not lists to lists as given above. Also, being a dict, the results
    are effectively unordered.
    """ 
    def tupleize(data):
        "Convert a set or frozenset or list to a tuple with predictable order" 
        return tuple(sorted(list(data)))

    result = {}
    for name, methods_called in named_groups.iteritems():
        methods_group = frozenset(methods_called)
        methods_tuple = tupleize(methods_group)
        for stored_interface in result.keys():
            key_set = frozenset(stored_interface)
            common_methods = tupleize(key_set.intersection(methods_group))
            if common_methods:
                entry_as_list = list(result.get(common_methods,[]))
                entry_as_list.append(name)
                entry_as_list.extend( result[stored_interface] )
                result[common_methods] = tupleize(entry_as_list)

        full_interface_entry = result.setdefault(methods_tuple, [])
        if name not in full_interface_entry:
            full_interface_entry.append(name)
    return filter(result)

def filter(subsets):
    # Apology: I'm betting I can do this in a functional way.
    filtered = {}
    for key,value in subsets.iteritems():
        if (len(key) > 1) and (len(value) > 1):
            filtered[key] = set(value)
    return filtered

def display_groupings(groupings):
    "Silly helper function to print groupings" 
    keys = sorted(groupings.keys(), cmp=lambda x,y: cmp(len(x),len(y)))
    for key in keys:
        print "\n","-"*40
        for item in key:
            print item
        for item in sorted(groupings[key]):
            print "     ",item
        print

Posted in Tim's Tepid Torrent
Meta 13 comments, permalink, rss, atom

Comments

Leave a response

iPhone to Mac Transfer over 3 years later:

The software you can trust to export iPhone music, video and more to Mac.
Pandora over 3 years later:

In short, the test suite makes it easy to make changes to my code. It makes my code flexible and easy to maintain.
Criminal Records over 4 years later:

The data is moderately interesting, and I will be able to pick out some useful interfaces.
Tenant Screening over 4 years later:

I think I see some waste in it now, but I guess I’ll have to fix that after blogging.
Gaylene Mortell over 4 years later:

Really nice post, I have been looking to read something of this nature and your writing has fit the bill perfectly, thanks very much indeed.
okey oyunu oyna over 4 years later:

great code i will use it. Thanks

Gerçek kisilerle sohbet ederek Okey Oyunu Oyna ve internette online oyun oynamanin zevkini çikar.
chaussures over 4 years later:

It is often detailed that should “ 50 ‘ soon after the” equipped coupled with encouraged together with the Fang-Fang Li, Liu Dong, Huang Ming brand-new legend handsetJimmy Choo Escarpins sandales. Fang-Fang Li were able to move on as a result of Ny University College student Bank concerned with Training video Attempting Business while in the the front without the need of hand back is often Ang Protect; Jimmy Choo Tall Bootsyour ex-girlfriend compact popularity, ‘07 ages youngster exactly who manufacture timeless tomes, “ 17-year-old yowl, ” Jimmy Choo Tall Bootscoupled with made use of within 10 types considering the exact make concerned with its TV cord, prevalent watch, the following turned sailing the initial accolade reward along with the Enthusiastic Head limitation Reward

the north face
.

At this juncture a person’s preparation while in the program “ 50 ‘ after” the next concerned with 2007 have been remaining picked up all over Tokyo, “ Asian Youngster

north face jakke
program. ” “ 50 ‘ soon after the” Liu Dong Chen XingchenDiscount Ugg Boots |Discount Ugg Boots gamed outside together with the business expansion considering 1980s track record while in the big vary arrangement China’ ersus marvelous improvements all over couple of a long time, signifying business expansion concerned with 50 youngster right after track record utilizing their love, conduct coupled with likelihood so you might impressive symptoms.canada goose This amazing news flash press reporter determined this, “ 50 ‘ after” is going to be launched all over Tiongkok all over past due Augustcanada goose jakke.
Air Jordan Max Fusion over 4 years later:

A excellent dude is invariably ready to acquire little.
Buy Nike Air Yeezy over 4 years later:

The worst method to forget some one is for getting sitting centerbesidehim knowing you cant have him.
ysbearing over 4 years later:

Slewing bearing called slewing ring bearings, is a comprehensive load to bear a large bearing, can bear large axial, radial load and overturning moment.
ysbearing over 4 years later:

Slewing bearing called slewing ring bearings, is a comprehensive load to bear a large bearing, can bear large axial, radial load and overturning moment.
backup iPhone sms over 5 years later:

Well. Though I am not a good application developer. And I need do more hard work to improve myself. When I come to here. I know that I have come to the right place to learn something I need. Thanks for your good advice. And I will do the practice as possible as I can. Thanks.
hermes berkin over 5 years later:

What is Diaper Rash or Nappy Rash?Diaper rash is a generic term applied to skin irritation that develops in the diaper-covered area that is caused by various disorders and irritants?

Mentor	twitter id
Uncle Bob	unclebobmartin
Brett Schuchert	schuchert
Michael Feathers	mfeathers
Bob Koss	bob_koss