Revisit: The common subgroups 20

Posted by tottinger Tue, 03 Jul 2007 15:43:00 GMT

In cleaning up the code, I simplified the algorithm a very little and improved performance considerably. Amazing how that works, how simpler equals faster for so much code. Adding simple data structures, local explanatory functions, and the like often make code much faster.

What I’m hoping is that I will use this in a few different and useful ways.

The first way is to look for interfaces where concrete classes are being used from many other classes. You need to add an interface, but don’t know who needs which part. The goal is to figure out a relatively small number of interfaces that satisfy a number of clients in a module.

The second use would be to look for common clumps of parameters when I’m working in a large code base where the average number of arguments per function call does not remotely approach one. I suspect that there are clumps of similarly-named variables being passed around, and that these are likely “missed classes”. Sometimes these are obvious, but it would be good to see them in a nice list spit out from a nice tool.

So this is a hopeful start on a series of useful tools.

Code follows.

import shelve
import sys

def find_groups(input):
    """ 
    Exhaustively searches for grouping of items in a map, such
    that an input map like this:
        "first":[1, 2, 3, 4],
        "second":[1,2,3,5,6],
        "third":[1,2,5,6]
    will result in:
        [1,2,3]: ["first","second"]
        [1,2]: ["first","second","third"]
        [5,6]: ["second","third"]

    Note that the return value dict is a mapping of frozensets to sets,
    not lists to lists as given above. Also, being a dict, the results
    are effectively unordered.
    """ 
    def tupleize(data):
        "Convert a set or frozenset or list to a tuple with predictable order" 
        return tuple(sorted(set(data)))

    def append_values(map, key, *values):
        key=tupleize(key)
        old_value = map.get(key,[])
        new_value = list(old_value) + list(values)
        new_value = tupleize(new_value)
        map[key] = new_value
        return key, new_value

    result = {}
    previously_seen = {}
    for input_identity, signatures in input.iteritems():
        input_signatures = set(signatures)
        for signature_seen, identities_seen in previously_seen.iteritems():
            common_signatures = set(signature_seen).intersection(input_signatures)
            if len(common_signatures) > 1:
                known_users = list(identities_seen) + [input_identity]
                append_values(result, common_signatures, *known_users)
        append_values(previously_seen, signatures, input_identity)
    return filter(result)

def filter(subsets):
    filtered = {}
    for key,value in subsets.iteritems():
        if (len(key) > 1) and (len(value) > 1):
            filtered[key] = set(value)
    return filtered

def display_groupings(groupings):
    "Silly helper function to print groupings" 
    keys = sorted(groupings.keys(), cmp=lambda x,y: cmp(len(x),len(y)))
    for key in keys:
        print "\n","-"*40
        for item in key:
            print item
        for item in sorted(groupings[key]):
            print "     ",item
        print