Revisit: The common subgroups

Revisit: The common subgroups 22

Posted by tottinger Tue, 03 Jul 2007 15:43:00 GMT

In cleaning up the code, I simplified the algorithm a very little and improved performance considerably. Amazing how that works, how simpler equals faster for so much code. Adding simple data structures, local explanatory functions, and the like often make code much faster.

What I’m hoping is that I will use this in a few different and useful ways.

The first way is to look for interfaces where concrete classes are being used from many other classes. You need to add an interface, but don’t know who needs which part. The goal is to figure out a relatively small number of interfaces that satisfy a number of clients in a module.

The second use would be to look for common clumps of parameters when I’m working in a large code base where the average number of arguments per function call does not remotely approach one. I suspect that there are clumps of similarly-named variables being passed around, and that these are likely “missed classes”. Sometimes these are obvious, but it would be good to see them in a nice list spit out from a nice tool.

So this is a hopeful start on a series of useful tools.

Code follows.

import shelve
import sys

def find_groups(input):
    """ 
    Exhaustively searches for grouping of items in a map, such
    that an input map like this:
        "first":[1, 2, 3, 4],
        "second":[1,2,3,5,6],
        "third":[1,2,5,6]
    will result in:
        [1,2,3]: ["first","second"]
        [1,2]: ["first","second","third"]
        [5,6]: ["second","third"]

    Note that the return value dict is a mapping of frozensets to sets,
    not lists to lists as given above. Also, being a dict, the results
    are effectively unordered.
    """ 
    def tupleize(data):
        "Convert a set or frozenset or list to a tuple with predictable order" 
        return tuple(sorted(set(data)))

    def append_values(map, key, *values):
        key=tupleize(key)
        old_value = map.get(key,[])
        new_value = list(old_value) + list(values)
        new_value = tupleize(new_value)
        map[key] = new_value
        return key, new_value

    result = {}
    previously_seen = {}
    for input_identity, signatures in input.iteritems():
        input_signatures = set(signatures)
        for signature_seen, identities_seen in previously_seen.iteritems():
            common_signatures = set(signature_seen).intersection(input_signatures)
            if len(common_signatures) > 1:
                known_users = list(identities_seen) + [input_identity]
                append_values(result, common_signatures, *known_users)
        append_values(previously_seen, signatures, input_identity)
    return filter(result)

def filter(subsets):
    filtered = {}
    for key,value in subsets.iteritems():
        if (len(key) > 1) and (len(value) > 1):
            filtered[key] = set(value)
    return filtered

def display_groupings(groupings):
    "Silly helper function to print groupings" 
    keys = sorted(groupings.keys(), cmp=lambda x,y: cmp(len(x),len(y)))
    for key in keys:
        print "\n","-"*40
        for item in key:
            print item
        for item in sorted(groupings[key]):
            print "     ",item
        print

Posted in Tim's Tepid Torrent
Tags algorithm, clean, optimized, python
Meta 22 comments, permalink, rss, atom

Comments

Leave a response

Erik about 4 hours later:

Hi Tim,

Interesting little algorithm indeed.

I wrote a version of it in Java (but I can hardly imagine you’d be interested to see it) which is something “in between” the two versions you published. It took me some time and analysis to understand your first version of the algorithm and see that it was only a slight variation of the one I had written. Maybe I’ll have a look at “good old” Knuth to refresh my understanding of algorithms…

BTW: I think you no longer need the filter function in the version above as you only put entries with at least two keys and two values in the result dictionary. One more improvement, if I’m right :-).
Tim about 6 hours later:

Well, it’s not much of an algorithm. I really thought it was pretty much a brute-force kind of n^2 kind of thing. I’m not bragging on it being a smart algorithm, to be sure. Let me know if the good Dr. Knuth has a better one.

I think you’re right about the filter. It runs faster (of course) and does not fail any of the tests without. It also made me aware that my output is not ordered very well, and I can’t really compare the output of one run of the code against the output of another run against the same data. Ah, it’s always something.

It reminded me of the importance of having diffable output. I thought that removing the filter would have no effect, and I even popped in a printf, but I wanted to see that the output of the production run was identical down to the md5sum.

This seems to be the way that things get complicated.

Even with the ordering changes in the display method, we’re down to about 2.5 seconds on my box. Thanks for noticing!

Now I should run this on bigger data. :-)

Erik 1 day later:

Hi Tim,

This is not an improvement of the algorithm (only an optimization) but the version below (containing only the changed parts) is approximately twice as fast as your second version in a simple test on my system. Maybe it is worth it…

def append_values(map, key, values):
    key=tupleize(key)
    old_value = map.get(key,[])
    new_value = list(old_value) + list(values)
    new_value = set(new_value)
    map[key] = new_value
    return key, new_value

result = {}
tuples = tuple(input.items())

for i in range(len(tuples)):
    input_identity, signatures = tuples[i]
    for j in range(i + 1, len(tuples)):
        input_signatures = set(signatures)
        identities_seen, signature_seen = tuples[j]
        common_signatures = set(signature_seen).intersection(input_signatures)
        if len(common_signatures) > 1:
            known_users = [identities_seen, input_identity]
            append_values(result, common_signatures, known_users)
return result

Note: the equivalent Java version is about thrice as fast but the Python version is more fun :-)

Tim 1 day later:

Is the java three times as fast including the JVM startup time? On my machine, I can’t load the jvm in .2 seconds, let alone complete all the work (including data collection) in .2 seconds. But my real reason for python is certainly the fun, and to a lesser degree the portability compared to java.

I will dig through this and make the same change on my version. Thanks again for playing. It is a fun game. I should put the source on an Mercurial server and make it available. We could maybe come up with some fun uses.
Erik 1 day later:

Hi Tim,

No, the JVM startup time was not included in my previous speed comparison.

On my machine (T5500 1,66GHz Dual Core CPU) starting the JVM takes about 0.2 seconds (with Java 1.6.0) – I must admit that this is much faster than I thought it would be. It therefore doesn’t change the speed ratio as soon as the actual processing requires more than two seconds or so.
ipad 3g converter over 3 years later:

If you are a iPad 3G user, this iPad 3G Converter can help you to convert videos with various video formats to iPad 3G and share it everywhere as you like. It can not only convert videos to the newly developed iPad 3G, but can also convert videos to other mainstream portable devices, including iPad, iPod, iPod Nano, iPod Classic, psp, ps3, etc.If you are a mac user, here we also recommend this iPad 3G Converter for Mac to you which is specially designed for mac users. Here we also suggest you a worthy software pack which called iPad Converter Suite which consists of three software and just cost you 35 dallors, hurry up!
AVI to ipad over 3 years later:

this is th e best
juicy couture tracksuit over 3 years later:

Like your weblog I will subscribe
rosetta stone over 3 years later:

Bid on Power Bracelet now! Find Bracelets. New Colors & Sizes. power balance wholesale Guarantee & Secure Ordering.Power balance USA, Canada, UK Free Shipping. Rosetta Stone allows headstones to provide genealogical and historical information about the deceased directly to cemetery site visitors.
puma over 4 years later:

v
Criminal Records over 4 years later:

It also made me aware that my output is not ordered very well, and I can’t really compare the output of one run of the code against the output of another run against the same data.
Tenant Screening over 4 years later:

It reminded me of the importance of having diffable output. I thought that removing the filter would have no effect, and I even popped in a printf, but I wanted to see that the output of the production run was identical down to the md5sum.
mens polo shirts cheap over 4 years later:

mens polo shirts cheap

polo sweaters
damper over 4 years later:

Our company is engaged in the professional manufacturer of damper, air cylinder, oil cylinder, and hydraulic station. The company has many year’s production experience and strong technical power.
okey oyunu oyna over 4 years later:

thanks so much admin…

Gerçek ki?ilerle sohbet ederek okey oyunu oyna ve internette online oyun oynaman?n zevkini ç?kar.
Monster Beats By Dr. Dre Studio over 4 years later:

The Monster Beats By Dr. Dre Studio Headphones are classic shoes.You will be proud of owning them. Don’t hesitate!Take Monster Beats By Dr. Dre Pro home! Choose our Monster Beats By Dr. Dre Solo Headphones will make you have an amazing discovery.
christian louboutin shoes on sale over 4 years later:

Have the christian louboutin patent leather pumps is a happy thing. Here have the most complete kinds of christian louboutin leather platform pumps.
Hancy over 4 years later:

Hello Friend,Whichever style of Fashion Shoes you’re looking for, classical, fashionable, lovely or the latest design, you can find your favorite designer shoes in www.dunkpage.com ,several days ago I bought one pair of shoes from there,It’s beautiful and very comfortable!
canada goose coat over 4 years later:

Canada Goose Outlet is Marmot 8000M Parka. The Marmot 8000M Parka is really a waterproof, breathable jacket with 800 fill canada goose jacket feathers. It truly is design and light colored shell is produced for trendy, but uncomplicated, protection from cold temperatures. Reinforced shoulders, elbows and adjustable waist and hem make the Marmot a perfect alternate for skiing and other outdoor sports that want fairly a bit of arm motion. The 8000M Parka weighs three lbs., comes in bonfire and black colours and might be stuffed and stored like a sleeping bag to your convenience.This is one of well-know and prime down jacket brands.Hope our friends like its!Like canada goose womens and Canada Goose Expedition Parka.There arewholesale canada goose.
ysbearing over 4 years later:

Slewing bearing called slewing ring bearings, is a comprehensive load to bear a large bearing, can bear large axial, radial load and overturning moment.
chat over 5 years later:

thank you very much executives like you, I yemek tariflerireally like your blog verymetin2 pvp clean and reliable link to a web site trying to get the ban I hope I do not eat many thanks Web sites with the music here is actually fun, chat, friendship, bi kind of people that I’m mt2 pvptrying to create an ideal environment for fusion with each other and chat environments, chatand sexual issues, as well as food, drink, and even my own social networking areas, such as creating bayan giyima single goal is to serve our valuable administrators chathope you understand If you need to make
backup iPhone sms over 5 years later:

Well. Though I am not a good application developer. And I need do more hard work to improve myself. When I come to here. I know that I have come to the right place to learn something I need. Thanks for your good advice. And I will do the practice as possible as I can. Thanks.

Mentor	twitter id
Uncle Bob	unclebobmartin
Brett Schuchert	schuchert
Michael Feathers	mfeathers
Bob Koss	bob_koss

Revisit: The common subgroups 22

Comments

Blog Search

Follow us on twitter

Categories

Blogroll

Syndicate