Revisit: The common subgroups 5
In cleaning up the code, I simplified the algorithm a very little and improved performance considerably. Amazing how that works, how simpler equals faster for so much code. Adding simple data structures, local explanatory functions, and the like often make code much faster.
What I’m hoping is that I will use this in a few different and useful ways.
The first way is to look for interfaces where concrete classes are being used from many other classes. You need to add an interface, but don’t know who needs which part. The goal is to figure out a relatively small number of interfaces that satisfy a number of clients in a module.
The second use would be to look for common clumps of parameters when I’m working in a large code base where the average number of arguments per function call does not remotely approach one. I suspect that there are clumps of similarly-named variables being passed around, and that these are likely “missed classes”. Sometimes these are obvious, but it would be good to see them in a nice list spit out from a nice tool.
So this is a hopeful start on a series of useful tools.
Code follows.
import shelve
import sys
def find_groups(input):
"""
Exhaustively searches for grouping of items in a map, such
that an input map like this:
"first":[1, 2, 3, 4],
"second":[1,2,3,5,6],
"third":[1,2,5,6]
will result in:
[1,2,3]: ["first","second"]
[1,2]: ["first","second","third"]
[5,6]: ["second","third"]
Note that the return value dict is a mapping of frozensets to sets,
not lists to lists as given above. Also, being a dict, the results
are effectively unordered.
"""
def tupleize(data):
"Convert a set or frozenset or list to a tuple with predictable order"
return tuple(sorted(set(data)))
def append_values(map, key, *values):
key=tupleize(key)
old_value = map.get(key,[])
new_value = list(old_value) + list(values)
new_value = tupleize(new_value)
map[key] = new_value
return key, new_value
result = {}
previously_seen = {}
for input_identity, signatures in input.iteritems():
input_signatures = set(signatures)
for signature_seen, identities_seen in previously_seen.iteritems():
common_signatures = set(signature_seen).intersection(input_signatures)
if len(common_signatures) > 1:
known_users = list(identities_seen) + [input_identity]
append_values(result, common_signatures, *known_users)
append_values(previously_seen, signatures, input_identity)
return filter(result)
def filter(subsets):
filtered = {}
for key,value in subsets.iteritems():
if (len(key) > 1) and (len(value) > 1):
filtered[key] = set(value)
return filtered
def display_groupings(groupings):
"Silly helper function to print groupings"
keys = sorted(groupings.keys(), cmp=lambda x,y: cmp(len(x),len(y)))
for key in keys:
print "\n","-"*40
for item in key:
print item
for item in sorted(groupings[key]):
print " ",item
print

Hi Tim,
Interesting little algorithm indeed.
I wrote a version of it in Java (but I can hardly imagine you’d be interested to see it) which is something “in between” the two versions you published. It took me some time and analysis to understand your first version of the algorithm and see that it was only a slight variation of the one I had written. Maybe I’ll have a look at “good old” Knuth to refresh my understanding of algorithms…
BTW: I think you no longer need the filter function in the version above as you only put entries with at least two keys and two values in the result dictionary. One more improvement, if I’m right :-).
Well, it’s not much of an algorithm. I really thought it was pretty much a brute-force kind of n^2 kind of thing. I’m not bragging on it being a smart algorithm, to be sure. Let me know if the good Dr. Knuth has a better one.
I think you’re right about the filter. It runs faster (of course) and does not fail any of the tests without. It also made me aware that my output is not ordered very well, and I can’t really compare the output of one run of the code against the output of another run against the same data. Ah, it’s always something.
It reminded me of the importance of having diffable output. I thought that removing the filter would have no effect, and I even popped in a printf, but I wanted to see that the output of the production run was identical down to the md5sum.
This seems to be the way that things get complicated.
Even with the ordering changes in the display method, we’re down to about 2.5 seconds on my box. Thanks for noticing!
Now I should run this on bigger data. :-)
Hi Tim,
This is not an improvement of the algorithm (only an optimization) but the version below (containing only the changed parts) is approximately twice as fast as your second version in a simple test on my system. Maybe it is worth it…
Note: the equivalent Java version is about thrice as fast but the Python version is more fun :-)
Is the java three times as fast including the JVM startup time? On my machine, I can’t load the jvm in .2 seconds, let alone complete all the work (including data collection) in .2 seconds. But my real reason for python is certainly the fun, and to a lesser degree the portability compared to java.
I will dig through this and make the same change on my version. Thanks again for playing. It is a fun game. I should put the source on an Mercurial server and make it available. We could maybe come up with some fun uses.
Hi Tim,
No, the JVM startup time was not included in my previous speed comparison.
On my machine (T5500 1,66GHz Dual Core CPU) starting the JVM takes about 0.2 seconds (with Java 1.6.0) – I must admit that this is much faster than I thought it would be. It therefore doesn’t change the speed ratio as soon as the actual processing requires more than two seconds or so.