Archive for the ‘Programming’ Category

Twitter is my randomness entropy

Friday, October 16th, 2009

Inspired by Henry’s Facebook comment.

import urllib2, hashlib

TWITTER = 'http://twitter.com/statuses/public_timeline.json'

def hashed(seq):
    sha1 = hashlib.sha1()
    for x in seq:
        sha1.update(x)
        yield sha1.digest()

def words(wordsize, seq):
    for x in seq:
        for i in xrange(0, len(x), wordsize):
            yield int(x[i:(wordsize + i)].encode('hex'), 16)

def twitrandoms(wordsize = 4, chunksize = 37):
    def infinite_twitrandom():
        while True:
            data = urllib2.urlopen(TWITTER).read()
            for i in range(0, len(data), chunksize):
                yield data[i:(chunksize + i)]
    return words(wordsize, hashed(infinite_twitrandom()))

Help me write idiomatically correct Python

Monday, September 29th, 2008

Why can’t the features I like from every programming language be available all the time? It really bugs me that, for instance, I can’t get lazy lists and patter matching in Python and I can’t get, well, there’s got to be something that Python has that’s sort of a pain in Haskell. I/O, for example.

Anyway, I want code that does this:

import Prelude hiding (Left, Right)

data EitherBoth a = Left a | Right a | Both a a deriving Show

cmpseq :: (Ord a) => (a -> a -> Ordering) -> [a] -> [a] -> [EitherBoth a]
cmpseq _ [] rs = map Right rs
cmpseq _ ls [] = map Left  ls
cmpseq cmp (l:ls) (r:rs) = case (cmp l r) of
                             LT -> Left   l : cmpseq cmp ls  (r:rs)
                             GT -> Right  r : cmpseq cmp (l:ls) rs
                             EQ -> Both l r : cmpseq cmp ls     rs

So given two ordered lists and a comparison function, I want to know (efficiently, and preserving order) which elements are just in one list and which are in both. The Haskell version is pretty clear and pretty concise. (Bonus points: if you have a better way to do this in Haskell — and I’m sure there is one — I’d love to hear about it.)

So far, the best I can do in Python is this rather ugly mess:

def cmpseq(f, xs, ys):
    nx, ny = len(xs), len(ys)
    ix, iy = 0, 0
    while ix < nx and iy < ny:
        r = f(xs[ix], ys[iy])
        if -1 == r:
            yield (xs[ix], None)
            ix = 1 + ix
        elif 1 == r:
            yield (None, ys[iy])
            iy = 1 + iy
        else:
            yield (xs[ix], ys[iy])
            ix, iy = 1 + ix, 1 + iy
    while ix < nx:
        yield (xs[ix], None)
        ix = 1 + ix
    while iy < ny:
        yield (None, ys[iy])
        iy = 1 + iy

Yuck. I’m going to go out on a limb and say that five yields is too many. What’s the idiomatically correct way to do this? Perhaps something to do with feeding values back in to generators? Python language lawyers: please show me the error of my ways!

One-liner syndrom

Tuesday, December 4th, 2007

Jeff Atwood talks about card shuffling algorithms in a recent blog post.

As it turns out, the easiest way to implement a shuffle is by sorting. It’s not exactly faster, as the typical sort is O(n log n) compared to the O(n) of the Knuth Fisher-Yates shuffle algorithm. We’ll just sort by a random number– in this case, a GUID.

   var cards = Enumerable.Range(0, 51);
   var shuffledcards = cards.OrderBy(a => Guid.NewGuid());

So we can ultimately implement a secure, unbiased shuffle as a one-liner in a modern programming language.

The obligatory nerd response here is to bang out the one-liner in your language of choice. Here’s one for Python:

>>> from random import random
>>> shuffled = [ b for (a,b) in sorted((random(), x) for x in cards) ]

Synchronous HTTP GET in JavaScript

Wednesday, October 17th, 2007

Taking the “A” out of “AJAX.”

 function get(url) {
  var ajax = new XMLHttpRequest();
  ajax.open('GET', url, false);
  ajax.send(null);
  return ajax.responseText;
}

Of course this will be different for IE….

Roman Numeral Parser

Wednesday, September 19th, 2007

I wondered the other night how straightforward a task parsing roman numerals (up to 4999) is. As revealed by about 30 seconds of Googling: pretty straightforward. Here it is in Python:

import re

ROMAN_RE = re.compile("""
    ^                   # beginning of string
    (M{0,4})            # thousands - 0 to 4 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
""", re.VERBOSE)

ROMAN_DIGITS  = {
    'M'    : 1000, 'MM'   : 2000, 'MMM'  : 3000, 'MMMM' : 4000,
    'CM'   : 900,  'CD'   : 400,  'D'    : 500,  'DC'   : 600,
    'DCC'  : 700,  'DCCC' : 800,  'C'    : 100,  'CC'   : 200,
    'CCC'  : 300,  'XC'   : 90,   'XL'   : 40,   'L'    : 50,
    'LX'   : 60,   'LXX'  : 70,   'LXXX' : 80,   'X'    : 10,
    'XX'   : 20,   'XXX'  : 30,   'IX'   : 9,    'IV'   : 4,
    'V'    : 5,    'VI'   : 6,    'VII'  : 7,    'VIII' : 8,
    'I'    : 1,    'II'   : 2,    'III'  : 3
}

def rtoi(roman):
    match = ROMAN_RE.match(roman.upper())
    if match and 0 < sum(1 for x in match.groups() if 0 < len(x)):
        return sum(ROMAN_DIGITS[x] for x in match.groups() if 0 < len(x))
    return None

Pleasing small victories

Monday, September 10th, 2007

So a coworker got me hooked on Project Euler today, and I’ve been spending far too much of my free time on it. However, it did lead to a very pleasing moment.

One of the problems deals with Collatz sequences: For any number, divide by two if the number is evenly divisible by two, multiply by three and add one otherwise. Stop when you reach the number 1. The task at hand was to find the number between 1 and 1,000,000 with the longest Collatz sequence.

My first attempt at counting this sequence, rather predictably, failed on large input:

 collatzsz 1 = 1
 collatzsz n = 1 + collatzsz nxt where
     nxt = if 0 == mod n 2 then div n 2 else 3 * n + 1

Duh. Stack overflows everywhere. (Note: there’s probably a flag for GHC that will make this Just Work.) I figured that Haskell would probably go ahead and optimize tail calls for me, so I gave it a go:

 collatzsz 1 accum = 1 + accum
 collatzsz n accum = collatzsz nxt (1+accum) where
     nxt = if 0 == mod n 2 then div n 2 else 3 * n + 1

Sure enough, it works for even the most obnoxiously large values of n.

Yes, I know that this is exactly what the language is supposed to do, but sometimes it seems like a small victory when a piece of software does even that. So thanks to the good folks who make GHC. You made me smile today.

An ode to Python 3000

Tuesday, September 4th, 2007

Inspired by the what’s new page:

 def reduce(f, seq, val = None):
     if val is None: val = next(seq)        
     for x in seq:
         val = f(val, x)
     return val

Java programmers are funny

Sunday, September 2nd, 2007

Having dealt with code of the ObjectModelAdapterFactoryBuilderInterfaceFacade variety before, I of course find Stevey’s comments about Java to be spot on.

Sun Microsystems Demands University Study Retraction

The University of Washington, apparently hoping to capitalize on the recent hype around their controversial study on Baby Einsteinâ„¢-style videos, followed up yesterday with another, similar study. In the new study, researchers found that Java programmers understand an average of seven fewer Computer Science concepts per hour spent with Java each day compared to similar programmers using other languages. Sun calls the study “seriously flawed”, citing the fact that you can combine the names of Gang of Four Design Patterns to form new Computer Science concepts that all Java programmers understand, such as the ObserverFactoryBridge, the BridgeFactoryObserver, and the well-known FactoryObserverBridgeChainOfCommandSingletonProxy, beloved of Java programmers everywhere. Java experts at Sun say they’re not sure how many combinations there are of the twenty-three pattern names, but there are “definitely a lot of them.”

Haskell Unit Tests with HUnit

Thursday, August 23rd, 2007

I read an OnLamp article today called A Beautiful Regex Matcher… In Haskell. Interesting little piece of code, but there was a (silly little) bug in it. This struck me as a fine opportunity to figure out how HUnit — one of Haskell’s unit testing frameworks — works. Here’s what I came up with:

import Test.HUnit
import Regex

tests = TestList [ TestLabel (makeLabel x) (makeTest x) | x <- test_data ]
    where makeTest  tup@(reg, str, bool) = 
              TestCase $ assertEqual (makeLabel tup) bool (match reg str)
          makeLabel (reg, str, bool) | True == bool  = reg ++ " =~ " ++ str
                                     | False == bool = reg ++ " !~ " ++ str
          test_data = [ ("foo",  "foo",     True),
                        ("foo",  "bar",     False),
                        ("",     "foo",     True),
                        (".",    "Foo",     True),
                        (".*",   "Foo",     True),
                        ("F.",   "Foo",     True),
                        ("F.*",  "Foo",     True),
                        ("Fo*",  "Foo",     True),
                        ("Fo*d", "foo",     False),
                        ("Fo*d", "Foooood", True) ]

main = runTestTT tests

As mentioned in the article’s comments, the case where “.*” (“match anything”) is supposed to match “Foo” fails. Here’s the test output for the original code:

% runhaskell TestRegex.hs
### Error in:   4:.* =~ Foo                
user error (HUnit:.* =~ Foo
expected: True
 but got: False)
### Error in:   6:F.* =~ Foo               
user error (HUnit:F.* =~ Foo
expected: True
 but got: False)
 Cases: 10  Tried: 10  Errors: 2  Failures: 0
Counts {cases = 10, tried = 10, errors = 2, failures = 0}
% 

Easy enough. Finding out which cases fail makes tracking the problem down relatively straightforward. Turns out that the “matchstar” method doesn’t deal properly with periods at the end of the match. The original code:

matchstar c [] (y:ys) = c == y

Replace it with the following:

matchstar c [] (y:ys) = if c == '.' then True else c == y

And you get the following out of the unit tests:

% runhaskell TestRegex.hs
Cases: 10  Tried: 10  Errors: 0  Failures: 0
Counts {cases = 10, tried = 10, errors = 0, failures = 0}
% 

Horray!

New in Python 2.6: ORM made easy

Sunday, August 19th, 2007

From the What’s New in Python 2.6 page:

A new data type in the collections module: NamedTuple(typename, fieldnames) is a factory function that creates subclasses of the standard tuple whose fields are accessible by name as well as index. For example:

var_type = collections.NamedTuple(‘variable’, ‘id name type size’)

var = var_type(1, ‘frequency’, ‘int’, 4)

print var[0], var.id # Equivalent

print var[2], var.type # Equivalent

Neat! It all exists in libraryspace, so it’s not a groundbreaking change to the language. Still, this has to save time. Without this creating the same sort of class would have taken something like:

class variable(object):
    def __init__(self, id, name, typ, size):
      self.id, self.name, self.type, self.size = id, name, typ, size

Not too much typing, but certainly more, and you can’t even access it as a tuple without additional work. I wonder if the various Python ORM libraries use techniques like this?