Tuesday, February 23, 2016

Zipf's Law holds...  Zipf's Law is one of those observed mathematical relationships that is surprising because it's so simple – and because there's no generally accepted mechanism to explain it.  I first ran into it when studying cryptography in the Navy, back in the '70s.  Basically, Zipf's Law says that in any given text, the most common word will appear about twice as often as the second most common word, three times as often as the third most common, and so on.  If you're trying to decrypt the bad guy's messages, and the bad guys used an old-fashioned cipher or code, knowing the way that words (and letters) are distributed can help.  That's much less true with modern ciphers, though.
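
Put another way, if the most common word shows up N times, the word at rank r should show up roughly N / r times.  Here's a rough sketch of how you might check that against a text of your own.  The function names and the simple word-splitting rule are my own assumptions for illustration, not anything taken from the Gutenberg test.

```python
from collections import Counter
import re


def rank_frequencies(text):
    """Return (word, count) pairs, most common first."""
    # Assumption: a "word" is a run of letters/apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common()


def zipf_prediction(top_count, rank):
    """Count Zipf's Law predicts at a given rank: top_count / rank."""
    return top_count / rank


def compare_to_zipf(text, n=10):
    """Return (rank, word, actual_count, predicted_count) for the top n words."""
    ranked = rank_frequencies(text)[:n]
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, zipf_prediction(top_count, rank))
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    sample = "the " * 6 + "of " * 3 + "and " * 2
    for row in compare_to_zipf(sample):
        print(row)
```

On a real text the actual and predicted counts won't match exactly; the interesting part is how close they stay as the rank climbs.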

Anyway, I remember thinking at the time that Zipf's Law was too simple to be for real.  I imagined that it was an artifact of military messages, because of the strange, stilted, passive-voice-filled, and acronym-stuffed prose the military favored.  Turns out that's not the case – in a recent test against the entire body of text held by Project Gutenberg, Zipf's Law largely still holds.  How bizarre!

