## Saturday, April 8, 2017

### A real-life variable-length encoding challenge...

A real-life variable-length encoding challenge...  I've been poking around the idea of building a “money” class for Java.  This would be based on a purpose-made decimal floating point class, with better performance than BigDecimal and with features aimed at use in monetary calculations.  That decimal floating point class will need a way to encode the decimal exponent of the number held.

There are several interesting questions when it comes to the necessary range of that exponent.  For US dollars, a range of 10^-5 (thousandths of cents) to 10^14 (hundreds of trillions of dollars) would seem to be sufficient for just about any financial purpose.  That might not be true for other currencies, though the range of 20 orders of magnitude is probably pretty close to the mark in any currency.  For the rest of this discussion, I'll assume 20 orders of magnitude is what we need.

How should we encode that exponent?  All the existing floating point representations I'm aware of do that with a biased binary number.  For this case, they'd use a 5 bit number (the smallest number of bits that can hold 20 states) with a bias of 5 (meaning that you subtract 5 from the binary number to get the actual exponent).  So, for example, the exponent 3 would be represented by the binary number 01000 (or 8 decimal), as 8 - 5 = 3.  This scheme is very simple, allows asymmetry between the positive and negative ranges, and uses a fixed-length encoding.  If there is a requirement for fixed-length encoding, I don't know any way to beat this method.  But ... those 5 bits could hold 32 states, and we only need 20.  We could expand the range possible to store (that's the usual solution), but for the case we're interested in those extra states would be wasted.  What we'd really like would be a way to use less bits – and if two things are true, that's entirely possible.

First, it must be allowable (and useful!) to use a variable-length encoding.  I the case I'm working, that is allowable, and it's also very useful (in particular, when storing large lists of numbers, like a bookkeeping journal).  Second, the distribution of the numbers to be encoded must be uneven in a reasonably predictable way.  Intuitively it seems clear that monetary amounts in a typical bookkeeping system are distributed unevenly.  It's less clear how predictable that distribution is, but again, intuitively, it seems like at least the shape of the distribution curve should be similar for any size organization.

I happen to have a sample bookkeeping journal leftover from a past work engagement: real data, with about 40,000 numeric entries.  With just a little work, I came up with the exponent distribution at right (click to embiggen).  Three quarters of all the entries are either in the tens of dollars or hundreds of dollars range!  The frequencies fall off very quickly above and below those two exponents.  That's pretty darned uneven, which means we most definitely have an opportunity for a variable-length encoding.

Those frequencies are expressed in a normalized manner, so they all sum to 1.  That means we can use a simple formula to calculate the average encoding length for any encoding we want to test.

The first one I tried has two lengths selected by the leading bit.  The exponent would be encoded as either 0xx (for exponents 0..3) or 1xxxx (for exponents -5..-1 and 4..14).  The average bit length for this encoding on the data I have would be 0.975 * 3 + 0.025 * 5 = 3.05.  That's a pretty good payoff for such an encoding!  Note that in no case does the length exceed 5 bits, which is what we'd have with fixed-length encoding.  Also note that the first form has 4 states and the second 16 states, for a total of 20 states – exactly what we need.  There are no “wasted” states.

The second one I tried has two lengths, with a second (longer) encoding selected by an escape code.  The short code would be just two bits long, with 0..2 representing exponents 0..2, and 3 being the escape indicating a second 5 bit code following (so a total of 7 bits).  This second code would represent exponents -10..-1 and 3..24 – a range considerably larger than what we actually need.  The average bit length for this encoding on the data I have would be 0.915 * 2 + 0.085 * 7 = 2.425.  That's less than half the fixed-length encoding of 5 bits.  However, this encoding has a longer maximum length (7 bits vs. 5 bits) and a bunch of “wasted” states.

I haven't yet decided what encoding I'll actually use, though I'm pretty sure I'll be using some variable-length scheme.  Also, stealing an idea from Gustafson, I'm going to make the parameters (including length and encoding details) configurable, so no matter what I supply as default the end user can still configure his way out of any bind I create.  I would like the defaults to be well-chosen, though.  There are several things I need to investigate a bit more, especially (a) the differences between different currencies, and (b) the differences between kinds of financial applications and the size of the organization using them.  All I really have to go on there is my intuition, derived from several decades of experience.  I'd rather have hard data. :)  Don't know where I'm going to find that, though!