Saturday, April 8, 2017

A real-life variable-length encoding challenge...

A real-life variable-length encoding challenge...  I've been poking around the idea of building a “money” class for Java.  This would be based on a purpose-made decimal floating point class, with better performance than BigDecimal and with features aimed at use in monetary calculations.  That decimal floating point class will need a way to encode the decimal exponent of the number held.

There are several interesting questions when it comes to the necessary range of that exponent.  For US dollars, a range of 10^-5 (thousandths of cents) to 10^14 (hundreds of trillions of dollars) would seem to be sufficient for just about any financial purpose.  That might not be true for other currencies, though the range of 20 orders of magnitude is probably pretty close to the mark in any currency.  For the rest of this discussion, I'll assume 20 orders of magnitude is what we need.

How should we encode that exponent?  All the existing floating point representations I'm aware of do that with a biased binary number.  For this case, they'd use a 5 bit number (the smallest number of bits that can hold 20 states) with a bias of 5 (meaning that you subtract 5 from the binary number to get the actual exponent).  So, for example, the exponent 3 would be represented by the binary number 01000 (or 8 decimal), as 8 - 5 = 3.  This scheme is very simple, allows asymmetry between the positive and negative ranges, and uses a fixed-length encoding.  If there is a requirement for fixed-length encoding, I don't know any way to beat this method.  But ... those 5 bits could hold 32 states, and we only need 20.  We could expand the range possible to store (that's the usual solution), but for the case we're interested in those extra states would be wasted.  What we'd really like would be a way to use less bits – and if two things are true, that's entirely possible.

First, it must be allowable (and useful!) to use a variable-length encoding.  I the case I'm working, that is allowable, and it's also very useful (in particular, when storing large lists of numbers, like a bookkeeping journal).  Second, the distribution of the numbers to be encoded must be uneven in a reasonably predictable way.  Intuitively it seems clear that monetary amounts in a typical bookkeeping system are distributed unevenly.  It's less clear how predictable that distribution is, but again, intuitively, it seems like at least the shape of the distribution curve should be similar for any size organization.

I happen to have a sample bookkeeping journal leftover from a past work engagement: real data, with about 40,000 numeric entries.  With just a little work, I came up with the exponent distribution at right (click to embiggen).  Three quarters of all the entries are either in the tens of dollars or hundreds of dollars range!  The frequencies fall off very quickly above and below those two exponents.  That's pretty darned uneven, which means we most definitely have an opportunity for a variable-length encoding.

Those frequencies are expressed in a normalized manner, so they all sum to 1.  That means we can use a simple formula to calculate the average encoding length for any encoding we want to test.

The first one I tried has two lengths selected by the leading bit.  The exponent would be encoded as either 0xx (for exponents 0..3) or 1xxxx (for exponents -5..-1 and 4..14).  The average bit length for this encoding on the data I have would be 0.975 * 3 + 0.025 * 5 = 3.05.  That's a pretty good payoff for such an encoding!  Note that in no case does the length exceed 5 bits, which is what we'd have with fixed-length encoding.  Also note that the first form has 4 states and the second 16 states, for a total of 20 states – exactly what we need.  There are no “wasted” states.

The second one I tried has two lengths, with a second (longer) encoding selected by an escape code.  The short code would be just two bits long, with 0..2 representing exponents 0..2, and 3 being the escape indicating a second 5 bit code following (so a total of 7 bits).  This second code would represent exponents -10..-1 and 3..24 – a range considerably larger than what we actually need.  The average bit length for this encoding on the data I have would be 0.915 * 2 + 0.085 * 7 = 2.425.  That's less than half the fixed-length encoding of 5 bits.  However, this encoding has a longer maximum length (7 bits vs. 5 bits) and a bunch of “wasted” states.

I haven't yet decided what encoding I'll actually use, though I'm pretty sure I'll be using some variable-length scheme.  Also, stealing an idea from Gustafson, I'm going to make the parameters (including length and encoding details) configurable, so no matter what I supply as default the end user can still configure his way out of any bind I create.  I would like the defaults to be well-chosen, though.  There are several things I need to investigate a bit more, especially (a) the differences between different currencies, and (b) the differences between kinds of financial applications and the size of the organization using them.  All I really have to go on there is my intuition, derived from several decades of experience.  I'd rather have hard data. :)  Don't know where I'm going to find that, though! 

Paradise ponders, slippery mud, wet wood, and sushi edition...

Paradise ponders, slippery mud, wet wood, and sushi edition...  I was able to spend most of day yesterday (and some this morning, too!) working on my stairs project.  In the first photo below, you see the results of the first round of sanding off the black (using 100 grit and the orbital sander).  The large landing has one round of sanding done (second photo is a closeup), while the smaller step is unsanded.  Quite a difference!  I ended up doing two rounds at 100 grit and a final round at 220 grit, and in the end I was pretty happy with how it looked.  The black color stayed in the dents as I'd originally intended, but it also was infused in some other areas in a way that resembles how wood weathers.  All good!  Then I started applying stain: two coats yesterday, another this morning.  The result (still wet) is in the third photo.  I think I'm going to stop at three, as it looks pretty good.  The stain has the side-effect of making the blackened areas a bit indistinct, again more like wood actually weathers.  Once these stairs are in the sun room, the little bit of red still visible will gradually fade, and the white parts will yellow slightly – both of these effects will, I think, improve the appearance.  So far I've only stained the top; later today I'll flip it over and start the process on the bottom.  By tomorrow I believe I'll be applying the matte polyurethane...

We went out for our lunch yesterday to a local favorite: Black Pearl.  I had sushi (my seafood nanudo roll at right); Debbie had their house pad thai.  Both were excellent!  We were both stuffed by the time we rolled out of there, and Debbie had some leftovers, too.  Not me. :)  On the way there we had an odd thing happen with the Model X.  From appearances, the GPS stopped reporting the car's position to the navigation system.  The car's symbol was stuck in one place on the map.  Nothing I did made any difference at all.  After we ate and started home, the position was still stuck – but only for a couple of blocks.  Then it suddenly started working again, and has been since then.  I've no idea what might have caused that.  The Tesla forums have a couple of mentions of the same problem, with the exact same outcome, so I'm not too worried about it.

Last night and early this morning we got slammed with a big rainstorm.  The snapshot at right shows the total storm accumulation; we got just over a half inch.  That wouldn't be so bad except that our soils were already super-saturated with water – so nearly all the water from this storm is either sitting on the surface or turning the first inch or so of soil into a slurry.  Some areas to our west and southwest got over 2" in this storm.  These are areas that are normally very dry, so this will likely cause a burst of plant growth – and wildflowers!  We have another storm approaching, too – starting around noon today we're expecting rain for 12 hours or so, with another half inch or a bit more in precipitation.  All this water is going to make our sprinkler project quite a bit harder.  I suspect Mark T. will be waiting a few days for things to dry out at least a little.  After today we've got 10 days with little chance of rain in the forecast, with highs in the upper 50s or low 60s, light winds, and lots of sun.  A few days of that and we'll dry out a bit, I hope!  The farmers around here who didn't plant this week are going to have to wait another week or so before their tractors can make it through the fields...