Tuesday, October 28, 2014

Scientists and floating point math...

An article I read this morning makes a valiant attempt to demystify floating point math for scientists.  I'm not sure it succeeds :)  Reading it reminded me of an experience I had back in the early '80s.

A San Diego company had themselves a real problem.  They'd contracted with an independent consulting engineer (very common back then for anything involving microcomputers) to develop a TTY modem that ran on a single-chip microcomputer.  This was to be part of a device that would allow deaf people to use telephones.  Telephone companies had recently been mandated to provide such devices for deaf customers, and this company was trying to win the bid to supply the local telephone company with thousands of them.

The problem was that the engineer they'd hired had quit in a huff, claiming to have accidentally destroyed the source code for the firmware on the chip.  All the company had was a single mostly-working chip, from which the object code could be read.  They wanted someone to reverse-engineer that code, deliver reconstructed source code, and (ideally) fix the problems.

I took that contract, but only on an hourly rate basis, as I had no idea what I was going to find.  The company didn't like that much, but they also had no alternatives.

The firmware had originally been written in assembly language, as most performance-critical code was back then.  That was a mixed blessing.  It was easy to get source code back from the object code, minus any useful symbol names – but quite challenging to actually understand the code well enough to assign those names.  I spent several weeks totally immersed in that code.

A big part of the code turned out to be a floating point package, one that the previous engineer had apparently written himself.  I had just written one of these a few years prior (as part of Tarbell Basic), and I'd also written one while in the Navy, so I had some familiarity with how to do this.  This particular package had some trigonometric functions in it that were used by the modem software – and those trig functions were written in a way that caused cumulative errors of exactly the kind described in the linked article.

So I went back to the company that hired me and told them I'd found a significant problem, and that it was one that could be fixed.  Naturally, they wanted to know what the problem was.  While explaining it to the engineering team, I discovered that the source of the trig function algorithm was a scientist who worked for the company: he had handed that algorithm to the former programmer, who implemented it as asked.  So the source of the problem wasn't the programmer, but rather this scientist (a physics guy who knew about digital signal processing).  Next thing I know, I'm lecturing this scientist on the details of how floating point works, and he really, really didn't want to hear it :)  In the end, the only way I could convince him was by constructing simple test cases and showing him the results.
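To give you the flavor of those test cases, here's a little sketch in C.  The incremental rotation below is a hypothetical stand-in for the kind of algorithm that accumulates rounding error with every step (it's not the actual firmware code); comparing it against a direct computation makes the drift hard to argue with:

/* Generate the same sine value two ways.  The incremental rotation is a
 * hypothetical stand-in for an algorithm that accumulates rounding error;
 * it is not the actual algorithm from that firmware. */
#include <stdio.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

int main(void) {
    const double step = 0.01;      /* phase advance per iteration (radians) */
    const long   n    = 1000000;   /* a million little steps */

    /* Method 1: rotate the unit vector (cos, sin) incrementally.
     * Every rotation rounds a little, and those roundings accumulate. */
    double c = cos(step), s = sin(step);
    double x = 1.0, y = 0.0;                /* (cos 0, sin 0) */
    for (long i = 0; i < n; i++) {
        double xn = x * c - y * s;          /* rotate by 'step' */
        double yn = x * s + y * c;
        x = xn;
        y = yn;
    }

    /* Method 2: compute sin(n * step) directly; its error stays bounded
     * instead of accumulating. */
    double direct = sin(fmod((double)n * step, 2.0 * M_PI));

    printf("incremental: %.15f\n", y);
    printf("direct:      %.15f\n", direct);
    printf("difference:  %.3e\n", y - direct);
    return 0;
}

Run it and the two answers visibly disagree: each individual rounding is down around the sixteenth decimal place, but a million of them add up to something you can point at.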

There was a happy ending to all this, though.  Once I convinced the scientist of the errors his algorithm caused, he was willing to listen to some alternatives.  One of them was a well-known approach that avoided the cumulative error problem and was also many, many times faster than his algorithm: a simple polynomial approximation.  I didn't invent this; I got it from a book written in the '50s by someone with the appropriate degree.  Even the scientist was happy with this one!
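Here's the flavor of that fix, sketched in C rather than the original assembly.  The coefficients below are just the truncated Taylor series for sine, used purely for illustration; a tuned (minimax) polynomial of the same degree squeezes out quite a bit more accuracy.  Either way, each call computes its result from scratch, so there's nothing to accumulate:

/* A simple polynomial approximation of sine, good on [-pi/2, pi/2].
 * (Reduce the argument into that range first for other inputs.)
 * Coefficients are the truncated Taylor series, for illustration only;
 * a minimax fit of the same degree is more accurate. */
#include <stdio.h>
#include <math.h>

static double sin_poly(double x) {
    double x2 = x * x;
    /* sin x ~= x - x^3/6 + x^5/120 - x^7/5040, evaluated Horner-style:
     * a handful of multiplies and adds, no state carried between calls. */
    return x * (1.0 + x2 * (-1.0 / 6.0
                + x2 * (1.0 / 120.0
                + x2 * (-1.0 / 5040.0))));
}

int main(void) {
    /* Compare against the C library's sine across the valid range. */
    for (double x = -1.5; x <= 1.5; x += 0.5)
        printf("x = %5.2f   poly = % .9f   libm = % .9f\n",
               x, sin_poly(x), sin(x));
    return 0;
}

A short, fixed sequence of multiplies and adds is also exactly what you want on a tiny microcomputer, which is where the "many, many times faster" came from.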

