Saturday, August 10, 2013

Oops!

I'm a reasonably technical guy, but this really surprised me when I first read about it: a broad range of Xerox brand copiers can change the numbers in a document being copied.  Xerox has owned up to the problem, and appears to be approaching the resolution of it in a constructive way.  It may not be just Xerox copiers, either.  As David Kriesel (the researcher who first reported the issue) points out, this can have consequences that are costly (e.g., an invoice with the wrong numbers on it) or even deadly (e.g., a bridge design with the wrong specifications).

My expectation of copiers is that they will produce copies that are perfect duplicates of the original, within the resolution constraints of the device.  They might have optical distortion, color distortion, streaks, etc. – but they shouldn't just move shit around or make shit up!

But truthfully, I should have known better.  After all, I have some idea how a modern copier actually works.  Many people imagine that a copier works kind of like a camera.  Not at all.  Instead, every modern copier is actually a small, special-purpose computer with (at minimum) a scanner and a printer attached.  Fancy copiers may also have a whole bunch of paper handling mechanisms attached, and possibly multiple scanners and printers, too.  But for this discussion, let's stick with a simple copier like you might have in your home office.  We have one such copier, which also works as a scanner and a printer – a multifunction device.

These copiers work like this when you press the “Copy” button:
  1. The original document is scanned into the computer's memory as an image.
  2. The image is compressed and stored on the computer's hard disk.
  3. The image is decompressed and printed.
The second step is where the problem lies.  Actually two problems.

The first one has been long recognized, and isn't the main subject of this post: many (probably most, actually) copiers store copies of the documents scanned on the hard disk indefinitely.  The hard disks can easily store millions of pages of documents, so there's not even much danger of filling one up.  That's a security problem for just about anybody, even home owners, as someone could steal that hard disk and retrieve copies of anything you've ever copied. 
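To put a rough number on "millions of pages", here's a back-of-the-envelope calculation.  Both figures in it are assumptions picked purely for illustration (an 80 GB drive and 50 KB per compressed page); they are not specifications for any particular copier.

```python
# Both numbers below are assumptions for illustration, not real copier specs.
DISK_BYTES = 80 * 10**9   # assumed: an 80 GB hard disk
PAGE_BYTES = 50 * 10**3   # assumed: 50 KB per compressed page image

pages = DISK_BYTES // PAGE_BYTES
print(f"{pages:,} pages")  # 1,600,000 pages
```

Plug in a bigger disk or tighter compression and the count only goes up.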

The other problem is the new one, and it's related to the image compression step.  “Compression” is a method for reducing the size of the image files the copier saves to disk.  The more compression, the more pages that can be stored.  The Xerox copiers (and many others as well) allow users to configure exactly what kind of compression is used.  The most important choice is between lossless and lossy compression.  Lossless compression will make a perfect copy of the document, but without as much compression.  Lossy compression will make an imperfect copy of the document, but with (much) more compression.  Lossy compression methods (algorithms) achieve their higher compression by compromising on some aspect of the copy.  There are many different ways to do this, with different consequences for the quality of the resulting image copy.
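Here's a toy sketch of the lossless/lossy difference.  It is not what any copier actually runs: the "scan" is made-up striped pixel data with a little random noise, zlib stands in for a lossless compressor, and the `quantize` helper (my own invention for this example) stands in for a lossy step by forcing every pixel to pure black or pure white before compressing.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so every run produces the same "scan"

def noisy_pixel(i):
    """Hypothetical grayscale pixel: dark/light stripes plus sensor noise."""
    base = 20 if (i // 8) % 2 == 0 else 230
    return base + rng.randint(-15, 15)

page = bytes(noisy_pixel(i) for i in range(4096))

# Lossless: decompressing gives back exactly the bytes that went in.
lossless = zlib.compress(page)
assert zlib.decompress(lossless) == page

def quantize(pixels):
    """Toy lossy step: snap each pixel to pure black (0) or pure white (255)."""
    return bytes(255 if p >= 128 else 0 for p in pixels)

# Lossy: the noise is thrown away first, so there is far less to store...
lossy = zlib.compress(quantize(page))
restored = zlib.decompress(lossy)

print(len(page), len(lossless), len(lossy))
print(restored == page)  # False: ...but the original can't be recovered
```

The lossy version comes out much smaller because the noise is gone, and that's the whole trade: what comes back is a cleaned-up approximation of the page, not the page itself.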

One of the lossy compression choices Xerox provides is an algorithm called JBIG2.  The preceding link has a good general discussion of how JBIG2 works, but here's the relevant bit for this issue: for text (including numerical data), JBIG2 recognizes characters (much like OCR, if you're familiar with that) on the page.  This recognition process is far from perfect, and can easily result in one character being mistaken for another.  The JBIG2 algorithm keeps track of the context of what it's compressing – if it sees a column of numbers, it's much more likely to match another number than it is to match a letter.  So the mistakes it makes in recognizing characters to match can, in fact, result in choosing the wrong number.

So what?  Well, let's say the original document has a “5”, but JBIG2 matches a “2” instead.  The copy that gets printed will have a “2” where the “5” should have been.  Oops.

The actual process is much more complicated than I just described.  There's one element of that complexity that matters for this discussion: JBIG2 can be configured for different degrees of “looseness” on the character matching.  The problem I described really only matters when this configuration is set for high levels of compression (which use loose matches).
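To make the "looseness" idea concrete, here's a toy sketch of matching character shapes against a dictionary of shapes already seen.  Everything in it is an assumption for illustration: the 3x5 bitmaps are invented, counting differing pixels is my stand-in for whatever similarity measure a real encoder uses, and `max_diff` is a hypothetical name for the looseness setting.  Real JBIG2 encoders are far more elaborate.

```python
# Invented 3x5 bitmaps, rows separated by "/" ('#' = ink, '.' = blank).
GLYPHS = {
    "1": "..#/..#/..#/..#/..#",
    "2": "###/..#/###/#../###",
    "5": "###/#../###/..#/###",
}

def pixel_diff(a, b):
    """Count the positions where two same-sized bitmaps differ."""
    return sum(x != y for x, y in zip(a, b))

def compress(page, max_diff):
    """Return (stored shapes, one index into them per character on the page)."""
    symbols, indexes = [], []
    for bitmap in page:
        for i, known in enumerate(symbols):
            if pixel_diff(bitmap, known) <= max_diff:
                indexes.append(i)  # close enough: reuse the stored shape
                break
        else:
            symbols.append(bitmap)  # nothing similar yet: store a new shape
            indexes.append(len(symbols) - 1)
    return symbols, indexes

def decompress(symbols, indexes):
    """Rebuild the page by pasting in the stored shape for each index."""
    return [symbols[i] for i in indexes]

def read(bitmaps):
    """Name each bitmap by its exact entry in GLYPHS ('?' if there is none)."""
    names = {shape: name for name, shape in GLYPHS.items()}
    return "".join(names.get(b, "?") for b in bitmaps)

page = [GLYPHS[c] for c in "2512"]
for max_diff in (0, 4):
    symbols, indexes = compress(page, max_diff)
    print(max_diff, len(symbols), read(decompress(symbols, indexes)))
# 0 3 2512   (strict: three shapes stored, page reproduced exactly)
# 4 2 2212   (loose: only two shapes stored, and the "5" came back as a "2")
```

In this toy the "2" and the "5" differ by only four pixels, so the loose setting treats them as the same shape and saves one dictionary entry.  The "1" is different enough to survive either way.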

Xerox has a firmware patch that fixes the problem by the simple expedient of disabling JBIG2 compression.  As Mr. Kriesel points out, getting that patch to all of Xerox's copiers in the field is not simple.  Inevitably there will be customers who never get the message – and if they've configured their copier(s) to use JBIG2 at high compression, they could be in for some very rude – and possibly costly or deadly – surprises...

A big oops!
