[This entry is a bit long. Maybe I should consider removing it
from this 'blog and putting it on a page of its own, just like I did
for my countries page.]
I was thinking, the other day, about lossy compression techniques
(of sound, images, video and so on) because I had been reading yet
another endless discussion (I forget where) as to whether Ogg Vorbis is, or is not,
better than MP3 / WMA / whatever, and I'd like to share a few of a
(pure) mathematician's thoughts.
The point of a compression method is to take some
initial data (an image, piece of sound, video, or whatever) and
represent them as a sequence of bits (the compressed
representation) in a way that takes fewer bits (in typical
situations) than the “naïve” way of representing the data.
If the compressed representation determines the initial data
exactly then the compression is called lossless,
otherwise it is lossy.
It is a (pretty obvious) theorem that there is no generic lossless
compression technique: precisely, there is no way to compress random
data (in the statistical sense, that is), since any encoding that shrinks
some inputs must necessarily expand others—a fact which did not
prevent some idiots from trying to patent various techniques for
compressing random data, but no matter. So all forms of compression
depend on particularities of the input data: and, indeed, we rarely
store completely random data (though encrypted data are sometimes
stored, and, in the absence of knowledge of the key, they are
essentially random if the cipher is of any value—this is why
data should always be compressed before they are encrypted,
because afterward it is too late). The closest thing to a
generic compression method would be to store the smallest number of a
Turing machine that produces the data in question when started from a
blank string: this is essentially the best possible, because it
compresses any amount of data precisely down to its Kolmogoroff
complexity (indeed, this is essentially the very definition of
Kolmogoroff complexity). Unfortunately, this compression method is
not computable (even in theory), because the halting problem is
undecidable. A theoretically possible method consists of choosing
some diagonal enumeration of Turing machine numbers and time limits
and executing the machines in parallel along the chosen diagonal:
this gives a resource-bounded Kolmogoroff complexity for the
given choice of the diagonal enumeration, which is, essentially, a
size-versus-decompression-time tradeoff. Unfortunately, even this
theoretically possible best compression technique (which is trivial to
implement and, for a suitable choice of tradeoff, asymptotically
outperforms any other known compression technique) is unusable
in practice, because it would take just about forever to compress
“Hello, world!” (and probably would not even be very good at it).
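To make the dovetailing idea concrete, here is a minimal sketch in Python (entirely my own illustration, and of course utterly impractical): Brainfuck programs, enumerated by increasing length, stand in for Turing machine numbers, and program length is dovetailed against a step budget.

    from itertools import count, product

    ALPHABET = '+-<>[].'              # Brainfuck without input: our "machine numbers"

    def run(prog, max_steps):
        """Run prog for at most max_steps steps; return its output if it
        halts within the budget, None otherwise (or if it is malformed)."""
        stack, match = [], {}
        for i, c in enumerate(prog):  # match brackets, reject unbalanced programs
            if c == '[':
                stack.append(i)
            elif c == ']':
                if not stack:
                    return None
                j = stack.pop()
                match[i], match[j] = j, i
        if stack:
            return None
        tape, ptr, ip, out = [0] * 1024, 0, 0, []
        for _ in range(max_steps):
            if ip >= len(prog):
                break                                   # halted within the budget
            c = prog[ip]
            if c == '>':   ptr = (ptr + 1) % len(tape)
            elif c == '<': ptr = (ptr - 1) % len(tape)
            elif c == '+': tape[ptr] = (tape[ptr] + 1) % 256
            elif c == '-': tape[ptr] = (tape[ptr] - 1) % 256
            elif c == '.': out.append(chr(tape[ptr]))
            elif c == '[' and tape[ptr] == 0: ip = match[ip]
            elif c == ']' and tape[ptr] != 0: ip = match[ip]
            ip += 1
        return ''.join(out) if ip >= len(prog) else None

    def compress(target):
        """Dovetailed search: at stage n, try every program of length <= n
        for at most n steps; the first hit witnesses a (time-constrained)
        Kolmogoroff complexity of target."""
        for bound in count(1):
            for length in range(1, bound + 1):
                for chars in product(ALPHABET, repeat=length):
                    prog = ''.join(chars)
                    if run(prog, bound) == target:
                        return prog

    print(compress(chr(3)))           # finds '+++.' almost instantly

Finding a program for one small byte is quick; anything the size of “Hello, world!” is hopeless, since the search space grows exponentially with program length.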
Instead, practical compression schemes restrict this one much further,
making the decompression engine very limited (rather than a full Turing
machine, even one subject to a complexity tradeoff or limitation). The RFC 1951
(“deflate”) compression technique, used by gzip, is of this kind. Of course,
there are lossless compression techniques that are optimized for some
very specialized kinds of input data, for example FLAC for lossless compression of sounds.
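The impossibility of compressing random data is easy to see experimentally, using zlib (an implementation of the RFC 1951 “deflate” format just mentioned):

    import os, zlib

    random_data = os.urandom(100_000)          # statistically random bytes
    structured  = b"Hello, world! " * 7000     # highly repetitive text

    for name, data in (("random", random_data), ("structured", structured)):
        packed = zlib.compress(data, 9)        # maximum deflate effort
        print(f"{name:>10}: {len(data):7d} -> {len(packed):7d} bytes")

Typically the random sample comes out slightly larger than it went in (deflate adds a little framing overhead), while the repetitive one shrinks to a few hundred bytes.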
But let us turn to lossy compression. Of course one possible lossy
compression method would be to compress everything to the empty string
(totally lossy compression method), but that's hardly useful, except
possibly for compressing WebLogs. Normally the idea is that it
should be possible to determine, from the compressed representation of
the data, some approximation of the initial data that, to the eye, ear
or whatever, of the beholder, is essentially indistinguishable from
these initial data. The popular JPEG file format (which properly ought to be
called JFIF, actually) is of this kind, as are MP3 and many other audio (and
nearly all video) compression codecs. The value of a lossy compression
technique can be judged on two counts: one is objective (the amount of
compression) and the other (the quality of the
compressed-then-uncompressed representation, relative to the initial
sample) is either objective or subjective (the subjective measure
being typically the more interesting one, but also harder to work
with). Of course, compression and quality depend, for a given
implementation, on the sample chosen, and it is very delicate, if not
impossible, to compare two codecs, let alone two standards.
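The objective side, at least, can be computed. As a small illustration (the choice of metric is mine, not the entry's), here is the peak signal-to-noise ratio, one commonly used objective measure of how far the compressed-then-uncompressed data stray from the original:

    import math

    def psnr(original, degraded, peak=255):
        """Peak signal-to-noise ratio, in decibels, between two equal-length
        sequences of 8-bit samples; higher means closer to the original."""
        assert len(original) == len(degraded)
        mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
        if mse == 0:
            return float("inf")       # bit-identical round trip (lossless)
        return 10 * math.log10(peak ** 2 / mse)

    # e.g. psnr(original_image.tobytes(), decoded_image.tobytes())

Such figures correlate only loosely with what an eye or an ear actually perceives, which is exactly why the subjective measure is the more interesting one.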
At this point I should introduce in my thoughts another property of
compression standards, which in my opinion is never sufficiently
emphasized: that of being “determined” or
“undetermined”. Let us say that compression is
determined when given input data will always (per standard)
be compressed as the same binary representation,
undetermined otherwise. Similarly, decompression
can be determined or undetermined according to whether a given (compressed)
representation always gives (exactly) the same data on
decompression. A lossless compression standard must be
determined on decompression, but the converse is not true, and being
determined on compression is quite independent of being lossless
(actually, there are probably no compression standards whatsoever that
are determined on compression, or just very trivial ones, for it would
be rather silly). To take an example, the “deflate”
compression standard is determined on decompression, as a given
compressed binary representation will always deflate to exactly the
same initial data, but it is certainly not determined on compression,
and in fact the gzip
program has various options
controlling the compression-ratio-versus-speed tradeoff, which
produce different compressed representations on the same initial
sample. Lossy compression formats can be either determined or
undetermined on decompression: it is harder to specify things without
making the standard determined, but things may be determined only up to,
for example, roundoff errors (within prescribed bounds). I have no
idea, to give one example, whether a given JPEG image
file always determines exactly the same pixel values no matter which
decoder is used. When either compression or decompression is
undetermined, care must be taken to differentiate the format
(or standard) and the codec (or
implementation of the standard) which conforms to it. It is
trivial, for example, to produce an implementation of the
“deflate” compressor which does not compress at all, and
the fact that it is worthless does not prove that “deflate” is
worthless. Of course, for lossy compression, undeterminacy on
compression can arise in two different ways: one is that the loss
(the result of compression followed by decompression) might not be determined,
and the other is that even when the result of decompression is fixed
(as it necessarily is for lossless techniques), compression might
still not be determined.
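To illustrate with “deflate” (here through Python's zlib module, my choice of tool): two compressors of different thoroughness may emit different bit streams for the same input, yet both streams inflate back to exactly the same data.

    import zlib

    data = ", ".join(str(n) for n in range(20000)).encode()   # any sample will do

    fast = zlib.compress(data, 1)    # quick-and-dirty deflate encoding
    best = zlib.compress(data, 9)    # slower, more thorough encoding

    print(fast == best)                    # typically False: two different encodings
    print(len(fast), len(best))            # the thorough one is usually smaller
    print(zlib.decompress(fast) == data)   # True: decompression is determined
    print(zlib.decompress(best) == data)   # True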
Now I discuss another property of compression methods which is
rarely mentioned, yet which I find of the highest importance. Start
with some initial data, compress them; then uncompress them, and
compress them again. Compare the results of the two compressions.
Obviously, if we're talking lossless compression, and provided the
compressor is deterministic (which is certainly the case if
compression is determined), they must be identical (there might be a
difference in timestamp or something unimportant of the kind). For
lossy compression, this is not necessarily the case: let us say that
the compression is idempotent when it is. We can also say
that this compression is a “projector”: it projects a
given data sample to the “closest” (in an adequate, and
perhaps subjective, sense) sample which can result from uncompressing
some compressed representation. This is an eminently desirable
property, because it implies that if the same data are repeatedly
compressed and uncompressed, no further quality loss will be incurred
than the initial quality loss on the first compression. Perhaps this
is asking too much: but at least we can demand that the act of
iteratively compressing and uncompressing does not diverge, but rather
converges to some data which are not too badly degraded with
respect to the initial data. So besides the quality of a codec (which
is the residual quality, for a given sample, after one compression and
decompression) we can also define its stable (or
asymptotic) quality, which is the quality after an infinite
number of compressions and decompressions (provided this converges,
otherwise take the limit inferior of the qualities in the sequence).
Sadly, the stable quality is rarely mentioned, despite being (in my
opinion) a very important property of a given codec.
I have experimented with a given implementation of
JPEG (not JPEG2000) and have
discovered that (for this given implementation, at least), as the
quality of compression goes up, the asymptotic quality goes
down! That is, performing a million
compression-uncompression cycles on an image with JPEG
quality set to 30% (a very low figure) will do essentially no more
damage than two cycles would (and might give a pixel-for-pixel identical
result, in fact), whereas performing a million cycles with quality set
to 100% (a gross lie, by the way, because even at 100%
JPEG is not lossless) degrades the image a million times,
with a much poorer final result (it is intensely blurred).
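For anyone who wants to repeat the experiment, here is roughly how it can be done, assuming the Pillow imaging library and some input file photo.png (both of which are my choices, not part of the original setup):

    import io
    from PIL import Image, ImageChops

    def recompression_cycles(img, quality, cycles=50):
        """JPEG-encode and decode img repeatedly; return the cycle at which
        the pixels stop changing, or None if they are still drifting."""
        current = img.convert("RGB")
        for i in range(1, cycles + 1):
            buf = io.BytesIO()
            current.save(buf, format="JPEG", quality=quality)
            buf.seek(0)
            nxt = Image.open(buf).convert("RGB")
            if ImageChops.difference(current, nxt).getbbox() is None:
                return i              # pixel-for-pixel fixed point reached
            current = nxt
        return None

    img = Image.open("photo.png")     # hypothetical input image
    for q in (30, 75, 95):
        print(q, recompression_cycles(img, q))

Whether the low-quality settings really reach a fixed point faster than the high-quality ones will of course depend on the image and on the particular JPEG implementation used.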
Of course, even asymptotic quality might not be a sufficient
criterion. Ideally one would want the following: that taking the
data, performing some kind of sensible transformation on them (such
as scaling, rotating, resampling, mixing, whatever), compressing,
uncompressing, performing another transformation, compressing,
uncompressing, performing yet another transformation, and so on,
should in the end give a quality similar to that which would have been
obtained by performing all transformations on the initial data,
compressing and uncompressing only once. This means that,
essentially, the compression method is such that the data can be
reworked in compressed form. In practice this is never true,
except if compression is really lossless, and that is the best
argument for preferring lossless compression formats: if you use a
lossy compression format, and every day you rework some image, or
remix some sound sample, you'll be losing quality each day when
compressing and uncompressing. So even if at each step the quality
loss is imperceptible to the eye, or ear, in the end it might be
horrible. Alas, there is no real cure for this but to use lossless
compression formats, or to bear the quality loss (and perform as few
as possible uncompress-rework-recompress cycles, so as to minimize
it). I suppose for video the pain must be borne to some extent
(storing two hours of uncompressed video at 25fps, resolution 720×576, say, requires 200GB of
storage, which is now becoming acceptable, but it's still rather
high). Still, one would want lossy compression formats to be at least
relatively “well-behaved” with respect to quality loss
when doing multiple compressions. I'm not sure how far this can be
achieved.
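The generational loss can be made visible with the same kind of sketch (again assuming Pillow and a hypothetical photo.png): apply a mild rework, say a small resize down and back up, twenty times with a JPEG round trip after each, and compare with doing all the reworks first and compressing only once.

    import io
    from PIL import Image

    def jpeg_roundtrip(img, quality=85):
        """Encode to JPEG in memory and decode again."""
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        return Image.open(buf).convert("RGB")

    def rework(img):
        """A mild, 'sensible' transformation: shrink slightly and scale back."""
        w, h = img.size
        return img.resize((w - 4, h - 4)).resize((w, h))

    def mse(a, b):
        """Mean squared error between two images of identical size and mode."""
        pa, pb = a.tobytes(), b.tobytes()
        return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)

    original = Image.open("photo.png").convert("RGB")   # hypothetical input
    N = 20

    daily = original                  # rework and recompress every "day"
    for _ in range(N):
        daily = jpeg_roundtrip(rework(daily))

    reference = original              # all reworks done losslessly
    for _ in range(N):
        reference = rework(reference)
    once = jpeg_roundtrip(reference)  # then a single compression

    print("daily cycles:", mse(reference, daily))   # typically much larger
    print("single pass :", mse(reference, once))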
Another feasible direction would be to demand the possibility of
reworking the data, to some extent, directly on their compressed
representation. For example, in a compressed image, one might like to
be able to crop or resize the image directly on the compressed
representation. Also, if the compression format supports several
“qualities”, one would typically want to be able to obtain
a lower-quality (smaller) compression from a higher-quality one in a
way that gives just as good a quality-versus-size tradeoff as if one had
proceeded from the initial (uncompressed) data: this is known as
bit-peeling, and it is another eminently desirable property
of a codec that it support bit-peeling. (To emphasize, this is not
nearly as stringent a requirement as that which consists of asking
that when the data are uncompressed and recompressed at a different
quality, there be no further quality loss than that of the lower
quality compression.) Unfortunately, I know of very few lossy
compression formats that are bit-peelable: I have been told that
JPEG2000 is such, and that Ogg Vorbis is in
theory (though what is meant by that is dubious), but I don't
know much more.
I think I'll stop here, otherwise I'll start ranting again against
stupid patents on compression techniques. Incidentally, I should
mention that Ogg (not Vorbis, mind) is now official as RFC 3533.