David Madore's WebLog: Some random thoughts on lossy compression methods

[This entry is a bit long. Maybe I should consider removing it from this 'blog and putting it on a page of its own, just like I did for my countries page. To jump directly to the next entry, click here.]

I was thinking, the other day, about lossy compression techniques (of sound, images, video and so on) because I had been reading yet another endless discussion (I forget where) as to whether Ogg Vorbis is, or not, better than MP3 / WMA / whatever, and I'd like to share a few of a (pure) mathematician's thoughts.

The point of a compression method is to take some initial data (an image, piece of sound, video, or whatever) and represent them as a sequence of bits (the compressed representation) in a way that takes less bits (in typical situations) than the “naïve” way of representing the data. If the compressed representation determines the initial data exactly then the compression is called lossless, otherwise it is lossy.

It is a (pretty obvious) theorem that there is no generic lossless compression technique: precisely, there is no way to compress random data (in the statistical sense, that is)—a fact which did not prevent some idiots from trying to patent various techniques for compressing random data, but no matter. So all forms of compression depend on particularities of the input data: and, indeed, we rarely store completely random data (though encrypted data are sometimes stored, and, in the absence of the knowledge of the key, it is essentially random if the cipher is of any value—this is why data should always be compressed before they are encrypted, because afterward it is too late). The best thing that comes to a generic compression method would be to store the smallest number of a Turing machine that produces the data considered when starting from a blank string: it is essentially the best possible because it compresses any amount of data precisely down to its Kolmogoroff complexity (essentially this is the very definition of Kolmogoroff complexity). Unfortunately, this computation method is not computable (even in theory), because the halting problem is undecidable. A theoretically possible method consists of choosing some diagonal enumeration of Turing machine numbers and limit times and performing the parallel execution following the chosen diagonal: this gives the subject-to-constraint Kolmogoroff complexity for the given choice of the diagonal enumeration, which is, essentially, a size-to-decompression-performance tradeoff. Unfortunately, even this theoretically possible best compression technique (which is trivial to implement and asymptotically outperforms any other known compression technique, with the desirable tradeoff choice) is unusable in practice because it would take just about forever to compress “Hello, world!” (and probably not be too good at it even). Instead, further approximations are used on this scheme which make the decompression engine very limited (rather than being a full Turing machine, even with a complexity tradeoff / limitation). The RFC 1951 (“deflate”) compression technique, used by gzip, is of this kind. Of course, there are lossless compression techniques that are optimized for some very specialized kinds of input data, for example FLAC for lossless compression of sounds.

But let us turn to lossy compression. Of course one possible lossy compression method would be to compress everything to the empty string (totally lossy compression method), but that's hardly useful, except possibly for compressing WebLogs. Normally the idea is that it should be possible to determine, from the compressed representation of the data, some approximation of the initial data that, to the eye, ear or whatever, of the beholder, is essentially indistinguishable from these initial data. The popular JPEG file format (which properly ought to be called JFIF, actually) is of this kind, and so is MP3 and many other audio (and nearly all video) compression codecs. Value of a lossy compression technique can be judged on two counts: one is objective (the amount of compression) and another (the quality of the compressed-then-uncompressed representation, relative to the initial sample) is either objective or subjective (the subjective measure being typically the more interesting one, but also harder to work with). Of course, compression and quality depend, for a given implementation, on the sample chosen, and it is very delicate, of not impossible, to compare two codecs, let alone two standards.

At this point I should introduce in my thoughts another property of compression standards, which in my opinion is never sufficiently emphasized: that of being “determined” or “undetermined”. Let us say that compression is determined when given input data will always (per standard) be compressed as the same binary representation, undetermined otherwise. Similarly, decompression can be determined or undetermined according as a given (compressed) representation will always give (exactly) the same data on decompression. A lossless compression standard must be determined on decompression, but the converse is not true, and being determined on compression is quite independent from being lossless (actually, there are probably no compression standards whatsoever that are determined on compression, or just very trivial ones, for it would be rather silly). To take an example, the “deflate” compression standard is determined on decompression, as a given compressed binary representation will always deflate to exactly the same initial data, but it is certainly not determined on compression, and in fact the gzip program has various options controlling the quality-versus-speed tradeoff on compression, which produce different compressed representations on the same initial sample. Lossy compression formats can be either determined or undetermined on decompression: it is harder to specify things if not by making the standard determined, but things may be determined, for example, up to roundoff errors (within prescribed bounds). I have no idea, to give one example, whether a given JPEG image file always determines exactly the same pixel values no matter which decoder is used. When either compression or decompression is undetermined, care must be taken to differentiate the format (or standard) and the codec (or implementation of the standard) which conforms to it. It is trivial, for example, to produce an implementation of the “deflate” compressor which does not compress at all, and that this is worthless does not prove that “deflate” is worthless. Of course, for lossy compression, undeterminacy on compression comes from two different accounts: one is that the loss (the result of compression and decompression) might not be determined, and the other is that even when the result of decompression is fixed (as it necessarily is for lossless techniques), compression might still not be determined.

Now I discuss another property of compression methods which is rarely mentioned, yet which I find of the highest importance. Start with some initial data, compress them; then uncompress them, and compress them again. Compare the results of the two compressions. Obviously, if we're talking lossless compression, and provided the compressor is deterministic (which is certainly the case if compression is determined), they must be identical (there might be a difference in timestamp or something unimportant of the kind). For lossy compression, this is not necessarily the case: let us say that the compression is idempotent when it is. We can also say that this compression is a “projector”: it projects a given data sample to the “closest” (in an adequate, and perhaps subjective, sense) sample which can result from uncompressing some compressed representation. This is an eminently desirable property, because it implies that if the same data are repeatedly compressed and uncompressed, no further quality loss will be incurred than the initial quality loss on the first compression. Perhaps this is asking too much: but at least we can demand that the act of iteratively compressing and uncompressing does not diverge, but rather converges to some data which are not too badly degraded with respect to the initial data. So besides the quality of a codec (which is the residual quality, for a given sample, after one compression and decompression) we can also define its stable (or asymptotic) quality, which is the quality after an infinite number of compressions and decompressions (provided this converges, otherwise take the limes inferior of the qualities in the sequence). Sadly, the stable quality is rarely mentioned, despite being (in my opinion) a very important property of a given codec.

I have experimented with a given implementation of JPEG (not JPEG2000) and have discovered that (for this given implementation, at least), as the quality of compression goes up, the asymptotic quality goes down! That is, performing a million compression-uncompression cycles on an image with JPEG quality set to 30% (a very low figure) will do essentially no further damage than two cycles (and might give a pixel-for-pixel identical result, in fact), whereas performing a million cycles with quality set to 100% (a gross lie, by the way, because even at 100% JPEG is not lossless) degrades the image a million times, with a much poorer final result (it is intensely blurred).

Of course, even asymptotic quality might not be a sufficient criterion. Ideally one would want the following: that taking the data, performing some kind of sensible transformation on them (such as, scaling, rotating, resampling, mixing, whatever), compressing, uncompressing, performing another transformation, compressing, uncompressiong, performing yet another transformation, and so on, should in the end give a quality similar to that which would have been obtained by performing all transformations on the inital data, compressing and uncompressing only once. This means that, essentially, the compression method is such that the data can be reworked on the compressed format. In practice this is never true, except if compression is really lossless, and that is the best argument for preferring lossless compression formats: if you use a lossy compression format, and every day you rework some image, or remix some sound sample, you'll be losing quality each day when compressing and uncompressing. So even if at each step the quality loss is imperceptible to the eye, or ear, in the end it might be horrible. Alas, there is no real cure for this but to use lossless compression formats, or to bear the quality loss (and perform as few as possible uncompress-rework-recompress cycles, so as to minimize it). I suppose for video the pain must be borne to some extent (storing two hours of video at 25fps, resolution 720×576, say, requires 200GB of storage, which is now becoming acceptable, but it's still rather high). Still, one would want lossy compression formats to be at least relatively “well-behaved” in the respect of quality loss when doing multiple compressions. I'm not sure how far this can be achieved.

Another feasible direction would be to demand the possibility of reworking the data, to some extent, directly on their compressed representation. For example, in a compressed image, one might like to be able to crop or resize the image directly on the compressed representation. Also, if the compression format supports several “qualities”, one would typically want to be able to obtain a lower-quality (smaller) compression from a higher-quality one in a way that gives just as good a quality-versus-size tradeoff as had one proceeded from the initial (uncompressed) data: this is known as bit-peeling, and it is another eminently desirable property of a codec that it support bit-peeling. (To emphasize, this is not nearly as stringent a requirement as that which consists of asking that when the data are uncompressed and recompressed at a different quality, there be no further quality loss than that of the lower quality compression.) Unfortunately, I know of very few lossy compression formats that are bit-peelable: I have been told that JPEG2000 is such, and that Ogg Vorbis is in theory (though what is meant by that is dubious), but I don't know much more.

I think I'll stop here, otherwise I'll start ranting again against stupid patents on compression techniques. Incidentally, I should mention that Ogg (not Vorbis, mind) is now official as RFC 3533.