Introduction to Data Compression

At first glance, compressing data to fit into fewer bits seems to be an improbable or even impossible goal. Given an 8-bit value, for example, how can that same value be expressed in less than 8 bits? In general, of course, it cannot. Data compression must be based on something other than simply "representing message values using fewer bits." This page explores why data compression can work, and under what conditions it is possible.

What is Data Compression?

Data compression can be described as "representing the information in a message using fewer bits." The information conveyed by a sequence of bits depends on what those bits represent, and that information may be expressible using fewer bits.

For example, a 1000 bit long message which is all zeros does not convey very much information. The sum total of the information can be conveyed by the single bit '0' representing the state of the signal, or perhaps by the text '1000(0)' to completly describe every bit in the message. In either case, the same information is conveyed using much less than 1000 bits.

Forms of data compression are used in different ways in our everyday lives. We often use acronyms and abbreviations to represent ideas and information using fewer characters than might otherwise be required. Given an communications acronym such as "ITU", or an abbreviation such as "datacomm" the full text can be extracted from the 'compressed' version any time it is required.

In mathematics, we use symbols and notations to express (long) numbers in 'compressed' form. The fraction 22 / 7 gives a result which has an infinite number of digits, but which can be fully described as:

Structured data can be compressed.

The result can be expressed in this way because there is some obvious structure to the repeating decimal portion.

In contrast, the number pi is irrational. It has no obvious structure and it can only be written by specifying each and every digit:

pi = 3.14159265358979323846264338327950288419716939937510...
Uncompressible data appears random.

The Structure Inherent in Various Types of Data

Many types of data have some structure that can be exploited to compress the data.

Fax Transmission : pages having various solid areas, often a lot of white in the background, not much difference between one horizontal scan line and the next.
English text : primarily alphanumeric characters and spaces, some letters occur more frequently than others. Certain letter combinations ('ch','qu','tr','and','the','ing') occur with some regularity.
English dictionary : like English text, plus the words are in alphabetical order.
Sampled Music File : consecutive samples tend to follow smooth variations in signal amplitude, periods of silence in which sample vaues remain constant.

In general, the more structure that can be found, the more compressible is the data. Compressed data is less structured than the original, appears more random, and is less amenable to further compression. It follows that a file cannot simply be compressed over and over again to make it shorter and shorter. Once the structure is gone, no more compression is possible.

Random data is not compressible. Like the value for pi, it can only be expressed by explicity specifying each and every bit.

Certain data compression methods may be designed which work well with specific types of data and not with others. In fact, applying a given data compression algorithm to unstructured data can actually lead to a file which is longer than the original.

Lossy and Lossless Compression

Data compression techniques can be divided into two categories as: "lossy" and "lossless".

Lossless Compression - Many applications demand that if a file is compressed and then de-compressed, the result should exactly match the original bit-for-bit. Lossless compression is necessary when compressing text messages and binary program code.

Lossy Compression - In some applications, a certain amount of 'loss' is tolerable if a file is compressed and then de-compressed. Lossy compression may be suitable for photographs, voice, music and video, where slightly fuzzy or distorted data may be acceptable or unnoticable to a human observer. In this case, much greater compression is possible compared to lossless compression.

It is not always possible to know in advance what type of data is to be compressed. Generalized data compression techniques must be lossless and must discover any structure by inspection of the file to be compressed. Two such methods are MNP5 and the ITU-T standard V.42 bis.

In the next section, compression of an English language dictionary is explored. - NEXT