For example, a 1000 bit long message which is all zeros does not convey very much information. The sum total of the information can be conveyed by the single bit '0' representing the state of the signal, or perhaps by the text '1000(0)' to completly describe every bit in the message. In either case, the same information is conveyed using much less than 1000 bits.
Forms of data compression are used in different ways in our everyday lives. We often use acronyms and abbreviations to represent ideas and information using fewer characters than might otherwise be required. Given an communications acronym such as "ITU", or an abbreviation such as "datacomm" the full text can be extracted from the 'compressed' version any time it is required.
In mathematics, we use symbols and notations to express (long) numbers in 'compressed' form. The fraction 22 / 7 gives a result which has an infinite number of digits, but which can be fully described as:
The result can be expressed in this way because there is some obvious structure to the repeating decimal portion.
In contrast, the number pi is irrational. It has no obvious structure and it can only be written by specifying each and every digit:
|
In general, the more structure that can be found, the more compressible is the data. Compressed data is less structured than the original, appears more random, and is less amenable to further compression. It follows that a file cannot simply be compressed over and over again to make it shorter and shorter. Once the structure is gone, no more compression is possible.
Random data is not compressible. Like the value for pi, it can only be expressed by explicity specifying each and every bit.
Certain data compression methods may be designed which work well with specific types of data and not with others. In fact, applying a given data compression algorithm to unstructured data can actually lead to a file which is longer than the original.
Lossless Compression - Many applications demand that if a file is compressed and then de-compressed, the result should exactly match the original bit-for-bit. Lossless compression is necessary when compressing text messages and binary program code.
Lossy Compression - In some applications, a certain amount of 'loss' is tolerable if a file is compressed and then de-compressed. Lossy compression may be suitable for photographs, voice, music and video, where slightly fuzzy or distorted data may be acceptable or unnoticable to a human observer. In this case, much greater compression is possible compared to lossless compression.
It is not always possible to know in advance what type of data is to be compressed. Generalized data compression techniques must be lossless and must discover any structure by inspection of the file to be compressed. Two such methods are MNP5 and the ITU-T standard V.42 bis.
In the next section, compression of an English language dictionary is explored. - NEXT