Practical Data Compression

Successful data compression depends on the presence of structure or patterns in the file to be compressed. Files containing English text as ASCII characters possess many features which allow compression. Moreover, an alphabetical listing of English words (such as might be used by a spell checking program) has structure which begs for compression. In this section, it will be shown that a dictionary listing English words can easily be compressed to less than 1/3 its original file size. 

Compression of English Text

English text is distinguishable from random characters by the fact that certain letters occur much more frequently than others. (This is reflected in the points assigned to various letters in certain board games.) This observation can form the basis for a compression scheme in which letters are represented by variable length codewords, where the shortest codewords are used for the most frequently occurring characters.

This problem was identified by Samuel Morse in defining a very early digital coding scheme. In the Morse Code, telegraph messages are sent using combinations of 'dots' and 'dashes'. Morse reasoned that telegraph operators could work more efficiently if the common letters could be sent more quickly than the others. In the table below showing the International Morse Code, the letters 'E' and 'T' are sent as single symbols, while 'Z' and 'Q' and 'J' each require four symbols to transmit. This code can be compared to a more recent character code by noting that in ASCII the same number of bits are used for every character.

The Morse Code = Digital Compression
A: .-     B: -...   C: -.-.   D: -..    E: .    
F: ..-.   G: --.    H: ....   I: ..     J: ---.
K: -.-    L: .-..   M: --     N: -.     O: ---
P: .--.   Q: --.-   R: ._.    S: ...    T: -
U: .._    V: ...-   W: .--    X: -..-   Y: -.--   Z: --..
The code used by Morse illustrates a fundamental principle of data compression. Here, different codewords are assigned to letters such that the shortest codewords are assigned to the most frequently occurring letters. If the dots and dashes are written as binary 1's and 0's, then these alphabet characters can be represented using between 1 and 4 bits. There are two 1-bit codewords {E,T}, four 2-bit codewords {I,A,N,M}, eight 3-bit codewords {S,U,R,W,D,K,G,O}, and sixteen 4-bit codewords, of which only twelve are needed to encode the remaining alphabet words, as shown below:
 1-BIT   2-BITS   3-BITS   4-BITS 
E,T  I,A,N,M   S,U,R,W,D,O,G,K   H,V,F,L,P,J,B,X,C,Z,Q,Y 
Fewer bits are used for the most common letters.

The above scheme might be useful in a modern digital communications system, except that is difficult to identify the beginning and end of characters in a binary stream. In a telegraph system, pauses are inserted between characters to separate the codewords.

 S   E  N    D      M    O    R   E     M    O   N   E   Y
...  .  -.  _..     --  ---  -.-  .     --  ---  -.  .  -.-- 
000  0  10  100     11  111  101  0     11  111  10  0  1011
unfortunately, as a binary stream, the message "00001010011111100111111010111111001011" cannot be separated into characters.

In a binary data compression scheme, variable length codewords must be separated from each other. In the next example, the alphabet is sorted and a practical data compression scheme is described.

Note that the patterns present in English text go much deeper than single characters, and certain letter pairs ('TH', 'QU', 'CK' ) occur often enough that codes which recognized these patterns could be very efficient. Indeed, some words are so common that a few bits could encode entire strings of characters. In this example, only single letters will be coded. 

Compression with "4 of 8" Coding

English text can be compressed because of the structure inherent in words and sentences. Certain characters occur much more frequently than others {such as E,T}, while some characters are used only rarely {Q,Z}.

Divide the alphabet into the 13 most common and the 13 least common letters. Using the Morse distribution as a guide:

HEX (4-bits) 0 1 2 3 4 5 6 7 8 9 A B C D E F
ROW 1 E T I A N M S U R W D O G spc up end
ROW 2 K H V F L P J B X C Z Q Y .. up ..
If a letter is found in the upper row, it can be sent using only the corresponding 4-bit value. If a letter appears in the second row, that code will be sent as an 8-bit value by prefixing hex 'E' (up). Note that the space (spc) character is actually the most common character in text, since every word has an accompanying space. Most text characters will require only four bits. At worst, it takes eight bits for each letter. Using 4-bit units simplifies storage and decompression since two symbols can be stored per byte. Not suprisingly, typical English text messages will make heavy use of the upper row in the table:
S  E  N  D     M  O  R  E     M  O  N  E  Y
6  0  4  A  D  5  B  8  0  D  5  B  4  0  EC
By writing out each hex value in binary, the transmitted message becomes:

where it will (later) be easy to extract the message from the binary stream by counting 4-bit groups. The final message has 64 bits, compared to fifteen 8-bit (ASCII) characters = 120 bits (including spaces). As expected, the compressed file will be reduced to almost one half the original message size.

The above method illustrates one way in which text can be compressed. It would not be as effective for other languages unless the table was redefined. As a compression scheme, it is based purely on character frequency, and does not exploit the repeated string patterns found in a dictionary. A simple enhancement will greatly improve this technique.

Compression of the Dictionary

A dictionary of the English language contains perhaps 100,000 words arranged in alphabetical order. With an average word length of 9 letters, this dictionary would require almost 1 MB of storage using ASCII codes. Using only the above method for compressing text, this file would still be more than one half its original size. On the other hand, in an alphabetical word list, consecutive entries generally share many letters. Consider the following alphabetically listed words:
Structure can be exploited
for further compression.

When applying the letter compression scheme to this dictionary listing, the space word 'D' will not be used. Rather, the unused code 'F' can be employed to mean end-of-word, where 'F' will always be followed by a single hex value denoting how many letters to keep from the previous word. For example, the code 'F6' says "this word has ended, and keep six letters for the next word". A single hex values limits to 15 the maximum letters retained..

chromate F7ic F8n F7ogram FBph FDy F5e F5ic F6um F5osome
Keeping unchanged letters saves repetition.

The remaining letters may now be coded using the original table:

 c hromateF7i cF8nF7ogramFB p hFD yF5eF5i cF6umF5osome
This enhancement significantly improves the compression performance. These 10 words including 86 characters have now been encoded using only 27 bytes. Based on this small sampling, words having an average length of 8.6 letters would require 2.7 bytes each. By extension, a 100,000 word dictionary file might be stored in less than 300 KB.

This method resembles techniques used for video compression. In a video or motion picture film, consecutive frames generally change very little (sometimes not at all) and by sending only the changes from one frame to the next (as in the dictionary words) a digitized video image may be compressed significantly. Such compression techniques make possible the digital transmission of television by satellite and the storage of full length movies on DVD media.

One cost of using this compression method is that random access to dictionary words is not possible; the file must be read from the beginning to reconstruct all the words. Also note that single bit errors could have a disastrous effect on decompression. Since compression, by definition, exploits and removes redundancies, the problem of error control is common to all compression schemes.

General Purpose Data Compression

The lessons learned in this example can be applied wherever data compression is required. More general-purpose compression techniques must be used for arbitrary data. One general-purpose method which uses character frequency statistics is the Microcom Network Protocol 5 (MNP5).