Practical Data Compression
Successful data compression depends on the presence of structure or
patterns in the file to be compressed. Files containing English text as
ASCII characters possess many features which allow compression. Moreover,
an alphabetical listing of English words (such as might be used by a spell
checking program) has structure which begs for compression. In this section,
it will be shown that a dictionary listing English words can easily be
compressed to less than 1/3 its original file size.
Compression of English Text
English text is distinguishable from random characters by the fact that
certain letters occur much more frequently than others. (This is reflected
in the points assigned to various letters in certain board games.) This observation
can form the basis for a compression scheme in which letters are represented
by variable length codewords, where the shortest codewords are used for
the most frequently occurring characters.
Samuel Morse recognized this principle when defining a very early digital coding scheme. In the Morse
Code, telegraph messages are sent using combinations of 'dots' and 'dashes'.
Morse reasoned that telegraph operators could work more efficiently if
the common letters could be sent more quickly than the others. In the table
below showing the International Morse Code, the letters 'E' and
'T' are sent as single symbols, while 'Z' and 'Q' and 'J' each require
four symbols to transmit. This code can be compared to a more recent character
code by noting that in ASCII the same number of bits are used for every
character.
The Morse Code = Digital Compression
A: .-     B: -...   C: -.-.   D: -..    E: .
F: ..-.   G: --.    H: ....   I: ..     J: .---
K: -.-    L: .-..   M: --     N: -.     O: ---
P: .--.   Q: --.-   R: .-.    S: ...    T: -
U: ..-    V: ...-   W: .--    X: -..-   Y: -.--
Z: --..
The code used by Morse illustrates a fundamental principle of data compression.
Here, different codewords are assigned to letters such that the shortest
codewords are assigned to the most frequently occurring letters. If the
dots and dashes are written as binary 1's and 0's, then these alphabet
characters can be represented using between 1 and 4 bits. There are two
1-bit codewords {E,T}, four 2-bit codewords {I,A,N,M}, eight 3-bit codewords
{S,U,R,W,D,K,G,O}, and sixteen 4-bit codewords, of which only twelve are
needed to encode the remaining letters of the alphabet, as shown below:
1-BIT:  E,T
2-BITS: I,A,N,M
3-BITS: S,U,R,W,D,O,G,K
4-BITS: H,V,F,L,P,J,B,X,C,Z,Q,Y
Fewer bits are used for the most common letters.
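This length distribution can be checked mechanically. The sketch below (plain Python, with the standard International Morse table typed in as data) groups the letters by the number of symbols in their codes and confirms the 2/4/8/12 split:

```python
# International Morse Code, written as dot/dash strings.
MORSE = {
    'A': '.-',   'B': '-...', 'C': '-.-.', 'D': '-..',  'E': '.',
    'F': '..-.', 'G': '--.',  'H': '....', 'I': '..',   'J': '.---',
    'K': '-.-',  'L': '.-..', 'M': '--',   'N': '-.',   'O': '---',
    'P': '.--.', 'Q': '--.-', 'R': '.-.',  'S': '...',  'T': '-',
    'U': '..-',  'V': '...-', 'W': '.--',  'X': '-..-', 'Y': '-.--',
    'Z': '--..',
}

# Group the letters by codeword length (number of dots and dashes).
groups = {}
for letter, code in sorted(MORSE.items()):
    groups.setdefault(len(code), []).append(letter)

for length in sorted(groups):
    print(length, ','.join(groups[length]))
```

Running it shows two 1-symbol letters (E, T), four 2-symbol letters, eight 3-symbol letters, and twelve 4-symbol letters.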
The above scheme might be useful in a modern digital communications
system, except that it is difficult to identify the beginning and end of characters
in a binary stream. In a telegraph system, pauses are inserted between
characters to separate the codewords.
S    E  N   D    M   O    R    E  M   O    N   E  Y
...  .  -.  -..  --  ---  .-.  .  --  ---  -.  .  -.--
000  0  10  100  11  111  010  0  11  111  10  0  1011

Unfortunately, as a binary stream, the message "000010100111110100111111001011"
cannot be separated into characters.
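The ambiguity is easy to demonstrate in code. Writing each dot as a 0 and each dash as a 1 (as in the example above), two different messages can produce identical bit patterns, which is exactly why the stream cannot be separated without pauses. A minimal sketch (the function name is mine):

```python
# International Morse Code as dot/dash strings.
MORSE = {
    'A': '.-',   'B': '-...', 'C': '-.-.', 'D': '-..',  'E': '.',
    'F': '..-.', 'G': '--.',  'H': '....', 'I': '..',   'J': '.---',
    'K': '-.-',  'L': '.-..', 'M': '--',   'N': '-.',   'O': '---',
    'P': '.--.', 'Q': '--.-', 'R': '.-.',  'S': '...',  'T': '-',
    'U': '..-',  'V': '...-', 'W': '.--',  'X': '-..-', 'Y': '-.--',
    'Z': '--..',
}

def to_bits(text):
    """Write dots as '0' and dashes as '1'; spaces carry no code."""
    return ''.join(MORSE[c].replace('.', '0').replace('-', '1')
                   for c in text.upper() if c != ' ')

# 'M' (--) and 'TT' (- -) yield the same bits, so the stream is not
# uniquely decodable without separators between letters.
print(to_bits('M'), to_bits('TT'))    # both '11'
print(to_bits('SEND MORE MONEY'))     # 30 bits, no character boundaries
```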
In a binary data compression scheme, variable length codewords must
be separated from each other. In the next example, the alphabet is sorted
and a practical data compression scheme is described.
Note that the patterns present in English text go much deeper than
single characters; certain letter pairs ('TH', 'QU', 'CK') occur often
enough that codes which recognize these patterns can be very efficient.
Indeed, some words are so common that a few bits could encode entire strings
of characters. In this example, only single letters will be coded.
Compression with "4 of 8" Coding
English text can be compressed because of the structure inherent in
words and sentences. Certain characters occur much more frequently than
others (such as E and T), while some characters are used only rarely (Q, Z).
Divide the alphabet into the 13 most common and the 13 least common
letters, using the Morse distribution as a guide:
HEX:    0  1  2  3  4  5  6  7  8  9  A  B  C  D    E   F
ROW 1:  E  T  I  A  N  M  S  U  R  W  D  O  G  spc  up  end
ROW 2:  K  H  V  F  L  P  J  B  X  C  Z  Q  Y  ..   up  ..
If a letter is found in the upper row, it can be sent using only the corresponding
4-bit value. If a letter appears in the second row, its code will be sent
as an 8-bit value by prefixing hex 'E' (up). Note that the space (spc)
character is actually the most common character in text, since every word
has an accompanying space. Most text characters will require only four
bits; at worst, a letter takes eight bits. Using 4-bit units
simplifies storage and decompression, since two symbols can be stored per
byte. Not surprisingly, typical English text messages will make heavy use
of the upper row in the table:
S  E  N  D  spc  M  O  R  E  spc  M  O  N  E  Y
6  0  4  A  D    5  B  8  0  D    5  B  4  0  EC
By writing out each hex value in binary, the transmitted message becomes:
0110000001001010110101011011100000001101010110110100000011101100
where it will (later) be easy to extract the message from the binary
stream by counting 4-bit groups. The final message has 64 bits, compared
to fifteen 8-bit (ASCII) characters = 120 bits (including spaces). As expected,
the compressed message is just over one half the original message size.
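A minimal sketch of this "4 of 8" scheme in Python (the row strings are taken from the table above; the function names are my own). Row-1 characters become one hex digit, row-2 letters are sent behind the 'E' shift code, and the 'F' end code is not needed for plain text:

```python
ROW1 = "ETIANMSURWDOG"   # hex 0-C; hex D = space, E = shift, F = end
ROW2 = "KHVFLPJBXCZQY"   # reached by prefixing the shift code 'E'

def encode(text):
    """Encode text as a string of hex digits (4 bits each)."""
    out = []
    for ch in text.upper():
        if ch == ' ':
            out.append('D')                                # spc code
        elif ch in ROW1:
            out.append(format(ROW1.index(ch), 'X'))        # 4 bits
        elif ch in ROW2:
            out.append('E' + format(ROW2.index(ch), 'X'))  # 8 bits
    return ''.join(out)

def decode(nibbles):
    """Rebuild the text by walking the stream one hex digit at a time."""
    out, i = [], 0
    while i < len(nibbles):
        c = nibbles[i]
        if c == 'D':
            out.append(' '); i += 1
        elif c == 'E':                   # shift: next digit is row 2
            out.append(ROW2[int(nibbles[i + 1], 16)]); i += 2
        else:
            out.append(ROW1[int(c, 16)]); i += 1
    return ''.join(out)

msg = encode("SEND MORE MONEY")
print(msg, len(msg) * 4, "bits")     # 604AD5B80D5B40EC 64 bits
```

Because every unit is a whole 4-bit group, the decoder never loses its place in the stream, unlike the Morse bit stream above.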
The above method illustrates one way in which text can be compressed.
It would not be as effective for other languages unless the table was redefined.
As a compression scheme, it is based purely on character frequency, and
does not exploit the repeated string patterns found in a dictionary. A
simple enhancement will greatly improve this technique.
Compression of the Dictionary
A dictionary of the English language contains perhaps 100,000 words
arranged in alphabetical order. With an average word length of 9 letters,
this dictionary would require almost 1 MB of storage using ASCII codes.
Using only the above method for compressing text, this file would still
be more than one half its original size. On the other hand, in an alphabetical
word list, consecutive entries generally share many letters. Consider the
following alphabetically listed words:
chromate
chromatic
chromatin
chromatogram
chromatograph
chromatography
chrome
chromic
chromium
chromosome
Structure can be exploited for further compression.
When applying the letter compression scheme to this dictionary listing,
the space code 'D' will not be used. Instead, the unused code 'F' can be
employed to mean end-of-word, where 'F' will always be followed by a
single hex value giving the number of letters to keep from the previous
word. For example, the code 'F6' says "this word has ended; keep
six letters for the next word". A single hex value limits the maximum
number of retained letters to 15.
chromate   (first word, sent in full)
F7ic       (chromatic)
F8n        (chromatin)
F7ogram    (chromatogram)
FBph       (chromatograph)
FDy        (chromatography)
F5e        (chrome)
F5ic       (chromic)
F6um       (chromium)
F5osome    (chromosome)
Keeping unchanged letters saves repetition.
The remaining letters may now be coded using the original table.
Concatenating the ten coded entries gives a 54-digit (27-byte) hex stream:

chromate    F7ic   F8n  F7ogram  FBph    FDy   F5e  F5ic   F6um  F5osome
E9E18B5310  F72E9  F84  F7BC835  FBE5E1  FDEC  F50  F52E9  F675  F5B6B50
This enhancement significantly improves the compression performance. These
10 words, totalling 96 characters, have now been encoded using only 27
bytes. Based on this small sampling, words having an average length
of 9.6 letters require about 2.7 bytes each. By extension, a 100,000 word
dictionary file might be stored in less than 300 KB.
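The whole scheme fits in a few lines of Python (the tables and the 'F' + count convention come from the text; the function names are mine). Each word after the first is emitted as 'F', a hex digit giving the shared-prefix length, and the coded suffix:

```python
ROW1 = "ETIANMSURWDOG"   # hex 0-C; D = spc, E = shift, F = end-of-word
ROW2 = "KHVFLPJBXCZQY"   # reached by prefixing the shift code 'E'

def enc_letters(s):
    """Code each letter as one hex digit, or 'E' + digit for row 2."""
    return ''.join('E' + format(ROW2.index(c), 'X') if c in ROW2
                   else format(ROW1.index(c), 'X')
                   for c in s.upper())

def compress(words):
    out, prev = [], None
    for w in words:
        if prev is None:
            out.append(enc_letters(w))       # first word sent in full
        else:
            keep = 0                         # shared-prefix length, max 15
            while (keep < min(len(prev), len(w), 15)
                   and prev[keep] == w[keep]):
                keep += 1
            out.append('F' + format(keep, 'X') + enc_letters(w[keep:]))
        prev = w
    return ''.join(out)

words = ["chromate", "chromatic", "chromatin", "chromatogram",
         "chromatograph", "chromatography", "chrome", "chromic",
         "chromium", "chromosome"]
stream = compress(words)
print(stream, len(stream) // 2, "bytes")   # 54 hex digits = 27 bytes
```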
This method resembles techniques used for video compression. In a video
or motion picture film, consecutive frames generally change very little
(sometimes not at all) and by sending only the changes from one
frame to the next (as in the dictionary words) a digitized video image
may be compressed significantly. Such compression techniques make possible
the digital transmission of television by satellite and the storage of
full length movies on DVD media.
One cost of using this compression method is that random access to dictionary
words is not possible; the file must be read from the beginning to reconstruct
all the words. Also note that single bit errors could have a disastrous
effect on decompression. Since compression, by definition, exploits and
removes redundancies, the problem of error control is common to all compression
schemes.
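The error sensitivity is easy to demonstrate. The decoder sketched below (names are mine) rebuilds the ten words from the 27-byte stream shown earlier, then repeats the process after a single flipped bit turns the first keep-count 'F7' into 'F6' (7 = 0111, 6 = 0110). Because each word is defined relative to the previous one, the one wrong bit corrupts five consecutive words:

```python
ROW1 = "ETIANMSURWDOG"   # hex 0-C; D = spc, E = shift, F = end-of-word
ROW2 = "KHVFLPJBXCZQY"

def decompress(hexstr):
    """Rebuild the word list from the compressed hex stream."""
    words, base, cur, i = [], "", "", 0
    while i < len(hexstr):
        c = hexstr[i]
        if c == 'F':                     # end of word + keep count
            words.append(base + cur)
            base = (base + cur)[:int(hexstr[i + 1], 16)]
            cur, i = "", i + 2
        elif c == 'E':                   # shift: next digit is row 2
            cur += ROW2[int(hexstr[i + 1], 16)].lower()
            i += 2
        else:
            cur += ROW1[int(c, 16)].lower()
            i += 1
    words.append(base + cur)             # last word has no trailing 'F'
    return words

clean = "E9E18B5310F72E9F84F7BC835FBE5E1FDECF50F52E9F675F5B6B50"
bad = clean.replace("F7", "F6", 1)       # one flipped bit: 0111 -> 0110

good_words = decompress(clean)
bad_words = decompress(bad)
print(good_words[1], "->", bad_words[1])            # chromatic -> chromaic
print(sum(g != b for g, b in zip(good_words, bad_words)), "words damaged")
```

The damage only stops when a later word happens to keep fewer letters than the corrupted prefix; in a real dictionary file a single bit error could garble a long run of entries.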
General Purpose Data Compression
The lessons learned in this example can be applied wherever data compression is required.
- To be compressed, data must possess some inherent patterns or redundancies.
  English text fits this description.
- The more structure, the more compression can be expected. The dictionary
  listing showed this well.
- Some compression algorithms work well for some data but not for others.
  This method (with the same character table) would be less efficient for
  languages other than English, and it would be unsuitable for general-purpose
  compression.
- Compressed data is more susceptible to bit errors. A single bit error
  can have far-reaching consequences.
More general-purpose compression techniques must be used for arbitrary
data. One general-purpose method which uses character frequency statistics
is the Microcom Network Protocol 5 (MNP5).