ECE4253 Digital Communications | |
Department of Electrical and Computer Engineering - University of New Brunswick, Fredericton, NB, Canada | |
This problem was identified by Samuel Morse in defining a very early digital coding scheme. In the Morse
Code, telegraph messages are sent using combinations of 'dots' and 'dashes'.
Morse reasoned that telegraph operators could work more efficiently if
the common letters could be sent more quickly than the others. In the table
below showing the International Morse Code, the letters 'E' and
'T' are sent as single symbols, while 'Z' and 'Q' and 'J' each require
four symbols to transmit. This code can be compared to a more recent character
code by noting that in ASCII the same number of bits are used for every
character.
A: .- B: -... C: -.-. D: -.. E: . F: ..-. G: --. H: .... I: .. J: ---. K: -.- L: .-.. M: -- N: -. O: --- P: .--. Q: --.- R: ._. S: ... T: - U: .._ V: ...- W: .-- X: -..- Y: -.-- Z: --.. |
1-BIT | 2-BITS | 3-BITS | 4-BITS |
E,T | I,A,N,M | S,U,R,W,D,O,G,K | H,V,F,L,P,J,B,X,C,Z,Q,Y |
The above scheme might be useful in a modern digital communications system, except that is difficult to identify the beginning and end of characters in a binary stream. In a telegraph system, pauses are inserted between characters to separate the codewords.
S E N D M O R E M O N E Y ... . -. _.. -- --- -.- . -- --- -. . -.-- 000 0 10 100 11 111 101 0 11 111 10 0 1011 |
In a binary data compression scheme, variable length codewords must be separated from each other. In the next example, the alphabet is sorted and a practical data compression scheme is described.
Note that the patterns present in English text go much deeper than single characters, and certain letter pairs ('TH', 'QU', 'CK' ) occur often enough that codes which recognized these patterns could be very efficient. Indeed, some words are so common that a few bits could encode entire strings of characters. In this example, only single letters will be coded.
Divide the alphabet into the 13 most common and the 13 least common letters. Using the Morse distribution as a guide:
HEX (4-bits) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
ROW 1 | E | T | I | A | N | M | S | U | R | W | D | O | G | spc | up | end |
ROW 2 | K | H | V | F | L | P | J | B | X | C | Z | Q | Y | .. | up | .. |
S E N D M O R E M O N E Y 6 0 4 A D 5 B 8 0 D 5 B 4 0 EC |
where it will (later) be easy to extract the message from the binary stream by counting 4-bit groups. The final message has 64 bits, compared to fifteen 8-bit (ASCII) characters = 120 bits (including spaces). As expected, the compressed file will be reduced to almost one half the original message size.
The above method illustrates one way in which text can be compressed. It would not be as effective for other languages unless the table was redefined. As a compression scheme, it is based purely on character frequency, and does not exploit the repeated string patterns found in a dictionary. A simple enhancement will greatly improve this technique.
chromate chromatic chromatin chromatogram chromatograph chromatography chrome chromic chromium chromosome |
When applying the letter compression scheme to this dictionary listing,
the space word 'D' will not be used. Rather, the unused code 'F' can be
employed to mean end-of-word, where 'F' will always be followed by a
single hex value denoting how many letters to keep from the previous
word. For example, the code 'F6' says "this word has ended, and keep
six letters for the next word". A single hex values limits to 15 the maximum
letters retained..
The remaining letters may now be coded using the original table:
123456789-123456789-123456789-123456789-123456789-1234 E9E18B5310F72E9F84F7BC835FBE5E1FDECF50F52E9F675F5B6B50 c hromateF7i cF8nF7ogramFB p hFD yF5eF5i cF6umF5osome |
This method resembles techniques used for video compression. In a video or motion picture film, consecutive frames generally change very little (sometimes not at all) and by sending only the changes from one frame to the next (as in the dictionary words) a digitized video image may be compressed significantly. Such compression techniques make possible the digital transmission of television by satellite and the storage of full length movies on DVD media.
One cost of using this compression method is that random access to dictionary words is not possible; the file must be read from the beginning to reconstruct all the words. Also note that single bit errors could have a disastrous effect on decompression. Since compression, by definition, exploits and removes redundancies, the problem of error control is common to all compression schemes.
Sat Apr 20 08:51:21 ADT 2024
Last Updated: 09 NOV 1998 |
Richard Tervo [ tervo@unb.ca ] | Back to the course homepage... |