In Dictionary Compression, fewer bits are used for letters that are expected more frequently, yet each letter still has its own codeword. In contrast, arithmetic encoding conceptually represents an entire message as a single real number in the range (0,1), carried to as many fractional digits as the message requires. Consider the following ten-symbol message:

message = [ 3 3 1 3 3 3 3 3 2 3 ]

Using 2 bits per symbol implicitly allows four symbols {0,1,2,3}. Already this is inefficient for this message, since the symbol {0} is never used. On the other hand, there are 3^{10} = 59049 possible ten-symbol messages using only the set {1,2,3}, which suggests that 16 bits are sufficient to represent all possible ten-symbol messages (because 2^{16} = 65536). However, if the three symbols {1,2,3} do not occur with equal frequency, then further optimization is possible.

For example, if it is known that all 10-symbol messages contain exactly eight {3}s, one {2}, one {1}, and no {0}s, then there are only 10!/(8! 1! 1!) = 90 possible messages, which can be coded in 7 bits (since 2^{7} = 128). More realistically, if statistics show that in typical messages the symbol {3} occurs with a *probability* of 0.8, this information can be used to reduce the number of bits required to represent a similar message of any length.
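These counting arguments are easy to check numerically. The following MATLAB sketch (MATLAB is the language of the coding examples at the end of this section) verifies the two counts and also computes the Shannon entropy of the {0.1, 0.1, 0.8} statistics, a standard lower bound on the average number of bits needed; the entropy bound is background knowledge rather than something derived in the text.

```matlab
% Numerical checks of the counting arguments above.
log2(3^10)                      % = 15.85..., so 16 bits cover all 59049 messages
factorial(10)/factorial(8)      % = 90 arrangements of eight 3s, one 2, one 1
p = [0.1 0.1 0.8];              % expected probabilities of the symbols {1,2,3}
10 * sum(-p .* log2(p))         % = 9.21... bits (Shannon entropy; assumed
                                %   background fact, not derived in the text)
```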

By examining the sequence of symbols against an expected frequency distribution for the symbols {1,2,3}, arithmetic encoding reduces the above message to a real number in the range (0,1):

0.40576171875 <===> [ 3 3 1 3 3 3 3 3 2 3 ]

In practical terms, the message is encoded as the bits of the binary fraction shown below. Although such a fraction could in principle require a very large number of bits, every message is completely represented by a finite prefix of them; in this example only 12 bits are necessary.

0.40576171875 = 0 . 0 1 1 0 0 1 1 1 1 1 1 0

Furthermore, if the range (0,1) is mapped to a 12-bit range such as (0x0000, 0x0FFF) in hexadecimal, or (0,4095) in decimal, by scaling with 2^{12} = 4096, then the real number 0.40576171875 becomes the integer 1662 = 0x067E.

1662 = 0 1 1 0 0 1 1 1 1 1 1 0 = 0x067E
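As a quick check, the mapping between the binary fraction and the 12-bit integer can be reproduced in a few lines of MATLAB:

```matlab
% Verify the fraction-to-integer mapping by scaling with 2^12 = 4096.
x = 0.40576171875;
n = x * 2^12                    % = 1662 exactly
dec2bin(n, 12)                  % = '011001111110'  (= 0x67E)
```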

In the following examples, the above message will be coded as a single 12-bit value and then decoded back to the original symbols.

Beginning with the 12-bit range (0x0000, 0x0FFF) in hexadecimal, or (0,4095) in decimal, divide the range according to the expected ratios {0.1, 0.1, 0.8}, so that the boundaries fall at floor(0.1 × 4095) = 409 and floor(0.2 × 4095) = 819:

0 | 0.1 | 409 | 0.1 | 819 | 0.8 | 4095 |

Then, label each range with the corresponding symbol {1,2,3} as:

0 | [1] | 409 | [2] | 819 | [3] | 4095 |

Now, the first message symbol is [3], which means that the final encoded message will be a number between 819 and 4095. Further message symbols will narrow this range until a single value remains to represent the entire message.

Next, repeat the procedure for the range (819,4095) giving:

819 | [1] | 1146 | [2] | 1474 | [3] | 4095 |

The second message symbol [3] refines the result to a number between 1474 and 4095.

Next, consider the range from (1474,4095) identified above and repeat the same procedure, giving:

1474 | [1] | 1736 | [2] | 1998 | [3] | 4095 |

The third symbol [1] narrows the encoded message to some number between 1474 and 1736. Any number in this range would represent the first three message symbols [3,3,1]. The process continues until the entire message reduces to a single number, in this case **1662**.
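The ten subdivision steps can be carried out in a few lines of code. The following MATLAB sketch is an illustration only, not MATLAB's built-in **arithenco**; in particular, the floor() rounding of the slice boundaries is an assumption chosen so that the sketch reproduces the ranges in this walkthrough.

```matlab
% Minimal sketch of the 12-bit interval subdivision traced above.
% Assumption: slice boundaries are rounded with floor(), which
% reproduces the ranges shown in this walkthrough.
message = [3 3 1 3 3 3 3 3 2 3];
cumP    = [0 0.1 0.2 1.0];          % cumulative expected frequencies of {1,2,3}
low = 0;  high = 4095;              % initial 12-bit range
for s = message
    width = high - low;
    high  = low + floor(cumP(s+1) * width);   % upper edge of this symbol's slice
    low   = low + floor(cumP(s)   * width);   % lower edge of this symbol's slice
end
code = round((low + high)/2)        % midpoint of the final range -> 1662
```

Running this reproduces the intermediate ranges shown above and ends with the range (1658, 1666), whose midpoint is **1662**.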

In the next section, the entire message is recovered from the value **1662** together with knowledge of the expected symbol distribution {1/10, 1/10, 8/10}.

Beginning with the 12-bit range (0x0000, 0x0FFF) in hexadecimal or (0,4095) in decimal, divide the range according to the expected ratios, as:

0 | 0.1 | 409 | 0.1 | 819 | 0.8 | 4095 |

Then, assign the corresponding symbol {1,2,3} to each of the ranges as:

0 | [1] | 409 | [2] | 819 | [3] | 4095 |

Observe that the message value 1662 lies in the range labeled [3]. This is therefore the first symbol in the message.

Next, consider the range (819,4095) identified above and repeat the same procedure, giving:

819 | [1] | 1146 | [2] | 1474 | [3] | 4095 |

Again, the message value 1662 lies in the range labeled [3]; this gives the second symbol in the message.

Next, consider the range (1474,4095) identified above and repeat the same procedure, giving:

1474 | [1] | 1736 | [2] | 1998 | [3] | 4095 |

In this case, the message value 1662 lies in the range labeled [1]; this gives the third symbol in the message.

The process continues until the entire message is extracted, giving the message = **[3 3 1 3 3 3 3 3 2 3]**. The complete sequence of subdivisions is collected in the table below:

0 | [1] | 409 | [2] | 819 | [3] | 4095 |

819 | [1] | 1146 | [2] | 1474 | [3] | 4095 |

1474 | [1] | 1736 | [2] | 1998 | [3] | 4095 |

1474 | [1] | 1500 | [2] | 1526 | [3] | 1736 |

1526 | [1] | 1547 | [2] | 1568 | [3] | 1736 |

1568 | [1] | 1584 | [2] | 1601 | [3] | 1736 |

1601 | [1] | 1614 | [2] | 1628 | [3] | 1736 |

1628 | [1] | 1638 | [2] | 1649 | [3] | 1736 |

1649 | [1] | 1657 | [2] | 1666 | [3] | 1736 |

1657 | [1] | 1657 | [2] | 1658 | [3] | 1666 |

>>> | 1662 | <<< |

From the table, the message symbols lead to the value 1662 and vice versa, while the expected frequency statistics {1/10, 1/10, 8/10} are reflected in the way each line is subdivided. The final value 1662 was chosen to lie midway in the range 1659 to 1666, and *any of these values, such as 1660, would represent this same message.*
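The decoding loop mirrors the encoder sketch given after the encoding walkthrough: at each step, find the slice containing the code value, emit that slice's symbol, and narrow the range to that slice. Again this is an illustration under the same floor() rounding assumption, not the built-in **arithdeco**:

```matlab
% Companion sketch: decode the value 1662 back into ten symbols.
code = 1662;
cumP = [0 0.1 0.2 1.0];             % cumulative expected frequencies of {1,2,3}
low = 0;  high = 4095;              % initial 12-bit range
decoded = zeros(1,10);
for k = 1:10
    width = high - low;
    edges = low + floor(cumP * width);     % slice boundaries for {1,2,3}
    s = find(code >= edges, 1, 'last');    % slice that contains the code value
    decoded(k) = s;
    low = edges(s);  high = edges(s+1);    % narrow to that slice
end
disp(decoded)                       % -> 3 3 1 3 3 3 3 3 2 3
```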

The best compression is achieved when the actual symbols match the expected frequency distribution, as they do in this case. Each time a symbol is encountered, its corresponding range is subdivided. When encoding the message [ 3 3 1 3 3 3 3 3 2 3 ], the range for {3} is subdivided eight times; this works here because a relatively wide range for {3} was allocated at the outset, *based on the expected occurrences of* {3}.

In the final line of the table above, there is no space for any more symbols: the remaining range for {1} has vanished, the range for {2} spans a single value, and {3} is almost at its limit. For longer messages, or for messages containing more occurrences of the symbols {1} or {2}, more bits are required and the range of the table values widens accordingly.

It is *not* necessary that the message symbol distribution exactly match the expected frequency statistics, and in general there will not be an exact match (as there is here, where {3} occurs exactly 8/10 of the time). However, if the message symbol occurrences are very different from the expected statistics, arithmetic coding can require more bits than the original message. For very long messages, the statistics could be updated dynamically based on observed symbol counts, as sketched below.
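A hedged sketch of that adaptive idea, reusing the loop structure of the earlier encoder sketch; the initial flat counts and the update rule are assumptions for illustration, and a real decoder would have to mirror the same updates:

```matlab
% Sketch of adaptive statistics: start from flat counts and update them
% as symbols are coded. Real coders also renormalize and emit bits as
% the range narrows; that machinery is omitted here.
counts = [1 1 1];                   % assumed initial (uninformed) counts for {1,2,3}
low = 0;  high = 4095;
for s = [3 3 1 3 3 3 3 3 2 3]
    cumP  = [0 cumsum(counts)] / sum(counts);   % current probability estimate
    width = high - low;
    high  = low + floor(cumP(s+1) * width);     % subdivide as in the encoder sketch
    low   = low + floor(cumP(s)   * width);
    counts(s) = counts(s) + 1;                  % update after coding the symbol
end
```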

The message is encoded into the bit vector **tx** by supplying the symbol statistics {1/10, 1/10, 8/10} to MATLAB's **arithenco** function in the form of a vector **counts**. In this case, a typical message of 1000 symbols is expected to contain the symbols {1,2,3} in the proportions [100 100 800], as shown.

```matlab
>> counts  = [ 100 100 800 ];            % expected symbol frequencies
>> message = [ 3 3 1 3 3 3 3 3 2 3 ];    % input message
>> tx = arithenco(message,counts);       % tx = encoded binary message
>> disp( tx(1:12) )                      % examine the first 12 bits
     0     1     1     0     0     1     1     1     1     1     0     0
>> bi2de( tx(1:12), 'left-msb' )         % the 12 bits as a decimal value
ans =
        1660
```

The transmitted value is 1660 rather than the hand-computed 1662; as noted above, any value in the final range, 1660 included, represents the same message.

Decoding **tx** with the same statistics recovers the original message:

```matlab
>> counts = [ 100 100 800 ];             % expected symbol frequencies
>> arithdeco(tx,counts,10)               % decode tx into 10 symbols
ans =
     3     3     1     3     3     3     3     3     2     3
```

**Reference:** Sayood, Khalid, *Introduction to Data Compression*, San Francisco: Morgan Kaufmann, 2000.