In Dictionary Compression, fewer bits are used for letters that are expected more frequently, yet each letter still has its own codeword. In contrast, arithmetic encoding conceptually represents an entire message as a single real number in the range (0,1), carried to as many fractional digits as the message requires. Consider the following ten-symbol message:

message = [ 3 3 1 3 3 3 3 3 2 3 ]

Using 2 bits per symbol implicitly allows four symbols {0,1,2,3}. Already this is inefficient for this message, since the symbol {0} is never used. On the other hand, there are 3^{10} = 59049 possible ten-symbol messages using only the set {1,2,3}, which suggests that 16 bits are sufficient to represent all possible ten-symbol messages (because 2^{16} = 65536). However, if the three symbols {1,2,3} do not occur with equal frequency, then further optimization is possible.

For example, if it is known that all 10-symbol messages contain exactly eight {3}s, one {2}, one {1}, and no {0}s, then there are only 10!/(8! 1! 1!) = 90 possible messages, which can be coded in 7 bits (since 2^{7} = 128). More realistically, if statistics show that in typical messages the symbol {3} occurs with a *probability* of 0.8, this information can be used to reduce the number of bits required to represent a similar message of any length.
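These counting arguments are easy to check numerically. The following MATLAB sketch (MATLAB is the language of the coding examples at the end of this section) verifies the two counts and also computes the Shannon entropy of the {0.1, 0.1, 0.8} statistics, a standard lower bound on the average number of bits needed; the entropy bound is background knowledge rather than something derived in the text.

```matlab
% Numerical checks of the counting arguments above.
log2(3^10)                      % = 15.85..., so 16 bits cover all 59049 messages
factorial(10)/factorial(8)      % = 90 arrangements of eight 3s, one 2, one 1
p = [0.1 0.1 0.8];              % expected probabilities of the symbols {1,2,3}
10 * sum(-p .* log2(p))         % = 9.21... bits (Shannon entropy; assumed
                                %   background fact, not derived in the text)
```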

By examining the sequence of symbols against an expected frequency distribution for the symbols {1,2,3}, arithmetic encoding reduces the above message to a real number in the range (0,1):

0.40576171875 <===> [ 3 3 1 3 3 3 3 3 2 3 ]

In practical terms, the message is encoded as the bits of the binary fraction shown below. Although such a fraction could in principle require a very large number of bits, every message is completely represented by a finite prefix of them; in this example only 12 bits are necessary.

0.40576171875 = 0 . 0 1 1 0 0 1 1 1 1 1 1 0

Furthermore, if the range (0,1) is mapped to a 12-bit range such as (0x0000, 0x0FFF) in hexadecimal, or (0,4095) in decimal, by scaling with 2^{12} = 4096, then the real number 0.40576171875 becomes the integer 1662 = 0x067E.

1662 = 0 1 1 0 0 1 1 1 1 1 1 0 = 0x067E
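As a quick check, the mapping between the binary fraction and the 12-bit integer can be reproduced in a few lines of MATLAB:

```matlab
% Verify the fraction-to-integer mapping by scaling with 2^12 = 4096.
x = 0.40576171875;
n = x * 2^12                    % = 1662 exactly
dec2bin(n, 12)                  % = '011001111110'  (= 0x67E)
```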

In the following examples, the above message will be coded as a single 12-bit value and then decoded back to the original symbols.

Beginning with the 12-bit range (0x0000, 0x0FFF) in hexadecimal, or (0,4095) in decimal, divide the range according to the expected ratios {0.1, 0.1, 0.8}, so that the boundaries fall at floor(0.1 × 4095) = 409 and floor(0.2 × 4095) = 819:

0 | 0.1 | 409 | 0.1 | 819 | 0.8 | 4095 |

Then, label each range with the corresponding symbol {1,2,3} as:

0 | [1] | 409 | [2] | 819 | [3] | 4095 |

Now, the first message symbol is [3], which means that the final encoded message will be a number between 819 and 4095. Further message symbols will narrow this range until a single value remains to represent the entire message.

Next, repeat the procedure for the range (819,4095) giving:

819 | [1] | 1146 | [2] | 1474 | [3] | 4095 |

The second message symbol [3] refines the result to a number between 1474 and 4095.

Next, consider the range from (1474,4095) identified above and repeat the same procedure, giving:

1474 | [1] | 1736 | [2] | 1998 | [3] | 4095 |

The third symbol [1] narrows the encoded message to some number between 1474 and 1736. Any number in this range would represent the first three message symbols [3,3,1]. The process continues until the entire message reduces to a single number, in this case **1662**.
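The ten subdivision steps can be carried out in a few lines of code. The following MATLAB sketch is an illustration only, not MATLAB's built-in **arithenco**; in particular, the floor() rounding of the slice boundaries is an assumption chosen so that the sketch reproduces the ranges in this walkthrough.

```matlab
% Minimal sketch of the 12-bit interval subdivision traced above.
% Assumption: slice boundaries are rounded with floor(), which
% reproduces the ranges shown in this walkthrough.
message = [3 3 1 3 3 3 3 3 2 3];
cumP    = [0 0.1 0.2 1.0];          % cumulative expected frequencies of {1,2,3}
low = 0;  high = 4095;              % initial 12-bit range
for s = message
    width = high - low;
    high  = low + floor(cumP(s+1) * width);   % upper edge of this symbol's slice
    low   = low + floor(cumP(s)   * width);   % lower edge of this symbol's slice
end
code = round((low + high)/2)        % midpoint of the final range -> 1662
```

Running this reproduces the intermediate ranges shown above and ends with the range (1658, 1666), whose midpoint is **1662**.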

In the next section, the entire message is recovered from the value **1662** together with knowledge of the expected symbol distribution {1/10, 1/10, 8/10}.

Beginning with the 12-bit range (0x0000, 0x0FFF) in hexadecimal or (0,4095) in decimal, divide the range according to the expected ratios, as:

0 | 0.1 | 409 | 0.1 | 819 | 0.8 | 4095 |

Then, assign the corresponding symbol {1,2,3} to each of the ranges as:

0 | [1] | 409 | [2] | 819 | [3] | 4095 |

Observe that the message value 1662 lies in the range labeled [3]. This is therefore the first symbol in the message.

Next, consider the range (819,4095) identified above and repeat the same procedure, giving:

819 | [1] | 1146 | [2] | 1474 | [3] | 4095 |

Again, the message value 1662 lies in the range labeled [3]; this gives the second symbol in the message.

Next, consider the range (1474,4095) identified above and repeat the same procedure, giving:

1474 | [1] | 1736 | [2] | 1998 | [3] | 4095 |

In this case, the message value 1662 lies in the range labeled [1]; this gives the third symbol in the message.

The process continues until the entire message is extracted, giving the message = **[3 3 1 3 3 3 3 3 2 3]**. The complete sequence of subdivisions is collected in the table below:

0 | [1] | 409 | [2] | 819 | [3] | 4095 |

819 | [1] | 1146 | [2] | 1474 | [3] | 4095 |

1474 | [1] | 1736 | [2] | 1998 | [3] | 4095 |

1474 | [1] | 1500 | [2] | 1526 | [3] | 1736 |

1526 | [1] | 1547 | [2] | 1568 | [3] | 1736 |

1568 | [1] | 1584 | [2] | 1601 | [3] | 1736 |

1601 | [1] | 1614 | [2] | 1628 | [3] | 1736 |

1628 | [1] | 1638 | [2] | 1649 | [3] | 1736 |

1649 | [1] | 1657 | [2] | 1666 | [3] | 1736 |

1657 | [1] | 1657 | [2] | 1658 | [3] | 1666 |

>>> | 1662 | <<< |

From the table, the message symbols lead to the value 1662 and vice versa, while the expected frequency statistics {1/10, 1/10, 8/10} are reflected in the way each line is subdivided. The final value 1662 was chosen to lie midway in the range 1659 to 1666, and *any of these values, such as 1660, would represent this same message.*
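The decoding loop mirrors the encoder sketch given after the encoding walkthrough: at each step, find the slice containing the code value, emit that slice's symbol, and narrow the range to that slice. Again this is an illustration under the same floor() rounding assumption, not the built-in **arithdeco**:

```matlab
% Companion sketch: decode the value 1662 back into ten symbols.
code = 1662;
cumP = [0 0.1 0.2 1.0];             % cumulative expected frequencies of {1,2,3}
low = 0;  high = 4095;              % initial 12-bit range
decoded = zeros(1,10);
for k = 1:10
    width = high - low;
    edges = low + floor(cumP * width);     % slice boundaries for {1,2,3}
    s = find(code >= edges, 1, 'last');    % slice that contains the code value
    decoded(k) = s;
    low = edges(s);  high = edges(s+1);    % narrow to that slice
end
disp(decoded)                       % -> 3 3 1 3 3 3 3 3 2 3
```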

The best compression is achieved when the actual symbols match the expected frequency distribution, as they do in this case. Each time a symbol is encountered, its corresponding range is subdivided. When encoding the message [ 3 3 1 3 3 3 3 3 2 3 ], the range for {3} is subdivided eight times; this works here because a relatively wide range for {3} was allocated at the outset, *based on the expected occurrences of* {3}.

In the final line of the table above, there is no space for any more symbols: the remaining range for {1} has vanished, the range for {2} spans a single value, and {3} is almost at its limit. For longer messages, or for messages containing more occurrences of the symbols {1} or {2}, more bits are required and the range of the table values widens accordingly.

It is *not* necessary that the message symbol distribution exactly match the expected frequency statistics, and in general there will not be an exact match (as there is here, where {3} occurs exactly 8/10 of the time). However, if the message symbol occurrences are very different from the expected statistics, arithmetic coding can require more bits than the original message. For very long messages, the statistics could be updated dynamically based on observed symbol counts, as sketched below.
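A hedged sketch of that adaptive idea, reusing the loop structure of the earlier encoder sketch; the initial flat counts and the update rule are assumptions for illustration, and a real decoder would have to mirror the same updates:

```matlab
% Sketch of adaptive statistics: start from flat counts and update them
% as symbols are coded. Real coders also renormalize and emit bits as
% the range narrows; that machinery is omitted here.
counts = [1 1 1];                   % assumed initial (uninformed) counts for {1,2,3}
low = 0;  high = 4095;
for s = [3 3 1 3 3 3 3 3 2 3]
    cumP  = [0 cumsum(counts)] / sum(counts);   % current probability estimate
    width = high - low;
    high  = low + floor(cumP(s+1) * width);     % subdivide as in the encoder sketch
    low   = low + floor(cumP(s)   * width);
    counts(s) = counts(s) + 1;                  % update after coding the symbol
end
```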

The message is encoded into the bit vector **tx** by supplying the symbol statistics {1/10, 1/10, 8/10} to MATLAB's **arithenco** function in the form of a vector **counts**. In this case, a typical message of 1000 symbols is expected to contain the symbols {1,2,3} in the proportions [100 100 800], as shown.

```matlab
>> counts  = [ 100 100 800 ];            % expected symbol frequencies
>> message = [ 3 3 1 3 3 3 3 3 2 3 ];    % input message
>> tx = arithenco(message,counts);       % tx = encoded binary message
>> disp( tx(1:12) )                      % examine the first 12 bits
     0     1     1     0     0     1     1     1     1     1     0     0
>> bi2de( tx(1:12), 'left-msb' )         % the 12 bits as a decimal value
ans =
        1660
```

The transmitted value is 1660 rather than the hand-computed 1662; as noted above, any value in the final range, 1660 included, represents the same message.

Decoding **tx** with the same statistics recovers the original message:

```matlab
>> counts = [ 100 100 800 ];             % expected symbol frequencies
>> arithdeco(tx,counts,10)               % decode tx into 10 symbols
ans =
     3     3     1     3     3     3     3     3     2     3
```

**Reference:** Sayood, Khalid, *Introduction to Data Compression*, San Francisco: Morgan Kaufmann, 2000.