Task 1 - Shannon-Fano encoding and Huffman encoding

Suppose we have the following alphabet and probability model:

Symbol  Probability
a       0.35
b       0.17
c       0.17
d       0.16
e       0.15

a)

Find a Shannon-Fano code for this model, and give the resulting source code in table form. How many bits are used on average per symbol after coding (if we do not count the space needed to store the source code itself)?
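
To check a hand-constructed code, here is a minimal Python sketch of the top-down splitting procedure. Note that the split rule used here (minimizing the probability difference between the two halves) is one common convention; textbook variants may produce slightly different codes.

    def shannon_fano(table):
        # table: list of (symbol, probability), sorted by decreasing probability.
        if len(table) == 1:
            return {table[0][0]: ""}
        # Choose the split that makes the two halves as equal as possible.
        total = sum(p for _, p in table)
        acc, best_diff, split = 0.0, total, 1
        for i, (_, p) in enumerate(table[:-1]):
            acc += p
            if abs(2 * acc - total) < best_diff:
                best_diff, split = abs(2 * acc - total), i + 1
        # Recurse: prefix the left half with "0" and the right half with "1".
        code = {s: "0" + c for s, c in shannon_fano(table[:split]).items()}
        code.update({s: "1" + c for s, c in shannon_fano(table[split:]).items()})
        return code

    shannon_fano([("a", 0.35), ("b", 0.17), ("c", 0.17), ("d", 0.16), ("e", 0.15)])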

b)

Find a Huffman code for this model, and give the resulting source code in table form. How many bits are used on average per symbol after coding (if we do not count the space needed to store the source code itself)?

c)

Which symbol has a longer codeword in the Shannon-Fano source code than in the Huffman source code?

How can we, using the results from a) and b), conclude that it is best to use the shorter of these two codewords for this symbol, even after accounting for the other symbols that then need longer codewords?

d)

Use the knowledge from c) about which symbol is given too long a codeword in the Shannon-Fano code, together with an understanding of why the method makes this "error" (compared to constructing an optimal code), to create a probability model where the gap between the average bits per symbol after Shannon-Fano coding and after Huffman coding is even larger than for the model we started this task with. Compute the average number of bits per symbol for this model after Shannon-Fano coding and after Huffman coding.
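
For all the "average bits per symbol" questions above, the quantity is the probability-weighted codeword length, i.e. the sum over symbols of p(s) * len(codeword(s)). A one-line Python check (the dict layout here is just an assumption):

    def average_length(code, probs):
        # code: symbol -> codeword, probs: symbol -> probability.
        return sum(probs[s] * len(code[s]) for s in probs)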

Task 2 - Huffman decoding

We have the following source code

Symbol  Codeword
^       00101
D       00100
a       101
b       100
d       111
e       110
g       0101
h       00111
i       000
l       011
n       0100
t       00110

where ^ represents a whitespace. Use the source code to decode the following bit stream:

001000000101000001101010110010110000001111111010011000111101010011101100001000101

Hint: Represent the source code as a binary tree. Start at the left end of the bit stream and traverse down the binary tree while reading it bit by bit. When you reach a leaf node, record which symbol it is, and start again from the root of the binary tree. Repeat until the whole bit stream has been consumed.

PS: This source code is a Huffman code.
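
Because the code is prefix-free, the tree traversal in the hint is equivalent to accumulating bits in a buffer and matching the buffer against the codeword table. A minimal Python sketch of that idea:

    CODEWORDS = {"00101": "^", "00100": "D", "101": "a", "100": "b",
                 "111": "d", "110": "e", "0101": "g", "00111": "h",
                 "000": "i", "011": "l", "0100": "n", "00110": "t"}

    def decode(bitstream):
        symbols, buffer = [], ""
        for bit in bitstream:
            buffer += bit
            if buffer in CODEWORDS:        # reached a leaf
                symbols.append(CODEWORDS[buffer])
                buffer = ""                # start again from the root
        return "".join(symbols)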

Task 3 - Huffman coding without coding redundancy

Find the source code of a Huffman code for the message

DIGITAL OVERALT!

How (and why) can we find the entropy for this message without doing any logarithm computations?

Compute the entropy (without any logarithms) and verify the answer by computing it directly from the entropy definition, H = -sum_i p_i log2(p_i).
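
For the verification step, a direct evaluation of the definition from the message histogram (a small Python check):

    import math
    from collections import Counter

    def entropy(message):
        # H = -sum_i p_i * log2(p_i), with p_i taken from the histogram.
        n = len(message)
        return -sum(c / n * math.log2(c / n) for c in Counter(message).values())

    print(entropy("DIGITAL OVERALT!"))   # should match the log-free computation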

Task 4 - Implementation of Huffman encoding

Implement a program that encodes a message using Huffman coding, with the histogram of the symbols in the message as the probability model.
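
A minimal sketch of one possible approach, using a heap of weighted subtrees. The representation and names are illustrative, and ties may be broken differently than in a hand computation (the codeword lengths are still optimal):

    import heapq
    from collections import Counter

    def huffman_code(message):
        # Heap entries are (weight, tie-breaker, subtree), where a subtree is
        # a symbol (leaf) or a pair of subtrees (internal node). The unique
        # integer tie-breaker keeps heapq from ever comparing subtrees.
        heap = [(w, i, sym) for i, (sym, w) in enumerate(Counter(message).items())]
        heapq.heapify(heap)
        i = len(heap)
        while len(heap) > 1:
            w1, _, t1 = heapq.heappop(heap)
            w2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (w1 + w2, i, (t1, t2)))
            i += 1
        code = {}
        def walk(tree, prefix):
            if isinstance(tree, tuple):          # internal node: go deeper
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:                                # leaf: assign the codeword
                code[tree] = prefix or "0"       # "0" if the alphabet has one symbol
        walk(heap[0][2], "")
        return code

    def encode(message):
        code = huffman_code(message)
        return "".join(code[s] for s in message)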

Task 5 - Arithmetic coding, Huffman coding and entropy

We have the following message

adcdcdddba

a)

Compute the mean number of bits per symbol after a Huffman encoding of this message.

b)

(Note: this subtask is somewhat computation-heavy.) Compute the arithmetic code of this message, followed by an END-OF-DATA (EOD) symbol. Use the following model:

Symbol  Probability
a       2/11
b       1/11
c       2/11
d       5/11
EOD     1/11

c)

What is the average number of bits per symbol after the arithmetic coding?

d)

The entropy of this message is about 1.73 bits per symbol. Why can the entropy be higher than the mean number of bits per symbol after arithmetic coding?

e)

If we had generated many different messages that each contain the same symbols with the same occurrence frequencies as above, i.e. each message is a random permutation of the original message, do you think the coding redundancy after arithmetic coding would, on average, have been positive or negative? Explain.

Note that since every message has the same histogram, both the entropy and the mean number of bits per symbol after Huffman encoding will be identical to those of the original message.

Task 6 - Implementation of arithmetic coding

Implement a program that uses arithmetic coding to encode a sequence of symbols. Use a predefined probability model, or the histogram of the symbols in the message. The program should first find the intervals, and then the shortest possible binary representation of a number contained in the final interval.

You can decide for yourself how to handle the end-of-message problem. A common solution is to include an EOD symbol (but remember that this symbol also needs to be assigned a probability).

Tips for high-precision arithmetic:

Matlab:

http://www.mathworks.se/help/symbolic/sym.html

Python:

https://docs.python.org/3/library/decimal.html
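
As a starting point, here is a minimal Python sketch that sidesteps the precision problem by using exact rational arithmetic (fractions.Fraction) instead of high-precision floats. The model shown is the one from task 5 b); all names are illustrative:

    import math
    from fractions import Fraction

    MODEL = {"a": Fraction(2, 11), "b": Fraction(1, 11), "c": Fraction(2, 11),
             "d": Fraction(5, 11), "EOD": Fraction(1, 11)}

    def arithmetic_encode(symbols, model):
        # Assign each symbol a cumulative sub-interval [c_lo, c_hi) of [0, 1).
        cum, acc = {}, Fraction(0)
        for sym, p in model.items():
            cum[sym] = (acc, acc + p)
            acc += p
        # Narrow the current interval symbol by symbol.
        low, high = Fraction(0), Fraction(1)
        for sym in symbols:
            width = high - low
            low, high = low + width * cum[sym][0], low + width * cum[sym][1]
        # Shortest bit string b1..bk such that 0.b1..bk lies in [low, high).
        k = 1
        while True:
            n = math.ceil(low * 2 ** k)          # smallest n with n / 2^k >= low
            if Fraction(n, 2 ** k) < high:
                return format(n, "b").zfill(k)
            k += 1

    bits = arithmetic_encode(list("adcdcdddba") + ["EOD"], MODEL)

Exact fractions grow long for long messages; the symbolic and decimal tools linked above are alternatives if you prefer floating point with extended precision.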