Huffman Coding Demonstration
This is a demonstration of Huffman coding. Huffman coding is an algorithm for creating codes for each character in a message, with the goal of giving the shortest codes to the most frequently used characters. That way, the coded version of the message is as short as possible, which is why this idea sits at the heart of many lossless compression schemes.
Example
Let's say we wanted to encode the message "Mississippi", giving each letter a binary code (a code made of ones and zeros). The message has 4 different characters: M, i, s, and p. Therefore, we will need at least 2 bits (a bit is a 1 or 0) per character, because there are 4 possible codes you could make from 2 bits: 00, 01, 10, and 11. Let's say we came up with the following encoding: M = 00, i = 01, s = 10, and p = 11. Then, the encoded message would look like this:
Mississippi = 00 01 10 10 01 10 10 01 11 11 01 = 0001101001101001111101
Since we know each code is 2 bits, we don't have to put spaces between each code to tell them apart. This is important because in practice the spaces would themselves have to be represented by a code, which would greatly increase the length of the message. Using this encoding without spaces, the encoded message comes out to 22 bits (11 letters * 2 bits per letter).
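If you prefer to see this in code, here is a tiny Python sketch of the fixed-width encoding described above. The table and message come straight from the example; the names are purely illustrative and have nothing to do with the demo's own code.

FIXED_CODES = {"M": "00", "i": "01", "s": "10", "p": "11"}

message = "Mississippi"
encoded = "".join(FIXED_CODES[ch] for ch in message)

print(encoded)       # 0001101001101001111101
print(len(encoded))  # 22 bits: 11 letters * 2 bits per letter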
This works, but we can do better. We know that the letters "i" and "s" appear a lot more often in the message than "p" or "M". If we could give the frequent letters shorter codes, the overall effect would be to decrease the size of the encoded message, since shorter codes would appear more often. Here is how you might initially try to do this: i = 0, s = 1, M = 00, p = 01. Great! The frequent letters have shorter codes. Let's see the result of using this coding:
Mississippi = 00 0 1 1 0 1 1 0 01 01 0 = 00011011001010
Unfortunately, we run into a problem here. The version without the spaces, which is the version we actually care about, is ambiguous. Look at those first two bits, 00. Without spaces, there's no way to tell whether that 00 represents an "M" or two "i"s. Therefore, we need to come up with a coding that gives the common letters short codes but is also unambiguous. Let's start over.
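To make the problem concrete, here is a small Python check (again, just an illustrative sketch) showing that two different messages produce exactly the same bits under this flawed coding:

AMBIGUOUS_CODES = {"i": "0", "s": "1", "M": "00", "p": "01"}

def encode(message, codes):
    return "".join(codes[ch] for ch in message)

print(encode("M", AMBIGUOUS_CODES))   # 00
print(encode("ii", AMBIGUOUS_CODES))  # 00 -- same bits, so a decoder can't tell them apart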
First, let's try to make something work with the coding i = 0. We could try to make s = 1, but then we run into the same problem: whatever codes we come up with for M and p will be made of multiple 1s and 0s, which could be read as a series of i's and s's, so there would be ambiguity. So, the code for "s" cannot be 1 (and it cannot be 0, since that is taken by "i"). In fact, to avoid ambiguity, let's make a rule: the code for a character cannot appear at the beginning of the code for another character. Since i = 0, that means all the other codes have to start with 1. So, let's say that s = 10. Following our rule, that means the codes for M and p cannot start with 0 or 10. From that you can deduce they must start with 11. We can then finish our code by saying M = 110 and p = 111. Our finished code is therefore i = 0, s = 10, M = 110 and p = 111. Here's the result using this coding:
Mississippi = 110 0 10 10 0 10 10 0 111 111 0 = 110010100101001111110
If you go through the coded version without spaces, it becomes clear that there is no ambiguity. Since no code begins with another code, you always know when you've reached the end of a code. This version ends up being 21 bits long, as opposed to the 22 bits using the fixed-length encoding. We have shortened the coded version of the message by choosing a more efficient encoding.
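Here is a short Python sketch of encoding and decoding with this prefix code. The decoder just reads bits until they exactly match one of the codes, which only works because no code begins with another code. (As before, this is purely illustrative and not the demo's code.)

PREFIX_CODES = {"i": "0", "s": "10", "M": "110", "p": "111"}

def encode(message, codes):
    return "".join(codes[ch] for ch in message)

def decode(bits, codes):
    reverse = {code: ch for ch, code in codes.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:   # a complete code has been read
            decoded.append(reverse[current])
            current = ""
    return "".join(decoded)

bits = encode("Mississippi", PREFIX_CODES)
print(bits)                        # 110010100101001111110
print(len(bits))                   # 21 bits
print(decode(bits, PREFIX_CODES))  # Mississippi -- round trip with no ambiguity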
That is essentially what a Huffman code is. It is a coding for a message that creates the shortest possible coded version by giving the shortest codes to the most frequent letters while making sure no code begins with any other code. What makes Huffman codes even nicer is that there is a fast algorithm to generate them from a message. This is very useful for data compression. Let's say I had a text file with the word "Mississippi" typed 10,000 times. The size of that file would be: 10,000 words * 11 letters per word * at least 2 bits per letter = at least 220,000 bits. Then, let's say I used Huffman coding to compress that file. We know from the earlier example that each "Mississippi" would become 21 bits. The size of the compressed version would then be: 10,000 words * 21 bits per word + about 100 bits to store the coding = about 210,100 bits. We saved about 9,900 bits of space using Huffman coding. Most text files would see much greater savings, because some letters are far less frequent than others.
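For the curious, here is that back-of-the-envelope calculation written out in Python; the roughly 100 bits for storing the code table is the same rough estimate used above.

copies = 10_000

fixed_size = copies * 11 * 2      # 11 letters per word, 2 bits per letter
huffman_size = copies * 21 + 100  # 21 bits per word, plus ~100 bits for the code table

print(fixed_size)                 # 220000
print(huffman_size)               # 210100
print(fixed_size - huffman_size)  # 9900 bits saved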
As a final note, I should point out that Huffman codes only work well for messages in which some characters are more frequent than others. If all characters appeared the same number of times, giving certain characters shorter codes would offer no advantage. Thankfully, pretty much all real-world files have unbalanced character distributions, which means they can be compressed losslessly using Huffman codes.
Huffman Trees
So, what is the tree that the demo shows you? That is called a Huffman tree, and it is a way of visualizing the Huffman code of a message. You may notice that each "leaf" of the tree contains a letter, with one leaf for each letter that appears in the message. To get the code for that letter, start at the "root" at the top of the tree (computer scientists draw trees upside-down) and work your way down to the letter in question. Every time you go right, add a 1 to the code. Every time you go left, add a 0 to the code. Once you reach the leaf, you have spelled out the full code for the letter in question. As a result of this representation, the frequent letters with short codes appear closer to the top of the screen near the root.
Why show the Huffman code in this way? First of all, it is more interesting than just seeing a list of codes. You can also see that the rule "no code can begin with another code" results in the nice property that each letter ends up at a leaf. More importantly, though, the tree is closely related to the algorithm which generates Huffman codes. I won't describe it here, but this video does a good job of explaining it. Essentially, the tree is built up from the letters using their frequencies as a guide. As another note, the numbers you see in the non-leaf nodes of the tree represent how many times any of the letters below them appear in the message. This is also an important component of the algorithm which creates the tree.
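If you want a feel for how the tree is built, here is a compact Python sketch of the standard construction: repeatedly merge the two least frequent subtrees, then read the codes off the finished tree (left edge = 0, right edge = 1). This is my own illustrative version, not the demo's code, and ties between equally frequent letters can be broken either way, so the exact codes it prints may differ from the hand-worked example even though the total length for "Mississippi" still comes out to 21 bits.

import heapq
from collections import Counter

def build_codes(message):
    # Each heap entry is (frequency, tie_breaker, subtree). A leaf is just a
    # character; an internal node is a (left, right) pair.
    counts = Counter(message)
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # pull out the two least frequent subtrees...
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, (left, right)))  # ...and merge them
        next_id += 1
    _, _, tree = heap[0]

    codes = {}
    def walk(node, path):
        if isinstance(node, str):      # leaf: a single character
            codes[node] = path or "0"  # edge case: message with only one distinct letter
        else:
            left, right = node
            walk(left, path + "0")     # going left adds a 0
            walk(right, path + "1")    # going right adds a 1
    walk(tree, "")
    return codes

codes = build_codes("Mississippi")
print(codes)
print(sum(len(codes[ch]) for ch in "Mississippi"))  # 21 bits, same total as before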
That's essentially it: a basic description of Huffman codes and Huffman trees. If you haven't already, give the above demo a try and see how the Huffman tree changes with the message.