Huffman Coding

Notes

If you’ve ever sent a large file to a friend, you may have compressed it into a zip archive like the one on this slide before doing so. At its core, file compression means representing data with fewer bits than it would normally take.

There are many ways to compress data, but we’ll focus here on a method called Huffman Coding.

ASCII is a fixed-length encoding scheme, in which files are decoded by reading blocks of constant length (8 bits per character). Huffman encodings, on the other hand, are variable-length encodings, meaning that different characters can have representations of different lengths. If we can figure out which characters occur most often and encode them with fewer than 8 bits, we’ll use less space overall than ASCII does. We will then have to encode some of the less frequently occurring characters with more than 8 bits, but that’s okay as long as the net effect is a decrease in the total number of bits used!

An important property of Huffman coding is that no bit representation for any of the characters is a prefix of any other character’s representation. This is called the prefix property, and an encoding scheme with the prefix property is said to be immediately decodable.

Why is the prefix property so important?

Consider the following example:

  • If we use the coding on the left to encode “CAB”, we would get the bit-string 0101. However, presented with the bit-string 0101, there is no way to tell whether it represents “ABAB”, “CAB”, “ABC”, or “CC”.
  • In contrast, the coding on the right is unambiguous. The bit-string “110010” can only represent “CAB” (confirm it for yourself!).

We will use a binary tree to construct Huffman encodings for each character. First, however, we must figure out how often each character occurs in the file by building a frequency table.

Test Yourself!

After Huffman encoding, which of the letters in this string will be represented by the shortest bit-string?

Next, we’ll create a forest of trees, where each tree is a single node containing a character and the frequency of that character.

Then, we combine trees until the forest contains only one tree.

While there’s still more than one tree:

  • Remove the two trees with the smallest root frequencies from the forest.
  • Create a new node to serve as the root of a new tree, with the two removed trees as its children. This parent’s frequency is the sum of its children’s frequencies.
  • Insert the new tree back into the forest.

Nice! We now only have one tree!

Notice that all the frequencies add up to a total of 1.0.

To convert each letter into its Huffman encoding, we traverse the tree from the root to that letter’s leaf, writing down a 0 for every left step and a 1 for every right step.

  • For the letter “E,” we step right just once, so “E”’s encoding is 1.
  • For the letter “A,” we step left once and right once, so “A”’s encoding is 01.
  • “D” is 001.

What are the Huffman representations of “C” and “B”?

Frequency Analysis

Calculate character frequencies for the user-inputted string.

jharvard@run.cs50.net (~): ./a.out puppy
p = 3
u = 1
y = 1

Try out some pseudocode here!
#include <stdio.h>
#include <string.h>

#define SYMBOLS 256

int main(int argc, char* argv[])
{
    // TODO  
}

Encode

Write a program that prints out the Huffman encoding of this string:


ECEABEADCAEDEEEECEADEEEEEDBAAEABDBBAAEAAACDDCCEABEEDCBEEDEAEEEEEAEEDBCEBEEADEAEEDAEBCDEDEAEEDCEEAEEE


Given its Huffman tree:

[Huffman tree diagram]


jharvard@run.cs50.net (~): ./a.out
100011010000101001000101100111110001101001111110010000010110100000010000000001011010101000100100100010001101000011001000100001100110111111011100100000001100001101001101110010110000000100110011011100100011101111

Try out some pseudocode here!
#include <stdio.h>
#include <stdlib.h>

#define SYMBOLS 256

int main(int argc, char* argv[])
{
    // TODO
}

Videos

Monday, Week 7

An introduction to Huffman Coding.