## Notes

If you’ve ever sent a large file to a friend, you may have compressed it into a zip archive like the one on this slide before doing so. Basically, file compression relies on representing data with fewer bits than a standard encoding would use.

There are many ways to compress data, but we’ll focus here on a method called **Huffman Coding**.

ASCII is a **fixed-length encoding** scheme in which files are decoded by looking at file blocks of constant length (8 bits). Huffman encodings, on the other hand, are **variable-length encodings**, meaning that each character can have a representation of a different length. If we can somehow figure out which characters occur most often and encode them with fewer than 8 bits, we will be able to save space relative to ASCII. We will then have to encode some of the less frequently occurring characters with more than 8 bits, but that’s okay as long as the net effect is a decrease in the total number of bits used!
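To get a feel for the potential savings, here is a small back-of-the-envelope calculation in Python. The character counts and code lengths are invented for illustration (they are not from the slide), though the code lengths {1, 2, 3, 3} do correspond to a valid prefix code (e.g. 1, 01, 000, 001):

```python
# Hypothetical 1000-character file over a 4-letter alphabet.
counts      = {"e": 400, "t": 300, "x": 200, "z": 100}
code_length = {"e": 1,   "t": 2,   "x": 3,   "z": 3}

fixed_bits    = sum(counts.values()) * 8                         # every char costs 8 bits
variable_bits = sum(counts[c] * code_length[c] for c in counts)  # short codes for common chars

print(fixed_bits)     # 8000
print(variable_bits)  # 1900 -- less than a quarter of the fixed-length total
```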

An important property of Huffman coding is that no bit representation for any of the characters is a prefix of any other character’s representation. This is called the **prefix property**, and an encoding scheme with the prefix property is said to be **immediately decodable**.

Why is the prefix property so important?

Consider the following example:

- If we use the coding on the left to encode “CAB”, we get the bit-string 0101. However, given only the bit-string 0101, there is no way to tell whether it represents “ABAB”, “CAB”, “ABC”, or “CC”.
- In contrast, the coding on the right is unambiguous. The bit-string “110010” can only represent “CAB” (confirm it for yourself!).
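Here is a minimal Python sketch of why immediate decodability works. The codes A = 00, B = 10, C = 11 are inferred from the fact that “CAB” encodes to “110010”; they stand in for the coding on the right:

```python
# Codes inferred from the example: "CAB" -> "110010" implies C=11, A=00, B=10.
codes = {"A": "00", "B": "10", "C": "11"}
decode_table = {bits: char for char, bits in codes.items()}

def decode(bitstring):
    """Scan left to right. Because no code is a prefix of another,
    the first match in the table is always the correct one."""
    result, buffer = [], ""
    for bit in bitstring:
        buffer += bit
        if buffer in decode_table:   # a complete code -- emit it immediately
            result.append(decode_table[buffer])
            buffer = ""
    return "".join(result)

print(decode("110010"))  # CAB -- the only possible reading
```

With an ambiguous code like the one on the left, this greedy scan would fail, because a match might turn out to be the prefix of some longer code.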

We will use a binary tree to construct Huffman encodings for each character. First, however, we must figure out how often each character occurs in the file by building a **frequency table**.
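One simple way to build such a table in Python is with `collections.Counter`; the input string below is a made-up stand-in, not the string on the slide:

```python
from collections import Counter

text = "ABBCCCDDDDEEEEE"  # hypothetical file contents
counts = Counter(text)    # Counter({'E': 5, 'D': 4, 'C': 3, 'B': 2, 'A': 1})
total = len(text)

# Relative frequencies -- note that they sum to 1.0.
frequencies = {char: count / total for char, count in counts.items()}
print(frequencies)
# {'A': 0.0667, 'B': 0.1333, 'C': 0.2, 'D': 0.2667, 'E': 0.3333} (values rounded)
```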

Test Yourself!

After Huffman encoding, which of the letters in this string will be represented by the shortest bit-string?

Next, we’ll create a **forest** of **trees**, where each tree is a single node containing a character and the frequency of that character.

Then, we combine trees until the forest contains only one tree.

While there’s still more than one tree:

1. Remove the two trees from the forest that have the smallest frequencies in their roots.
2. Create a new node to be the root of a new tree, with the two trees removed in step 1 as its children. The frequency of this parent node is set to the sum of the frequencies of its two children.
3. Insert this new tree back into the forest.
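Here is a minimal Python sketch of this loop, using a min-heap as the forest. The `Node` class and the sample frequencies are assumptions for illustration, not the slide’s exact data:

```python
import heapq

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char, self.left, self.right = freq, char, left, right

    def __lt__(self, other):      # lets heapq order trees by root frequency
        return self.freq < other.freq

def build_tree(frequencies):
    # The forest starts as one single-node tree per character.
    forest = [Node(freq, char) for char, freq in frequencies.items()]
    heapq.heapify(forest)
    while len(forest) > 1:        # while there's still more than one tree:
        a = heapq.heappop(forest)                        # 1. remove the two trees with
        b = heapq.heappop(forest)                        #    the smallest frequencies
        parent = Node(a.freq + b.freq, left=a, right=b)  # 2. new root, freq = sum
        heapq.heappush(forest, parent)                   # 3. insert the new tree back
    return forest[0]              # the single remaining tree

root = build_tree({"A": 0.2, "B": 0.1, "C": 0.1, "D": 0.15, "E": 0.45})
print(root.freq)  # 1.0 (up to floating-point rounding) -- the root covers the whole file
```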

Nice! We now only have one tree!

Notice that all the character frequencies add up to a total of 1.0, which is exactly the frequency stored in the root of the finished tree.

To convert each letter into its Huffman encoding, we traverse the tree from the root down to that letter’s leaf, reading a 0 for each step left and a 1 for each step right.

- For the letter “E,” we step right just once, so “E”’s encoding is 1.
- For the letter “A,” we step left once and right once, so “A”’s encoding is 01.
- “D” is 001.
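Programmatically, the whole code table can be read off the tree with one recursive walk. This sketch assumes the `Node` class and `build_tree` from the construction sketch above, and the left-is-0, right-is-1 convention:

```python
def assign_codes(node, prefix="", table=None):
    """Walk the tree, accumulating 0 for left steps and 1 for right steps."""
    if table is None:
        table = {}
    if node.char is not None:                 # leaf: record the accumulated bits
        table[node.char] = prefix or "0"      # edge case: one-character alphabet
    else:                                     # internal node: recurse both ways
        assign_codes(node.left, prefix + "0", table)
        assign_codes(node.right, prefix + "1", table)
    return table

root = build_tree({"A": 0.2, "B": 0.1, "C": 0.1, "D": 0.15, "E": 0.45})
print(assign_codes(root))
# The exact bit-strings depend on how frequency ties were broken while
# building, but "E" (the most frequent letter) always gets the shortest code.
```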

What are the Huffman representations of “C” and “B”?