Huffman Coding

The Huffman Encoding Scheme
The Huffman encoding scheme is a lossless compression routine grounded in information theory. It requires two passes through the data in order to compress it. The first pass works out which patterns occur most frequently in the data, and sorts them accordingly. So, if you're encoding a piece of text, you may want to encode the letters, and so you'd do a first pass through the text and end up with a count for each letter, sorted by frequency.

If you were encoding this text, you could stick to all eight bits of the ASCII alphabet, but that's a waste, because you only need to represent 26 letters, and 8 bits gives you 256 values. So, you could choose (say) 5 bits, which would allow you 32 letters. This reduces the text to 5/8ths of its original size.
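The first pass boils down to counting. Here is a minimal sketch in C, assuming plain ASCII text with upper and lower case folded together; the name `count_letters` is made up for illustration and is not from any original listing:

```c
#include <ctype.h>

/* Count how often each letter a-z occurs in text, ignoring case.
   counts must have room for 26 entries; anything else is skipped. */
void count_letters(const char *text, unsigned long counts[26])
{
    for (int i = 0; i < 26; i++)
        counts[i] = 0;

    for (const unsigned char *p = (const unsigned char *)text; *p; p++) {
        int c = tolower(*p);
        if (c >= 'a' && c <= 'z')       /* assumes ASCII letter ordering */
            counts[c - 'a']++;
    }
}
```

Sorting the 26 counts afterwards gives the frequency-ordered list that the rest of the scheme works from.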

However, if we look at the data, most of it is 'e's and 't's and so on. It'd be nice if we could represent them with something quick like "1" or "011". It's not so bad if q is a full 8-bits, because it only appears a few times anyway.

This is how the Huffman scheme works. It assigns variable-length codes to the data, such that the more frequently encountered data is given a short code, and the rarer data is given a longer code.

The Huffman scheme can produce such good results that a variant of it is used in FAX machines. Seriously! Because most things you fax are a whole pile of white pixels with the odd black pixel, the FAX encoding has been tuned to send long runs of white as only a few bits, and common white/black run lengths are given short bitstrings. FAX machines also add some extra padding at the edges so that a loss during transmission doesn't wreck your whole document. If you think about it, all you have to lose is one bit, and you can't reconstruct anything from then on!

The Gotcha
The big gotcha is "How do I know when I've stopped reading one character, and started reading the next?". Well, let's look at it this way. Let's say we encode 'e' as zero and 't' as one. Then the bit stream 0101 is obviously "etet". But what if we also encode 'n' as "01"? Is 0101 then "etet", or is it "nn"?

We could add padding. If we say "we will add 11 between all characters", then it becomes easier: "0101" becomes "0 11 1 11 0 11 1 11", which we read as "e STOP t STOP e STOP t STOP". But we still have the problem that it could also be read as "01 11 1 1 0 11 1 11", or "n STOP t !? e STOP t STOP". Note that the second reading went wrong because it was expecting a STOP after the 't' and didn't get one.

But we've now expanded our short coding scheme to something twice as long, and it's still ambiguous.
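The ambiguity comes down to one code being the start of another ("0" is the start of "01"). As a rough illustration, here is a small C check for that condition, with codes written as strings of '0' and '1' characters; `is_prefix_free` is a hypothetical helper name:

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if no code in the set is the start of another code,
   0 otherwise. Duplicate codes also count as a clash. */
int is_prefix_free(const char *codes[], size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(codes[i]);
        for (size_t j = 0; j < n; j++) {
            if (i != j && strncmp(codes[i], codes[j], len) == 0)
                return 0;   /* codes[i] is a prefix of codes[j] */
        }
    }
    return 1;
}
```

The set {"0", "1", "01"} from the example above fails this check, whereas a set such as {"0", "10", "110"} passes it and can be read back without any STOP markers.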

How Huffman gets round the gotcha - Binary Trees!
Well, the Huffman scheme gets round this by using BINARY TREES. A binary tree is a structure made from nodes and links. Each node can have two links coming off it, joined to two other nodes 'below'. It's called 'binary' because of the two-children-per-node aspect. There are ternary trees (three children) and n-ary trees (n children), but they're not used here.

The scheme 'builds' a tree from the raw data by combining the two nodes with the smallest occurrence, and putting them under a new node whose 'occurrence' is the sum of the two children's occurrences. The two original children are then removed from consideration. So, with the above data, we would take 'z' and 'q' and make a new node with them as children. We could call it 'zq'. Because of the naff formatting here, I'll make the notation for a node NODENAME

Its 'occurrence' would be the sum of z's and q's, which is 20. We then take 'zq' and 'x', which are now the two smallest nodes, and make a new node 'zqx' which has sum 38.

So we now have:

and so on. The next step would be merging 'j' and 'v' ('x' has already been used up), because they are both smaller than 38, and so on. Eventually, you'll end up with the top node, whose id would be ETAOSRNIHDLUPCWMFGBKYJVXZQ, and whose occurrence would be the sum of all the letters, since it has every letter beneath it.
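The merging procedure can be sketched in C as follows. This is an illustration under assumptions rather than anyone's original code: nodes live in a flat array and refer to their children by index, the two smallest nodes are found by a plain linear scan, and the names `hnode`, `build_tree`, `code_length` and `MAX_SYMS` are invented for the example.

```c
#define MAX_SYMS 64

struct hnode {
    unsigned long weight;   /* the node's 'occurrence' */
    int left, right;        /* child indices, or -1 on a leaf */
};

/* Build a tree over n leaves (1 <= n <= MAX_SYMS) with weights w[].
   nodes must hold 2*n - 1 entries. Returns the index of the root. */
int build_tree(const unsigned long w[], int n, struct hnode nodes[])
{
    int active[MAX_SYMS];   /* nodes still under consideration */
    int count = n, next = n;

    for (int i = 0; i < n; i++) {
        nodes[i].weight = w[i];
        nodes[i].left = nodes[i].right = -1;
        active[i] = i;
    }
    while (count > 1) {
        /* a and b: positions of the smallest and second smallest */
        int a = 0, b = 1;
        if (nodes[active[b]].weight < nodes[active[a]].weight) { a = 1; b = 0; }
        for (int i = 2; i < count; i++) {
            unsigned long wt = nodes[active[i]].weight;
            if (wt < nodes[active[a]].weight) { b = a; a = i; }
            else if (wt < nodes[active[b]].weight) { b = i; }
        }
        /* new parent whose occurrence is the sum of its children's */
        nodes[next].weight = nodes[active[a]].weight + nodes[active[b]].weight;
        nodes[next].left = active[a];
        nodes[next].right = active[b];
        /* the parent takes one child's slot; the other slot is dropped */
        active[a] = next++;
        active[b] = active[count - 1];
        count--;
    }
    return active[0];
}

/* Depth of leaf `target` below `root`, i.e. its code length, or -1. */
int code_length(const struct hnode nodes[], int root, int target)
{
    if (root == target) return 0;
    if (nodes[root].left < 0) return -1;
    int d = code_length(nodes, nodes[root].left, target);
    if (d < 0) d = code_length(nodes, nodes[root].right, target);
    return d < 0 ? -1 : d + 1;
}
```

With the made-up weights 45, 13, 12, 16, 9 and 5, the root ends up with weight 100, the heaviest leaf sits one level below the root, and the two lightest sit four levels down. A real implementation would more likely use a priority queue than a linear scan, but the result is the same kind of tree.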

How the Bejesus does this help?
Well, to cut a long story short, the way that you build the tree means that if you take any bitstring and, starting at the root, go 'right' on a zero, or 'left' on a one (or vice versa, it's up to you), then when you get to the point where you'd be dropping off the tree, you've reached the end of a correctly encoded character. Every letter sits at a leaf, so no letter's code can be the start of another letter's code, and there's no need for any STOP padding.

Because you're building the tree from the bottom-up, the more frequently encoded letters would be at the top of the tree, with the less frequent at the bottom. To encode the document, you convert each letter to the bitstring path you have to traverse down the tree to get to it. So 'e' might be '0', while 't' might be '10', 'a' may be '110' and so on.

Once you've done this, it's usually just necessary to store the letters and their frequencies somewhere, and then encode the document. When you want to decode, you rebuild the tree as above, and do the reverse process, turning the bitstrings back to letters.
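To make the letter-to-bitstring step concrete, here is a small C sketch that uses only the three example codes quoted above ('e' = "0", 't' = "10", 'a' = "110"). The table is hard-coded purely for illustration; a real encoder would derive it from the tree, and `encode` and `decode` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

struct code { char letter; const char *bits; };

/* Example codes from the text; a real table would come from the tree. */
static const struct code table[] = {
    { 'e', "0" }, { 't', "10" }, { 'a', "110" },
};
static const size_t table_len = sizeof table / sizeof table[0];

/* Write the codes for each letter of text into out as '0'/'1' characters.
   Returns 0 on success, -1 if a letter has no code or out is too small. */
int encode(const char *text, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0) return -1;
    out[0] = '\0';
    for (; *text; text++) {
        const char *bits = NULL;
        for (size_t i = 0; i < table_len; i++)
            if (table[i].letter == *text) bits = table[i].bits;
        if (!bits || used + strlen(bits) + 1 > out_size) return -1;
        strcpy(out + used, bits);
        used += strlen(bits);
    }
    return 0;
}

/* The reverse process: repeatedly match one code at the front of bits. */
int decode(const char *bits, char *out, size_t out_size)
{
    size_t used = 0;
    while (*bits) {
        size_t i;
        for (i = 0; i < table_len; i++) {
            size_t len = strlen(table[i].bits);
            if (strncmp(bits, table[i].bits, len) == 0) {
                if (used + 2 > out_size) return -1;
                out[used++] = table[i].letter;
                bits += len;
                break;
            }
        }
        if (i == table_len) return -1;   /* no code matches here */
    }
    if (used >= out_size) return -1;
    out[used] = '\0';
    return 0;
}
```

Encoding "eat" with this table gives "011010", and decoding that string gives "eat" back. The matching in `decode` is only unambiguous because no code in the table is the start of another.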

You're still talking gibberish. We're not coding strings, dammit!
I know we're not coding strings, we're coding images, but the above information applies nonetheless. In this case the data is pixel pairs. We take pixel pairs because EGA graphics use half a byte (4 bits) to store each pixel, so a pair of them is one byte.

You just preprocess the image to find the most commonly-occurring pixel pairs, and sort them, then build the tree as above. Then, encode the image replacing each pixel pair with the appropriate bit string.

Here's a dodgy image I drew when trying to figure out what's going on:



As you can see, the pixel pair 0x00 ('black black') is the most common, needing only 1 bit, with 0xf0 (white black) and 0xaa (light green light green) also fairly common, needing 5 bits each. This comes from the END.CPA file after Vertical XORing, so you can see why black and green are the most popular.

The Size Data
Well, the images are stored in their own special-format MSQ blocks. The format is slightly different from the MSQ blocks found in the GAMEx files, because the "MSQ" marker is PRECEDED by four bytes, and it has a slightly different 'version number' than the GAMEx files. These four bytes define how big the image is going to be when it's fully unpacked, which allows the program to pre-allocate space for the unpacking. In the END.CPA file, these four bytes are 00 48 00 00, or really 00 00 48 00 (little-endian!), which is 0x4800 = 18,432 bytes. Thanks to our knowledge of the TITLE.PIC file, we suspect the END.CPA image will be 288x128 pixels, which at two pixels per byte is 144 x 128, or 18,432 bytes.

msq
After these four bytes are the letters "msq".

The file number
Immediately following the letters 'msq' there is a byte which is the file number. Unlike the GAMEx files where this byte was the character "0" or "1", in this case the byte is the value 0x00 or 0x01. Then, the next byte after this defines the start of the tree data.
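Putting the header fields together, a parser might look like the C sketch below. It assumes the layout exactly as described here (four size bytes, the letters "msq", one file-number byte, then the tree data), reads from a memory buffer, and uses invented names (`msq_header`, `parse_msq_header`):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct msq_header {
    uint32_t unpacked_size;  /* bytes the image occupies once unpacked */
    uint8_t  file_number;    /* 0x00 or 0x01 */
};

/* Parse the eight header bytes: size (little-endian), "msq", file number.
   Returns the offset of the tree data (8), or 0 if the check fails. */
size_t parse_msq_header(const uint8_t *buf, size_t len, struct msq_header *h)
{
    if (len < 8 || memcmp(buf + 4, "msq", 3) != 0)
        return 0;

    /* least significant byte first, so 00 48 00 00 is 0x00004800 */
    h->unpacked_size = (uint32_t)buf[0]
                     | (uint32_t)buf[1] << 8
                     | (uint32_t)buf[2] << 16
                     | (uint32_t)buf[3] << 24;
    h->file_number = buf[7];
    return 8;
}
```

Fed the END.CPA size bytes 00 48 00 00 followed by "msq" and 0x00, this yields an unpacked size of 18,432 and a tree-data offset of 8.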

The Tree Data
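The tree data is stored in what we've called "tail recursive" format: in effect, the tree is written out depth-first, first child before second. You have to take the data one BIT at a time. For each BIT which is zero, make a new pair of child tree-nodes, and descend into the first of them. As soon as you find a non-zero bit, the next eight bits hold the data for that node. Once you've filled the data in, go back up a level in the tree, and recurse through the parent node's other child. (In the example that follows, one further bit is taken off the stream before descending into that other child.)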

An example: the bytes 0x4A 0x2A 0x37 would be interpreted as follows.

They are the bits 01001010 00101010 00110111.


 * 1) Create a root node. [[image:huffroot.gif]]
 * 2) Take the zero, and create two children off the root. Descend into the left child.[[image:huffroot2.gif]]
 * 3) Take the one. This signals the start of a data byte.
 * 4) Take the next eight bits "00101000" and store them as the data for that node. (0x28) [[image:huffroot3.GIF]]
 * 5) Ascend back to the parent. Take off another bit "1" and descend into the second child. [[image:huffroot4.GIF]]
 * 6) Take the zero, and create two children off this node. Descend into the left child. [[image:huffroot5.GIF]]
 * 7) Take the one. This signals the start of a data byte.
 * 8) Take the next eight bits "01000110" and store them as the data for that node. (0x46). [[image:huffroot6.GIF]]
 * 9) Ascend back to the parent. Take off another bit "1" and descend into the second child.
 * 10) AND SO ON...

Now this is a simplified version of events. What is really happening is that it starts off by diving down to the deepest left-hand node, and begins construction down there, gradually 'filling' the leaves of the tree from the bottom, upwards towards the root. Once you've done both the children of any node, you pop back up a level. This last piece of control ensures that when the tree is complete, the pointer returns back up the tree and 'pops off' the top (since by then all children of the root have been dealt with).

The Image Data
Immediately following the tree data, there lives the image data which uses the tree. Like the tree data, you take this data one bit at a time. Starting at the root, if you get a '1', head right, or if you get a '0', head left. At the point where you'd be heading off the bottom of the tree, you've reached a 'leaf' node. It holds the byte you need to insert into the image. Repeat this operation until you've unpacked the number of bytes indicated in the msq header preamble.
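The bit-by-bit walk described above might be sketched in C like this. It assumes the tree is already in memory, that the packed data is in a buffer, that bits are taken most significant first, and that every non-leaf node has both children; the names `node` and `unpack` are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

struct node {
    struct node *left, *right;  /* both NULL on a leaf */
    uint8_t data;               /* byte stored at a leaf */
};

/* Unpack `count` bytes from `packed` into `out` by walking the tree one
   bit at a time: '1' heads right, '0' heads left. Returns the number of
   bytes written, which is less than `count` if the input runs out. */
size_t unpack(const struct node *root, const uint8_t *packed,
              size_t packed_len, uint8_t *out, size_t count)
{
    size_t written = 0, pos = 0;
    uint8_t mask = 0x80;

    while (written < count) {
        const struct node *n = root;
        while (n->left) {                   /* not at a leaf yet */
            if (pos >= packed_len)
                return written;             /* ran out of input */
            n = (packed[pos] & mask) ? n->right : n->left;
            mask >>= 1;
            if (mask == 0) { mask = 0x80; pos++; }
        }
        out[written++] = n->data;           /* leaf: emit its byte */
    }
    return written;
}
```

For a toy tree in which 0x00 has code "0", 0xF0 has "10" and 0xAA has "11", the single packed byte 0x58 (bits 01011000) unpacks to 0x00, 0xF0, 0xAA, 0x00 when four bytes are requested. The `count` argument is where the size from the header preamble would go.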

Typically, this data will then need the Vertical XOR operation running over it to convert it into a finished graphic.

Some Code
If you just want to use the huff tree, I'll give you the C code to build the tree now. This is just written in plain old C with some global variables to speed it up a little.

Next, we need a function which can read bits off a file input stream.

Then we'll need something which can read 8 bits sequentially off the file input stream, and return them :

And lastly, we need something which can build a Node

So, in order to kick this off, create yourself a root node, read in the first character to get the party started, and set up the bit mask to be 0x80 .... Something like the following:
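A minimal sketch of how these pieces might fit together, written as a reconstruction under assumptions rather than the original listing. All names (`node`, `read_bit`, `read_byte`, `build_node`, `read_tree`) are invented here, and the bit taken off before the second child follows steps 5 and 9 of the worked example, with its value simply ignored:

```c
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *left, *right;  /* both NULL on a leaf */
    unsigned char data;         /* byte held by a leaf */
};

/* Globals, in the spirit of the description above. */
static FILE *in;                /* stream positioned at the tree data */
static int cur;                 /* byte currently being consumed */
static unsigned char mask;      /* selects the next bit of cur */

/* Read one bit, most significant first. At end of file getc() returns
   EOF (-1), so further bits read as 1; a sketch, not robust handling. */
static int read_bit(void)
{
    int bit = (cur & mask) != 0;
    mask >>= 1;
    if (mask == 0) { mask = 0x80; cur = getc(in); }
    return bit;
}

/* Read eight bits sequentially and return them as one byte. */
static unsigned char read_byte(void)
{
    unsigned char value = 0;
    for (int i = 0; i < 8; i++)
        value = (unsigned char)(value << 1 | read_bit());
    return value;
}

/* Build one node and, recursively, everything beneath it. */
static struct node *build_node(void)
{
    struct node *n = calloc(1, sizeof *n);
    if (!n) { perror("calloc"); exit(EXIT_FAILURE); }

    if (read_bit()) {
        n->data = read_byte();      /* 1: leaf, next eight bits are data */
    } else {
        n->left = build_node();     /* 0: two children, first one first */
        (void)read_bit();           /* bit taken off before second child */
        n->right = build_node();
    }
    return n;
}

/* Kick-off: read the first character, set the mask to 0x80, build. */
struct node *read_tree(FILE *fp)
{
    in = fp;
    cur = getc(in);
    mask = 0x80;
    return build_node();
}
```

Run over the example bytes 0x4A 0x2A 0x37 plus a hypothetical fourth byte 0x54 to finish the last leaf, this produces a root whose first child is a leaf holding 0x28 and whose second child has leaves 0x46 and 0xAA. Because the reader state lives in globals, the image data that follows the tree can be read by carrying on with `read_bit` from where the tree left off.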

Apologies
You have my apologies if this was hard to follow. It's not the simplest procedure to describe. There are lots of sites with information about Huffman coding schemes if you're really interested in it. It was a real pig for us to figure this out. I have to admit, the assembly code was tight, really damn tight. On the down side, if you make an error when coding up your Huffman tree, the game will explode. It has NO error checking or detection upon tree reconstitution. I've had the game hang a few times after altering bytes.

This has to be the single most convoluted thing we've had to work on yet. It's taken three nights of banging our heads off heavy objects to figure it out. Hopefully this will pave the way to extracting all the tilesets automatically.