String Compressor


String Compression
Almost all text strings within the game are compressed and encoded. The scheme itself at first looks similar to Huffman Coding. There is a key at the beginning, and then what seems to be variable-length encoded data after it. However, it is not. Every block of strings follows the same format. First there is a String Header, then a String Body.

To decode a particular string from the String Body, we need to use the Character Table from the accompanying String Header. Encoded string blocks appear in two main places: as the final major section of the MSQ blocks in the GAMEX files, and in Data Segment 2 of the WL.EXE program.

NOTE
There is some issue with the string compressor algorithm. The 'end-of-string' marker was thought to be a 0x00 resultant code. It's not. 0x00 means 'pause for a second'. We're still investigating this, but at the moment it's unknown how the strings terminate.

The Character Table
The character table in the Header contains a list of the characters used in the Body. Because of its layout, this restricts the map designer, or string writer, to a maximum of 60 different characters. Capitalisation is not included in this, so you can have all 26 letters, 10 numbers and 24 'others'. Because of the way the strings are encoded, two characters are 'specials', therefore you are restricted to 22 'others'. In every map viewed, this limit is not reached, so it looks like the designers chose well. For the string blocks in the WL.EXE program, the number of used characters does approach the full 60.

Here's the Character table for Highpool (offset 0x1004 from the start of MSQ block 10 in Game1): Here are some examples of the string headers. I've only included the first 25 bytes, because after this point, it's mostly control codes.

The obvious observation is to say "Aha, they've obfuscated the strings by making a character lookup table, and all the successive bytes are just offsets into the lookup table, so 'HighPool' becomes 7,6,17,7,19,2,2,10." Unfortunately, this is not so.

Letter Frequency Link?
A common reference for encoding is the average letter frequency in the English language. If you lay the most common letters out in order (as calculated for common fiction words), you get the string:


 * e t a o h n i s r d l u w m c g f y p v k b j x z q

Look familiar? :) Text strings can be the bane of people writing small-code games. Since the strings section is going to be encoded and needs to be compressed, you usually find which letters are the most frequent and represent them with fewer bits. They sort-of do this here. The first 30 characters (0->29) can be encoded using 5 bits, but the characters from 30->59 need 10 bits. This actually works because the characters from 0->29 appear FAR more often, and so the average size drops close to 5 bits per character. There are better information-theoretic compressors than this in use within the game.
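To make the 5-bit versus 10-bit cost concrete, here is a minimal sketch that counts the bits a string would need. It assumes zero-based table positions, with positions 0 to 29 costing one 5-bit code and anything higher costing two. `HYPOTHETICAL_TABLE` and `encoded_bits` are names made up for illustration; the table is not taken from any game file, and capitalisation is ignored here.

```python
# Hypothetical table for illustration only: a space, the letters in
# rough frequency order, then the digits. Not taken from a game file.
HYPOTHETICAL_TABLE = " etaohnisrdluwmcgfypvkbjxzq0123456789"


def encoded_bits(text, table=HYPOTHETICAL_TABLE):
    """Count the bits needed to encode text (lower case only)."""
    bits = 0
    for ch in text:
        index = table.index(ch)  # raises ValueError if ch is not in the table
        # Positions 0..29 fit in a single 5-bit code; higher positions
        # need an escape code plus a second 5-bit code.
        bits += 5 if index < 30 else 10
    return bits


print(encoded_bits("nose toes"))  # 45
print(encoded_bits("room 9"))     # 35
```

With this table only the digits 3 to 9 fall in the expensive half, so ordinary text stays at 5 bits per character.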

How the Compression Works
Let's assume that the Character Table in the String Header contains the following letters:

Then, you could encode the string "nose toes" as "8,3,5,1,0,2,3,1,5". These are the bit strings:

01000,00011,00101,00001,00000,00010,00011,00001,00101.

Because of the little-endianness of the data, these are then placed into consecutive bytes, starting at the least significant bit, and working LEFT. This means that when you're slurping the bytes from the file, you're once again going R->L within a byte, but L->R between bytes. I'm sure this all makes sense somewhere. Perhaps in a deep dark old Intel 4004 or 8088 manual somewhere, where some fool went "I know, let's have all the bytes in the wrong order!".
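The packing described above can be sketched as follows. This is only an illustration under stated assumptions: `pack_codes` is a hypothetical helper name, and padding the last byte with zero bits is a guess, since how the final byte is filled is not covered here.

```python
def pack_codes(codes):
    """Pack 5-bit codes into bytes, filling each byte from its
    least significant bit upwards."""
    out = bytearray()
    acc = 0    # bit accumulator; the oldest bits sit at the bottom
    nbits = 0  # number of valid bits currently held in acc
    for code in codes:
        if not 0 <= code < 32:
            raise ValueError("code does not fit in 5 bits: %r" % code)
        acc |= code << nbits  # place the new code above the pending bits
        nbits += 5
        while nbits >= 8:     # emit every completed byte
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # assumption: zero-pad the final byte
    return bytes(out)


packed = pack_codes([8, 3, 5, 1, 0, 2, 3, 1, 5])  # "nose toes"
print(packed.hex(" "))  # 68 94 00 c4 08 05
```

The first two bytes, 0x68 and 0x94, are 01101000 and 10010100 in binary.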

Now, if you read these back out and take the 5 least significant (right-most) bits each time, carrying any leftover bits forward to the next step, you get:

01101000 -> 01000 which is 8, which in the table is n; remainder 011

10010100 with 011 from previous -> 00011 which is 3, which in the table is o; remainder 100101

100101 from previous -> 00101 which is 5, which in the table is s; remainder 1

and so on, taking the 5 least-significant bits each time. The actual algorithm we used was to extract 2 bytes at once, shift the bits down until there weren't enough bits (fewer than 5) left, then load a new byte into the space, and repeat until the end of the string is reached.
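The read-back loop can be sketched like this. It is a simplified variant, not the exact routine described above: it refills one byte at a time rather than two, and because the way strings terminate is still uncertain, the caller passes the number of codes to read. `unpack_codes` and its `count` parameter are hypothetical names for illustration.

```python
def unpack_codes(data, count):
    """Read `count` 5-bit codes from `data`, least significant bits first."""
    codes = []
    acc = 0    # bit accumulator; the next code is always in the low 5 bits
    nbits = 0  # number of valid bits currently held in acc
    pos = 0    # index of the next unread byte
    while len(codes) < count:
        while nbits < 5:  # not enough bits left: load another byte
            if pos >= len(data):
                raise ValueError("ran out of data")
            acc |= data[pos] << nbits
            pos += 1
            nbits += 8
        codes.append(acc & 0x1F)  # take the 5 least significant bits
        acc >>= 5                 # keep the remainder for the next round
        nbits -= 5
    return codes


print(unpack_codes(bytes([0b01101000, 0b10010100]), 3))  # [8, 3, 5]
```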

Special Codes
There are three 'specials' which need to be kept in mind when decoding the bytes.
 * If the 5-bit value is 0x1e, this means that the next decoded char will be capitalised (0x20 is subtracted from its ASCII value to transform it from a->z to A->Z).
 * If the 5-bit value is 0x1f, this means that the next 5-bit value will be incremented by 0x1e before being looked up in the character table. In this way the full range of 60 entries in the char table is addressable using only 5 bits. Because the char table is in sorted order, only less-likely characters are at addresses 30 and above. Neat.
 * After character decoding, if the resulting character is 0x00, then this marks the end of the string (but see the NOTE above: this is now in doubt).
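The three rules above can be sketched as a small state machine over already-unpacked 5-bit codes. This is an illustration only: `decode_codes` and `HYPOTHETICAL_TABLE` are made-up names, the table contents are invented for the example, and the 0x00 terminator is implemented as the third rule states even though the NOTE above questions it.

```python
CAPITALISE = 0x1E  # the next decoded character is upper-cased
EXTEND = 0x1F      # the next code is offset by 0x1e before lookup

# Hypothetical table for illustration: 30 common characters, then "?" at
# position 0x1e, a 0x00 at position 0x1f, then the digits.
HYPOTHETICAL_TABLE = " etaohnisrdluwmcgfypvkbjxzq.,!" + "?" + "\x00" + "0123456789"


def decode_codes(codes, table=HYPOTHETICAL_TABLE):
    """Turn a sequence of 5-bit codes into text using the special codes."""
    out = []
    capitalise = False
    extended = False
    for code in codes:
        if extended:
            index = code + 0x1E  # reach table positions 30 and above
            extended = False
        elif code == CAPITALISE:
            capitalise = True
            continue
        elif code == EXTEND:
            extended = True
            continue
        else:
            index = code
        ch = table[index]
        if ch == "\x00":
            break  # end of string, per the third rule
        if capitalise:
            if "a" <= ch <= "z":
                ch = chr(ord(ch) - 0x20)  # a->z becomes A->Z
            capitalise = False
        out.append(ch)
    return "".join(out)


print(decode_codes([0x1E, 5, 7, 0, 0x1F, 4, 0x1F, 1]))  # Hi 2
```

In the example, 0x1f followed by 4 selects position 34 (the digit 2), and 0x1f followed by 1 selects position 0x1f, which holds the 0x00.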

Re-creating the String Header for encoding.
If you wanted to reconstruct the String Section, for example after altering a message to a different length, the steps you would have to go through are:
 * 1) Count the letter occurrences in your strings, and order them most frequent->least frequent.
 * 2) Insert a '0x00' at position 0x1f (for reasons explained in the String Body Section), shifting everything to the right by one byte.
 * 3) Count the number of strings, and insert this many 2-byte shorts into the header (as spacing for the pointer table).
 * 4) Encode each string one by one using the string compression algorithm defined in the strings body section.
 * 5) For each string, insert the byte offset of the first byte of its encoded form into the pointer table.
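Steps 1 and 2 can be sketched as below. Several details are assumptions rather than facts from the game: `build_char_table` is a hypothetical helper, ties between equally frequent characters are broken alphabetically, strings are lower-cased before counting (capitalisation is handled by a special code, not the table), and when there are fewer than 0x1f distinct characters the unused slots are filled with a placeholder so that the 0x00 still lands at position 0x1f. How the original tools fill unused slots is not known here.

```python
from collections import Counter

TERMINATOR_SLOT = 0x1F  # position named in step 2
MAX_TABLE_SIZE = 60     # table limit from the Character Table section


def build_char_table(strings, filler="\x00"):
    """Order characters by frequency and insert 0x00 at position 0x1f."""
    counts = Counter(ch for s in strings for ch in s.lower())
    # Most frequent first; ties broken by character value (an assumption).
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    # Assumption: pad short tables so the insert position exists.
    while len(ordered) < TERMINATOR_SLOT:
        ordered.append(filler)
    ordered.insert(TERMINATOR_SLOT, "\x00")  # shifts later entries right
    if len(ordered) > MAX_TABLE_SIZE:
        raise ValueError("too many distinct characters for the table")
    return ordered


table = build_char_table(["nose toes"])
print(table[:6])  # ['e', 'o', 's', ' ', 'n', 't']
```

Steps 3 to 5 depend on the exact header layout, so they are left out of this sketch.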