Penguin
Note: You are viewing an old revision of this page. View the current version.

Back in the 70's, the amount of data that people wanted to store started outstripping the media that people had to store it on.

Two computer scientists named A. Lempel and J. Ziv published a paper called "A Universal Algorithm for Sequential Data Compression" in the IEEE Transactions on Information Theory, May 1977. This documented an Algorithm that was referred to subsequently as LZ77.

The basis for the LZ series of algorithms is the building of a dictionary of common phrases in a file. Instead of outputting characters, phrase numbers in this dictionary can be output, achieving compression when you can output the number 2832 instead of the string "foobarbaz".

The decompressor needs nothing more than this sequence of output and a knowledge of how the dictionary is built to be able to built the dictionary itself, and therefore decompress the file.

For each string that LZ77 compresses, it outputs an index back into to the input (go back N character), a window size (copy these characters), and the first mismatching character.

Then, in 1978, they came up with a better algorithm (dubbed LZ78, that only required the compressor to output a phrase number and a mismatching character. The phrase extended with the mismatching character would then be added to the dictionary.

LZ77/78 encoding remained the de-facto standard for compression of data (along with some cool stuff like HuffmanCoding) until 1984, where Terry Welch published "A Technique for High-Performance Data Compression" in Computer magazine. He realised that with a dictionary that had all the possible characters in it already, you wouldn't have to output the mismatching character like in LZ78; you could output a phrase number that matched it. This algorithm is called LZW encoding (Lempel, Ziv, Welch).

How it works

The algorithm is really quite simple, and will be described here.

  • Start with a dictionary that contains all possible characters.
  • P refers to the previous phrase (initially empty).
  • Read a character, C.
  • Is the string P+C in the dictionary? (It is - P is empty so P+C = C)
  • Output the code number for string C to the codestream
  • Set the previous string P = C.
  • While there are still character remaining
  • Read a character
  • Is the string P+C present in the dictionary?

    • If yes, set P to P+C
    • If no,

      1. output the code word for P
      2. add the string P+C to the dictionary
      3. Set P to C
  • When we run out of characters, output the code word for the string P and end.

A really good link to LZW is http://www.rasip.fer.hr/research/compress/algorithms/fund/lz/lzw.html.

Why we only talk about it in hushed terms

Something else that is important to understand is http://www.unisys.com/about__unisys/lzw/ - Unisys own it. I believe they used to allow free usage of it until they realised "Hey! There's a lot of people using GIF, so lets assert our ownership of the patent and demand license fees from their asses."


Part of CategoryAlgorithm. (I just made it up.)