Let's look at some of the simpler lossless data compression ideas for background information. Unless you're going to design a program, there is no point in understanding it to the nth degree.

They basically take advantage of waste: unused values or inefficiently encoded raw data.

A simple example is some English plain text with each character stored in an 8-bit byte. A byte can hold 1 of 256 values. A typical text message is made up of 52 upper- and lower-case letters, 10 numbers, a dozen punctuation marks and a few control codes. Over half of the possible values are never used. In this case a 7-bit code (128 values) would be more efficient. The 7-bit codes would be packed together to fit into the 8-bit bytes, reducing the file size by ~12.5%. Not a very useful method for graphic images, but it illustrates the point.
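
A rough Python sketch of the packing step (illustrative only, and assuming plain 7-bit ASCII input):

    def pack7(text):
        # Pack 7-bit ASCII codes tightly into bytes (illustrative sketch).
        bits = ""
        for ch in text:
            code = ord(ch)
            assert code < 128, "only 7-bit ASCII fits this scheme"
            bits += format(code, "07b")        # 7 bits per character
        bits += "0" * (-len(bits) % 8)         # pad the tail to a byte boundary
        return bytes(int(bits[i:i+8], 2) for i in range(0, len(bits), 8))

    msg = "HELLO WORLD, HOW ARE YOU?"
    print(len(msg), "->", len(pack7(msg)))     # 25 characters -> 22 bytes

Every 8 characters squeeze into 7 bytes, hence the ~12.5% saving.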

Another idea is Run Length Encoding (RLE). This takes advantage of long streams of identical data.

For example AAAAAAAHHCCCCCYYYYYYRRRRRRRRRR
could be re-encoded as 7A2H5C6Y10R. Each number gives the run length, followed by the value of the run.
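
In Python the encoder is only a few lines (a sketch of the idea, not any particular file format):

    def rle_encode(data):
        # Emit <count><value> for each run of identical characters.
        out = []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1                          # extend the current run
            out.append(str(j - i) + data[i])
            i = j
        return "".join(out)

    print(rle_encode("AAAAAAAHHCCCCCYYYYYYRRRRRRRRRR"))   # 7A2H5C6Y10R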

or, in the case of binary data in a 1-bit monochrome image (fax machine),
00000011100001111111101
could be re-encoded as 634811 to indicate how many zeros and ones follow each other.
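
The run-counting step can be sketched like this (a real fax codec then replaces the counts with short codes; because the runs alternate between 0 and 1, only the lengths need storing):

    from itertools import groupby

    def bit_runs(bits):
        # Lengths of the alternating runs of 0s and 1s in a bit string.
        return [len(list(run)) for _, run in groupby(bits)]

    print(bit_runs("00000011100001111111101"))   # [6, 3, 4, 8, 1, 1]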

If you think about it, a long stream of the same value is telling us something we already know; it is redundant, and the repetition can be discarded.

There is a big disadvantage with the run-length method when the data changes very rapidly.

For example AHCYR
would be re-encoded as 1A1H1C1Y1R, which is in fact larger than the original.

Rapidly changing data contains more information, which can only be retained by using more code.
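
Using the rle_encode sketch above, the expansion is easy to demonstrate:

    print(rle_encode("AHCYR"))       # 1A1H1C1Y1R -- 5 characters become 10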

Two common 1-bit fax compressions are Group 3 and Group 4. Group 3 works in one direction only, along the row of pixels. Group 4 works in both directions and takes multiple rows into account. Although Group 4 gives higher compression ratios, it has more work to do, which makes it slower than Group 3 when compressing or decompressing.

Another idea is to look for common patterns in the raw data and replace them with shorter special codes.
An example in text data is to replace common words or parts of words (the, and, or, in, to, un, ed, for) with shorter unique codes.
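
A toy sketch of the idea (the fragment table and the one-byte codes 128-135 are made up for illustration; real schemes differ):

    # Hypothetical one-byte codes for common fragments. Values above 127
    # cannot clash with 7-bit ASCII, so the substitution stays reversible.
    DICTIONARY = {"the": 128, "and": 129, "or": 130, "in": 131,
                  "to": 132, "un": 133, "ed": 134, "for": 135}

    def substitute(text):
        out = bytearray()
        i = 0
        while i < len(text):
            for frag, code in DICTIONARY.items():
                if text.startswith(frag, i):
                    out.append(code)           # one byte replaces 2-3 characters
                    i += len(frag)
                    break
            else:
                out.append(ord(text[i]))       # pass other characters through
                i += 1
        return bytes(out)

    msg = "the cat and the dog run for the border"
    print(len(msg), "->", len(substitute(msg)))   # 38 -> 26 bytes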

The last idea involves more processing: counting the number of times each value occurs in the data. The most frequent values are assigned short codes, which get longer and longer as the frequency of the value falls.

A perfect example of this is Morse code (remember that thing with dots and dashes on the radio). The letter E, the most common character in English, is a single dot, while less common characters like J and Q may have 4 or more dots and dashes. The overall effect is to shorten the coding of a typical message.

In cases such as English text, the typical frequency of characters is well known, and a standard coding table or dictionary can be used instead of calculating a new table from scratch each time. In essence we are calculating, or predicting, the probability of values in the raw data. Huffman coding is based on this idea.
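
A compact Huffman sketch (building the table from the message itself rather than a standard dictionary; the heap keeps the two rarest groups at hand):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Repeatedly merge the two least frequent groups; every symbol in a
        # merged group gains one bit, so rare symbols end up with long codes.
        counts = Counter(text)
        heap = [[freq, n, sym] for n, (sym, freq) in enumerate(counts.items())]
        heapq.heapify(heap)
        codes = {sym: "" for sym in counts}
        n = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)
            hi = heapq.heappop(heap)
            for sym in lo[2]:
                codes[sym] = "0" + codes[sym]
            for sym in hi[2]:
                codes[sym] = "1" + codes[sym]
            heapq.heappush(heap, [lo[0] + hi[0], n, lo[2] + hi[2]])
            n += 1
        return codes

    table = huffman_codes("this is an example of huffman coding")
    for sym, code in sorted(table.items(), key=lambda kv: len(kv[1])):
        print(repr(sym), code)

The space character, the most frequent symbol in that sentence, comes out with the shortest code, just as E does in Morse.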

What is important is to match the compression method to the type of information in the data. Choosing the wrong method will either make the file bigger or give poor compression. Advanced compression programs first analyse the data to decide which method is best to use.

LZW compression is very effective and uses the more complex ideas, but unfortunately its use required a licence, so it is not to be found in your cheap and cheerful photo-editing software.