3.3.8 Data Compression


1 Data Compression

Learn It

  • Files, especially graphic and audio files, can become very large. Here are some interesting facts:
    • On Facebook more than 350 million images are uploaded each day. On Snapchat 8,796 images are uploaded per second. It has been estimated that over 3.2 billion images are uploaded to social media sites each day!
    • Every day millions of audio and video files are being downloaded.
  • When data is transmitted across the Internet, it will go through many different physical links between routers.
  • The connection from a computer or a LAN into the Internet is likely to be the slowest part of this route, as you probably know from experience.
  • At home you may have quite a slow network connection and it may take a while for web pages to load.
  • One way of speeding the rate at which files can be transmitted across the Internet is to compress them to make them smaller. Smaller files take less time to transmit over a network.
  • Understanding how compression affects files is important, as the type of compression selected will affect how an image looks or how an audio track sounds. The final use of the file dictates how much you can compress it and still have a file that is usable.

  • To summarise, compression is used in order to:
    1. Reduce the amount of storage needed on a computer to save files.
    2. Allow large files to be transmitted as an email attachment; many email servers limit the size of a file that can be sent and compression can reduce the file size to allow users to send it.
    3. Allow a file to be transmitted in less time, due to the smaller file size.
    4. Reduce congestion on the Internet through smaller files.
    5. Make audio and video files suitable for streaming.

compression.png

2 Types of Compression

Learn It

Lossy Compression

  • Lossy compression is a data encoding method where files are compressed by removing some of the detail. For example, photographs can be stored using fewer colours, so fewer bits are needed per pixel.
  • This type of compression is used to compress images, audio and video files.
  • Lossy compression cannot be used for text files or program files: a book with many missing words would be unreadable!
  • But lossy compression can be used for graphic and audio files as they contain much data that can be discarded.
  • The most commonly used compression technique for graphic files was developed by the Joint Photographic Experts Group and produces JPEG files with the extension .jpg.
  • Here are two photographs to show the differences:

jpg_gif_image.jpg


Lossless Compression

  • With a lossless data encoding method, a file is compressed but no data is lost and the file can be decompressed with all of its information intact.
  • For example, bank records must keep all of the data; you cannot transmit a bank statement and miss out a few zeros because they don't matter too much!
  • It could be used to compress data files, for example by Zipping them using a utility program such as WinZip, before attaching them to an email.
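
As a small illustration of the lossless idea, here is a minimal Python sketch using the built-in zlib module, which implements DEFLATE, the same family of compression used by the ZIP format; the example text is made up for this demonstration.

    import zlib

    # Repetitive data compresses well, so repeat a sentence to make the point
    text = "Bank records must keep every digit, so a lossless method is needed. " * 20

    original = text.encode("utf-8")        # compression works on bytes, not str
    compressed = zlib.compress(original)
    restored = zlib.decompress(compressed)

    print(len(original), "bytes before,", len(compressed), "bytes after")
    print(restored == original)            # True: every byte comes back intact
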
The size of an image file depends on the colour depth and dimensions. The size of an audio
file depends on the sample rate and bit depth. The size of an image file and an audio file
can be very large.
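
To see why these files get so large, here is a minimal sketch of the size calculations in Python; the image dimensions, colour depth, sample rate, bit depth and duration are made-up example values chosen only for illustration.

    # Uncompressed image: width x height pixels, colour_depth bits per pixel
    width, height, colour_depth = 800, 600, 24
    image_bits = width * height * colour_depth
    print("Image:", image_bits // 8, "bytes")       # 1,440,000 bytes (about 1.4 MB)

    # Uncompressed audio: samples per second x bits per sample x channels x seconds
    sample_rate, bit_depth, channels, seconds = 44100, 16, 2, 60
    audio_bits = sample_rate * bit_depth * channels * seconds
    print("Audio:", audio_bits // 8, "bytes")       # 10,584,000 bytes (about 10.6 MB)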

Try It

Complete the following table showing the different file types, file extensions and compression types used for different file formats.

Type | File Suffix | Compression Type | Explanation
Bitmap | .bmp | - | Uncompressed still image file
JPEG |  |  | Good for photos, colour depth = 24 bits, RGB, 16.7 million colours
Graphic Interchange Format |  | Lossless |
MP3 | .mp3 | Lossy |

3 Run Length Encoding (RLE)

Learn It

  • Run Length Encoding (RLE) is a simple form of lossless data compression which works by reducing the physical size of a repeating sequence of data: runs of the same value are stored as frequency/data pairs.

  • This repeating string, called a run, is encoded into two bytes.
  • The first byte represents the number of characters in the run and the second gives the character.
  • For example, let's look at the following string:
    • aaabbbbbbccccccccc
    • This string length is 18 bytes.
    • Using run length encoding, this could be compressed to:
    • 3a6b9c
    • This string has been reduced to 6 bytes using RLE.

  • RLE provides very good compression ratios where there are long runs of one particular value like the following 1-bit black and white image:

rle.png

  • When each pixel is represented by a letter, the size of the file is 64 bytes: 8 bytes per line.
  • Using Run Length Encoding reduces the file size of this image from 64 bytes to 48 bytes, as the sketch below shows.
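
To make the frequency/data pairs concrete, here is a minimal run length encoder sketched in Python; the function name rle_encode is made up for this example, and single-digit run counts are assumed, which is enough for the string above.

    def rle_encode(data):
        """Encode a string as count/character pairs, e.g. 'aaab' -> '3a1b'."""
        encoded = ""
        i = 0
        while i < len(data):
            run_char = data[i]
            run_length = 1
            # Count how far the current run of identical characters extends
            while i + run_length < len(data) and data[i + run_length] == run_char:
                run_length += 1
            encoded += str(run_length) + run_char
            i += run_length
        return encoded

    print(rle_encode("aaabbbbbbccccccccc"))   # 3a6b9c: 6 bytes instead of 18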

4 Huffman Coding

Learn It

  • Huffman coding, also known as Huffman encoding or Huffman compression, is an algorithm for lossless compression based on the frequency of the characters or symbols in the file.
  • It ensures that the more common characters are represented by fewer bits than the less common characters, which need more bits to identify them.
  • Therefore the overall size of the file is reduced.
  • It is therefore called a variable-length coding system because the codes for different characters have different lengths.
  • Huffman coding uses a structure called a Binary Tree, which consists of a root node and a number of nodes and branches as shown in the diagram below:

huff_tree1.png

Try It

Constructing a Huffman Tree

  • For this example we will construct a Huffman Tree for the sentence: BOURNE_GRAMMAR.
  • Step 1: Draw a table showing the frequency of each character, including spaces. For example, there is one "B", one "E" and three "R"s in the sentence.
Character B E G N O U SPACE A M R
Frequency 1 1 1 1 1 1 1 2 2 3
  • (Check that the frequencies add up to 14 characters, including the space, as in the given sentence.) Ensure that you write the frequencies in ascending order as shown above. Currently the total bit value for the uncompressed sentence is 14 chars x 7 bits = 98 bits; a short Python check of these figures is sketched below.
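
As a quick check, the few lines of Python below count the character frequencies and the uncompressed size; they use only the standard library.

    from collections import Counter

    sentence = "BOURNE GRAMMAR"               # 14 characters, including the space
    frequencies = Counter(sentence)            # how many times each character appears

    # Print the frequencies in ascending order, as in the table above
    for character, count in sorted(frequencies.items(), key=lambda item: item[1]):
        print(character, count)

    print("Uncompressed:", len(sentence) * 7, "bits")   # 14 chars x 7 bits = 98 bits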

  • Step 2: Combine pairs of frequencies, always choosing the pair that gives the smallest combined frequency. For example, B and E each have a frequency of 1, so B and E have a combined frequency of 2.
  • Create a node containing the combined frequency of this first pair (2), with a branch for each of B and E.

tree1.png

  • Step 3: The next smallest frequency pair is G and N, which each also have a frequency of 1. These are combined into a node with a value of 2, which can then be grouped with the previous B-E node to form a new node with a value of 4.

tree2.png

  • Step 4: We have a choice when placing U and SPACE: they can be paired under the previous subtree or form a new subtree. To keep the tree balanced we will create a new subtree with a frequency value of 2.

tree4.png

  • Step 5: The next pair is O and A, with frequencies of 1 and 2. These can be grouped to form a subtree with a value of 3, which can then be grouped with the previous subtree to form a new node with a value of 5.

tree5.png

  • Step 6: We have another choice when placing M, which has a frequency value of 2: it can be paired with either of the two subtrees or with a new node. To keep the tree balanced we will create a new node with a frequency value of 7.

tree6.png

  • Step 7: We have another choice when placing R, which has a frequency value of 3: it can be paired with either of the two subtrees or with a new node. To keep the tree balanced we will create a new node with a frequency value of 7.

tree3a.png

  • Step 8: We need to add a root node to join the two subtrees together.

tree7.png

  • Step 9: The tree is now complete. You can now label each branch as shown below, with the left branches labelled 0 and the right branches labelled 1.

tree8.png

Using a Huffman Tree, the coding for each character is derived from the path taken from the
root node to the character. Branching left at a node is coded 0, branching right is coded 1.
Notice that the characters that occur most frequently are nearer the top of the tree and therefore require fewer bits to encode them.
Therefore the character 'R' would be represented by the bit pattern 01 because from the top
of the tree, you go left, then right to reach 'R'.
The encoding for 'G' would be 0010 and for 'B', 0000.
The total number of bits needed to represent the sentence 'BOURNE GRAMMAR' would be
14 chars x 7-bits = 98 bits using 7-bit ASCII.
Using Huffman encoding, the number of bits required would be 46, a saving of 52 bits, or a 53% reduction in size.
  • We can reassemble our sentence with the Huffman encoded binary values below each character. Counting the bits gives 46 bits, compared with the original 98 bits: a reduction of 53%. The sketch after the table checks this in Python.
Sentence B O U R N E SPACE G R A M M A R
Binary 0000 1110 1100 01 0011 0001 1101 0010 01 1111 10 10 1111 01
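
As a check on the worked example, the sketch below stores the code table read off the finished tree, encodes the sentence and counts the bits, then decodes them again; it is only a sketch of how the codes are used, since the tree itself was built by hand above.

    # Huffman codes read off the finished tree for 'BOURNE GRAMMAR'
    codes = {
        'B': '0000', 'E': '0001', 'G': '0010', 'N': '0011', 'R': '01',
        'M': '10',   'U': '1100', ' ': '1101', 'O': '1110', 'A': '1111',
    }

    sentence = "BOURNE GRAMMAR"
    encoded = "".join(codes[character] for character in sentence)
    print(encoded, "-", len(encoded), "bits")   # 46 bits, against 14 x 7 = 98 in ASCII

    # No code is the start of another (a prefix code), so the bit string can be
    # decoded unambiguously by reading it one code at a time.
    reverse = {code: character for character, code in codes.items()}
    decoded, current = "", ""
    for bit in encoded:
        current += bit
        if current in reverse:                  # a complete code has been read
            decoded += reverse[current]
            current = ""
    print(decoded)                              # BOURNE GRAMMAR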

Huffman Coding Explained

  • The following video explains how a Huffman Tree works:

Badge It

Silver: For the following 1-bit graphic, calculate:

  • (a) The result of applying a run length encoding algorithm.
  • (b) The original file size and the size after applying run length encoding.

rle_silver.png

Badge It

Gold: The following Huffman Tree represents the text: 'HELEN FEEDS THE EELS'.

huff_badge.jpg

  • (a) Complete the table showing the Huffman coding for S, T and SPACE.
Character Huffman Coding
S  
T  
SPACE  
  • (b) What does the following code represent?
  • 0101 00 0100 011 110 100 00 101 101 110 011 1111 00 101 101

Badge It

  • Platinum: Using Huffman coding, the sentence 'HELEN FEEDS THE EELS' can be stored in 57 bits.
  • (a) Calculate the number of bits that would be needed to store the sentence in ASCII.
  • A frequency table for characters in a document is shown below:
Character A B C D E
Frequency 20 15 9 8 5
  • (b) Create a Huffman Tree for this set of characters.

Validate