Course: CIS451/651 Data Compression in Multimedia
Professor: Paul D. Amer
Semester: Spring 2009
Title: Project 2 - Simple Run Length Encoding/Decoding
Due Date:

Description

(40 points) Company "UD-Banners-and-Art" finds it necessary to store billions of ASCII art files and banner-like posters (e.g., see unix command "banner" to see how to make a banner using only a single character ). To save disk space, a promising CISC451/651 student proposes the following simple run length coding scheme. For a given ASCII input file, each "run" of identical characters in an original file is compressed into a 'RunLength - Character' pair. In the following example, an original file contains 33 bytes including 2 bytes for blanks and 1 byte for a newline character indicated by "<nl>". Count the 33 bytes below:

original file named “example.txt” : AAAbbbbbbb YYYYYYYYYYYY6XXXY!!!<nl>

Using the following self-describing compressed file organization specification:

typedef struct

{

ASCII-OCTET Signature[7]; /* header signature (always ASCII "amerRLE") */

ASCII-OCTET Version; /* amerRLE format version (initially ASCII "1")*/

ASCII-OCTET AlphabetLength[3]; /* size alphabet (ASCII "001" to "128") */

ASCII-OCTET Alphabet[AlphabetLength]; /* input alphabet */

struct

{

ASCII-OCTET RunLength; /* # of times Character repeats ("1" to "9")*/

ASCII-OCTET Character; /* character repeated RunLength times */

} DataPair[]; /* variable number of pairs; different for */

/* each file being compressed */

} AMERRLE;


the full compressed file “example.txt.ARLE” would contain 39 bytes and look like (remember <nl> represents 1 byte):

amerRLE1008Ab Y6X!<nl>3A7b2 9Y3Y163X1Y3!1<nl>

If you were to print the file “example.txt.ARLE”, you would see 3 lines of output (note: the third line is blank):

amerRLE1008Ab Y6X!
3A7b2 9Y3Y163X1Y3!1

For this example, the original 33 byte file gets into 39 bytes (ie, negative compression!)

Tasks

  1. Write well-commented, perfectly-indented Java programs: amerRLE_encode and amerRLE_decode. After discussing the general concepts of this compression method, students working in a group of 2 should have one person program amerRLE_encode, and the other person program amerRLE_decode. Try not to interact until each program has been initially debugged. Then interact to get both programs working properly. I expect that variable names, indentation style, etc., will differ significantly between the coder and decoder since each is written by a different programmer.
  2. amerRLE_encode takes a single argument - a <filename> to be compressed. As output, amerRLE_encode produces <filename>.ARLE If the output file already exists, prompt the user to see if the output file should be overwritten, or simply abort compression and output a warning. If the input file is an invalid file for being compressed, output a warning "Error: file cannot be compressed because <reason>" and abort. An input file would be invalid if, for example, it contains any byte value in the range 128-255.
  3. amerRLE_decode takes a single argument - a <filename>.ARLE to be decompressed. As output, amerRLE_decode produces <filename>.RECONSTRUCTED If the output file already exists, prompt the user to see if the output file should be overwritten, or simply abort decompression and output a warning. The input file to be uncompressed MUST have the extension ".ARLE" and its format MUST be valid. If the input is not a valid amerRLE compressed file, output a meaningful error message, and abort decompression. For example, a file to be uncompressed would be invalid if the signature or version is incorrect, or if any DataPair.Character is not defined in the Alphabet, or if the DataPair.RunLength is not within "1"-"9". Other examples of invalid files exist. Think of others, and carefully document which errors your code tests.
  4. Strong Recommendation: before beginning any programming, make up examples by hand of what you think are valid and invalid .ARLE files. Then use these files to help debug your programs.

What to Turn In

  1. (submit) Use the unix command “script” to show your programs work correctly:
      • cat <testfile>
      • amerRLE_encode <testfile>
      • cat <testfile>.ARLE
      • amerRLE_decode <testfile>.ARLE
      • cat <testfile>.RECONSTRUCTED
      • diff <testfile> <testfile>.RECONSTRUCTED
      • ls -l <testfile>*
  1. (submit) Written performance evaluation report (maximum 2 pages - strictly enforced!). Evaluate the performance of amerRLE compression as a compression scheme for ASCII art and banner files. (You are not evaluating amerRLE as a compression scheme for all files, only as a scheme for ASCII art and banner files.) Your evaluation should be written at a technical level. Do NOT restate the assignment description in your report. You may use some of my test files containing ASCII art. IMPORTANT: You must find or create at least 5 additional ASCII art and banner files to use in your evaluation. (Check out http://ascii.mastervb.net/ and http://www.glassgiant.com/ascii/) In your report, provide a summary table of at least 20 uncompressed vs. compressed files.
  2. (submit) In your report, discuss the theoretical best (i.e., maximum or most positive) and worst (i.e., minimum or most negative) compression that can possibly result from amerRLE encoding a file with N characters. (Ignore the fixed overhead for the signature, version, alphabet length and alphabet.)
  3. (submit) For this assignment, after a full effort at debugging and testing, each group should test their code with one other group as follows. Group A encodes a file which is then decoded by Group B. Is the decoded .RECONSTRUCTED file the same as the original? Then reverse roles using Group B's encoder and Group A's decoder. Use the unix command ‘diff’ to verify that the original and .RECONSTRUCTED files are identical. In your report, document with which group you tested your code, and what happened (i.e., did all code work the first time? if not, what was the problem? Be honest in documenting what happens; there is no grade penalty for programs that did not initially work correctly.)

Notes

  1. If a character run is of length 10 or more, compress the run into multiple data pairs.
  2. An uncompressed file may contain any of the valid 128 ASCII characters. Remember that <nl> is a valid ASCII character. The compressed file's alphabet should consist of one instance of each character in the uncompressed file.
  3. Programs must be written in Java. If you do not know Java, now is the time to learn.
  4. The AMERRLE struct specified above is an abstract pseudo-code specification to help each student understand the assignment; it is not a required item for your implementation. (The variable names are strongly recommended.)
  5. Students are expected to work in groups of 2 unless given the instructor's permission to work alone. Working together means just that; during the coding/debugging/testing, both students are physically working on the assignment in the same room. Each group should turn in one listing and evaluation. Each student must be able to explain all code if asked to do so by the professor or TA. If a student cannot explain both the coder and decoder programs, the student will receive no credit for the assignment.
  6. Grading will be based on: