Course: CIS451/651 Data Compression in Multimedia
Professor: Paul D. Amer
Semester: Spring 2013
Title: Project 2 - Simple Run Length Encoding/Decoding

Description

(40 points) Company "UD-Banners-and-Art" finds it necessary to store millions of ASCII art files and banner-like posters (e.g., see unix command "banner" to see how to make a banner using only a single character ). To save disk space, a promising CISC451/651 student proposes the following simple run length coding scheme. For a given ASCII input file, each "run" of identical characters in an original file is compressed into a 'RunLength - Character' pair. In the following example, an original file contains 33 bytes including 2 bytes for blanks and 1 byte for a newline character indicated by "<nl>". Count the 33 bytes below:

Consider an original file example.txt containing:

AAAbbbbbbb YYYYYYYYYYYY6XXXY!!!<nl>

Using the following self-describing compressed file organization specification:

typedef struct

{

ASCII-OCTET Signature[7]; /* header signature (always ASCII "amerRLE") */

ASCII-OCTET Version; /* amerRLE format version (initially ASCII "1")*/

ASCII-OCTET AlphabetLength[3]; /* size alphabet (ASCII "001" to "128") */

ASCII-OCTET Alphabet[AlphabetLength]; /* input alphabet */

struct

{

ASCII-OCTET RunLength; /* # of times Character repeats ("1" to "9")*/

ASCII-OCTET Character; /* Character repeated RunLength times */

} DataPair[]; /* variable number of pairs; different for */

/* each file being compressed */

} AMERRLE;

The compressed file example.ARLE including metadata would contain 39 bytes and look like (remember <nl> represents 1 byte):

amerRLE1008Ab Y6X!<nl>3A7b2 9Y3Y163X1Y3!1<nl>

If you were to edit or print the file example.ARLE, you would see (note: there is a <nl> at the end of the 2nd line):

amerRLE1008Ab Y6X!
3A7b2 9Y3Y163X1Y3!1

For this example, the original 33 byte file gets compressed into 19 bytes of metadata plus 20 bytes of compressed datapairs for a total of 39 bytes, i.e., overall negative compression!

Tasks

Your solution may be programmed in any language. If you intend to do the extra credit Project 3, then you need to do Project 2 in Java. Project 3 converts your Project 2 decoder into a Java applet. The following requirements are specified in a Java context.
Write well-commented, perfectly-indented Java programs: amerRLE_encode and amerRLE_decode. For students working in a group of 2,one person should program amerRLE_encode, and the other person program amerRLE_decode. Try not to interact until each program has been initially debugged. Then interact to get both programs working properly. I expect that variable names, indentation style, etc., will differ significantly between the coder and decoder since each is written by a different programmer.
amerRLE_encode takes a single argument - a <filename> to be compressed. As output, amerRLE_encode produces <filename>.ARLE If the output file already exists, prompt the user to see if the output file should be overwritten. If the input file is an invalid file for being compressed, output a meaningful error message such as "Error: file cannot be compressed because <reason>" and abort. An input file would be invalid if, for example, it contains any byte value in the range 128-255.
amerRLE_decode takes a single argument - a <filename>.ARLE to be decompressed. As output, amerRLE_decode produces <filename>.RECONSTRUCTED If the output file already exists, prompt the user to see if the output file should be overwritten. The input file to be uncompressed MUST have the extension ".ARLE" and its format MUST be valid. If the input is not a valid amerRLE compressed file, output a meaningful error message specifying exactly where the input file is incorrect, and abort decompression. For example, a file to be uncompressed would be invalid if the signature or version is incorrect, or if any DataPair.Character is not defined in the Alphabet, or if the DataPair.RunLength is not within "1"-"9". Other examples of invalid files exist. Think of others, and carefully document which errors your code tests.
Recommendation: before any programming, make up examples by hand of what you think are valid and invalid .ARLE files. Then later use these files to help debug your programs. For groups of 2, each person should generate test files for the other person's program.

What to Submit

(submit) Use the unix command “script” (or something analogous) to document that your programs work correctly:

script
print (e.g., cat) both programs (e.g., cat amerRLE_encode.java)
compile both programs (e.g., javac amerRLE_decode.java)
list the directory showing creation date/time/size of compiled code (e.g., ls -l amerRLE*)
for 3 different testfiles, repeat: (the first of the 3 testfiles must be apple.txt)

cat <testfile>
amerRLE_encode <testfile>
cat <testfile>.ARLE
amerRLE_decode <testfile>.ARLE
cat <testfile>.RECONSTRUCTED
diff <testfile> <testfile>.RECONSTRUCTED

list directory showing creation date/time/size of files (e.g., ls -l *.txt *.ARLE)
run your decoder on the following specific test files. Print the decoded file if possible after each decoding.:
end script

WARNING!: Be careful copying, moving, cutting-and-pasting test files. Most file transfer programs have the annoying habit of introducing/deleting newline characters when transferring files between Windows (uses cr-lf at the end of every line of textfiles) and Unix (uses only lf at the end of every line of textfiles) systems using ASCII mode (often the default). Using ftp in "binary" mode avoids this problem. Verify that after your transfer that the numbers of bytes in the original and copied files are equal. For example, apple.txt should have exactly 1005 bytes.
(submit) Written performance evaluation report (maximum 2 pages - strictly enforced!). Evaluate the performance of amerRLE compression as a compression scheme for ASCII art and banner files. (You are not evaluating amerRLE as a compression scheme for all files, only as a scheme for ASCII art and banner files.) Your evaluation should be written at a technical level. Do NOT restate the assignment description in your report.
In your performance study, you MUST include at least 5 of my .txt test files including: apple.txt, hello_world.txt, mona_lisa.txt, paul_amer.txt, and penquin.txt. IMPORTANT: You MUST also find or create at least 5 additional ASCII art and banner files to use in your evaluation. You MUST also make up test .ARLE files; only my specific test .ARLE files mentioned above are available. The others are meant to be protected. (Check out http://ascii.mastervb.net/ and http://www.glassgiant.com/ascii/) In your report, provide a summary table of at least 20 uncompressed vs. compressed files including my test .txt files.
(submit) In your report, discuss the theoretical best (i.e., maximum or most positive) and worst (i.e., minimum or most negative) compression that can possibly result from amerRLE encoding a file with N characters. Ignore the fixed overhead for the metadata. Be as general as you can in describing files that would produce these best and worst results.
(submit) For this assignment, after a full effort at debugging and testing, each student or group should test their code with one other student or group as follows. Group A encodes a file which is then decoded by Group B. Is the decoded .RECONSTRUCTED file the same as the original? Then reverse roles using Group B's encoder and Group A's decoder. Use the unix command ‘diff’ to verify that the original and .RECONSTRUCTED files are identical. In your report, document with which group you tested your code, and what happened (i.e., did all code work the first time? If not, what was the problem? Be honest in documenting what happens; there is no grade penalty for programs that did not initially work correctly.)

Notes

If a character run is of length 10 or more, compress the run into multiple DataPairs.
An uncompressed file may contain any of the valid 128 ASCII characters. Remember that <nl> is a valid ASCII character. The compressed file's alphabet should consist of one instance of each character in the uncompressed file.
The AMERRLE struct specified above is an abstract pseudo-code specification to help each student understand the assignment; it is not a required item for your implementation. The variable names are strongly recommended.
See the syllabus for the lateness and academic honest policies.
Grading will be based on:

Code correctness and testing (60%)
Code documentation and readability (see Amer's Coding Suggestions) (25%)

program uses meaningful variable names (meaningless generic names such as {x, y, z, data, count, flag, stop, input, output, infile, outfile, value, returnValue, calculatedValue, answer, number1, number2} will result in loss of credit
programs's indentation matches the code's control flow
program code is explained with meaningful comments
Note: source listings submitted with lines that wraparound when printed will lose points, and in extreme cases may be returned without grading

Performance evaluation report (15%)