Course: CIS451/651 Data Compression in Multimedia
Professor: Paul D. Amer
Semester: Spring 2013
Title: Project 2 - Simple Run Length Encoding/Decoding
Description
(40 points) Company "UD-Banners-and-Art" finds it necessary to
store millions of ASCII art files and banner-like posters (e.g., see unix command "banner" to see how to make a banner
using only a single character ). To save disk
space, a promising CISC451/651 student proposes the following simple run length
coding scheme. For a given ASCII input file, each "run" of identical
characters in an original file is compressed into a 'RunLength
- Character' pair. In the following example, an original file contains 33
bytes including 2 bytes for blanks and 1 byte for a newline character indicated
by "<nl>". Count the 33 bytes
below:
Consider an original file
example.txt containing:
AAAbbbbbbb YYYYYYYYYYYY6XXXY!!!<nl>
Using the following self-describing compressed file
organization specification:
typedef struct
{
ASCII-OCTET Signature[7];
/* header signature (always ASCII "amerRLE")
*/
ASCII-OCTET
Version;
/* amerRLE format version (initially ASCII "1")*/
ASCII-OCTET AlphabetLength[3]; /* size
alphabet (ASCII "001" to "128")
*/
ASCII-OCTET Alphabet[AlphabetLength];
/* input alphabet
*/
struct
{
ASCII-OCTET RunLength; /*
# of times Character repeats ("1" to "9")*/
ASCII-OCTET Character; /* Character repeated RunLength times */
} DataPair[]; /*
variable number of pairs; different for */
/* each file being compressed
*/
} AMERRLE;
The compressed file example.ARLE
including metadata would contain 39 bytes and look like (remember <nl> represents 1 byte):
amerRLE1008Ab
Y6X!<nl>3A7b2 9Y3Y163X1Y3!1<nl>
If you were to edit or print the file
example.ARLE,
you would see (note: there is a <nl> at the end of the 2nd line):
amerRLE1008Ab Y6X!
3A7b2 9Y3Y163X1Y3!1
For this example, the original 33 byte file gets compressed into 19 bytes of metadata plus 20 bytes of compressed datapairs for a total of 39 bytes, i.e., overall negative compression!
Tasks
- Your solution may be programmed in any language.
If you intend to do the extra credit Project 3, then you need to do Project 2 in Java.
Project 3 converts your Project 2 decoder into a Java applet.
The following requirements are specified in a Java context.
- Write well-commented,
perfectly-indented Java programs: amerRLE_encode
and amerRLE_decode.
For students working in a group of 2,one person should program amerRLE_encode, and the
other person program amerRLE_decode.
Try not to interact until each program has been initially
debugged. Then interact to get both
programs working properly. I expect
that variable names, indentation style, etc., will differ significantly
between the coder and decoder since each is written by a different
programmer.
- amerRLE_encode takes a single argument -
a <filename> to be compressed.
As output, amerRLE_encode
produces <filename>.ARLE If the output file
already exists, prompt the user to see if the output file should be
overwritten. If
the input file is an invalid file for being compressed, output a meaningful error message
such as "Error: file cannot be compressed because <reason>" and
abort. An input file would be invalid if, for example, it contains
any byte value in the range 128-255.
- amerRLE_decode takes a single argument -
a <filename>.ARLE to be decompressed. As output, amerRLE_decode
produces <filename>.RECONSTRUCTED If the output
file already exists, prompt the user to see if the output file should be
overwritten. The input file
to be uncompressed MUST have the extension ".ARLE" and
its format MUST be valid. If the input is not a valid amerRLE compressed file, output a meaningful error
message specifying exactly where the input file is incorrect,
and abort decompression. For example, a file to be
uncompressed would be invalid if the signature or version is incorrect, or
if any DataPair.Character is not defined in the
Alphabet, or if the DataPair.RunLength is not
within "1"-"9". Other examples of invalid files
exist. Think of others, and carefully document which errors your
code tests.
- Recommendation: before
any programming, make up examples by hand of what you think are
valid and invalid .ARLE files. Then later use these files to help
debug your programs. For groups of 2, each person should generate test files
for the other person's program.
What to Submit
- (submit) Use the unix
command “script” (or something analogous) to document that your programs work correctly:
- script
- print (e.g., cat) both programs (e.g., cat amerRLE_encode.java)
- compile both programs (e.g., javac amerRLE_decode.java)
- list the directory showing creation date/time/size of compiled code (e.g., ls -l amerRLE*)
- for 3 different testfiles, repeat: (the first of the 3 testfiles must be
apple.txt)
- cat <testfile>
- amerRLE_encode <testfile>
- cat <testfile>.ARLE
- amerRLE_decode <testfile>.ARLE
- cat <testfile>.RECONSTRUCTED
- diff <testfile> <testfile>.RECONSTRUCTED
- list directory showing creation date/time/size of files (e.g., ls -l *.txt *.ARLE)
- run your decoder on the following specific test files. Print the decoded file if possible after each decoding.:
- end script
- WARNING!: Be careful copying, moving, cutting-and-pasting
test files. Most file transfer programs have the annoying habit of introducing/deleting newline characters when transferring files between
Windows (uses cr-lf at the end of every line of textfiles) and Unix (uses only lf at the end of every line of textfiles) systems using ASCII mode (often the default). Using ftp in "binary" mode avoids this problem. Verify that after your transfer that the numbers of bytes in the original and copied files are equal.
For example, apple.txt should have exactly 1005 bytes.
- (submit) Written performance
evaluation report (maximum 2 pages - strictly enforced!).
Evaluate the performance of amerRLE
compression as a compression scheme for ASCII art and banner files.
(You are not evaluating amerRLE as
a compression scheme for all files, only as a scheme for ASCII art
and banner files.) Your evaluation should be written at a technical
level. Do NOT restate the assignment description in your
report.
- In your performance study, you MUST include at least 5 of my
.txt test files
including: apple.txt, hello_world.txt, mona_lisa.txt, paul_amer.txt, and penquin.txt.
IMPORTANT: You MUST also find or create at least 5 additional ASCII art and banner
files to use in your evaluation. You MUST also make up test .ARLE files; only my specific test .ARLE files mentioned above are available. The others are meant to be protected.
(Check out http://ascii.mastervb.net/ and http://www.glassgiant.com/ascii/)
In your report, provide a summary table of at least 20 uncompressed vs. compressed files including my test .txt files.
- (submit) In your report, discuss the
theoretical best (i.e., maximum or most positive) and worst (i.e., minimum or most negative)
compression that can possibly result from amerRLE encoding a file with N
characters. Ignore the fixed overhead for the metadata. Be as general as you can in describing files that
would produce these best and worst results.
- (submit) For this assignment, after
a full effort at debugging and testing, each student or
group should test their code
with one other student or group as follows. Group A encodes a file which is
then decoded by Group B. Is the decoded .RECONSTRUCTED file the same as the original? Then reverse roles
using Group B's encoder and Group A's decoder. Use the unix command ‘diff’
to verify that the original and .RECONSTRUCTED files are
identical. In your report, document
with which group you tested your code, and what happened (i.e., did all
code work the first time? If not, what was
the problem? Be honest in documenting what happens; there is no
grade penalty for programs that did not initially work correctly.)
Notes
- If a character run is of
length 10 or more, compress the run into multiple DataPairs.
- An uncompressed file may contain
any of the valid 128 ASCII characters. Remember that <nl> is a valid ASCII
character. The compressed file's alphabet should consist of
one instance of each character in the uncompressed file.
- The AMERRLE struct specified above is an abstract pseudo-code
specification to help each student understand the assignment; it is not a
required item for your implementation. The variable names are strongly recommended.
- See the syllabus for the lateness and academic honest policies.
- Grading will be based on:
- Code correctness and
testing (60%)
- Code documentation
and readability (see Amer's Coding
Suggestions) (25%)
- program uses meaningful variable
names (meaningless generic names such as {x, y, z, data, count, flag, stop, input, output, infile, outfile, value, returnValue, calculatedValue, answer, number1, number2} will result in loss of credit
- programs's indentation matches the code's control flow
- program code is explained with meaningful comments
- Note: source listings submitted with lines
that wraparound when printed will lose points, and in extreme cases may be returned without grading
- Performance evaluation report (15%)