my_compress {-LZW | -LZY | -H | -D} -I filename -O filename

Exactly one of -LZW, -LZY, -H, or -D must be specified. The -I and -O arguments are required.
The arguments have the following meaning. Each algorithm is described below.
-LZW
Compress with Lempel-Ziv using a fixed 12 bit table.
-LZY
Compress with Lempel-Ziv using a variant of your
choosing.
-H
Compress using Huffman encoding.
-D
Decompress any of the above 3 formats. Note that the
compression scheme used will not be specified on the command line.
-I filename
The file to be compressed or uncompressed.
-O filename
The output file to be written to.
Your program may not create or modify any file other than the output file.
Note: In order for the decompress option to work with any of the three possible compression algorithms, the output files must encode a message explaining how they were encoded. We require that this be done by starting each compressed file with a different 4-letter sequence of ASCII characters. This is discussed in more detail below.
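For example, the -D code might begin by reading those 4 bytes to decide which decoder to run. A minimal sketch, assuming hypothetical identifier strings (the actual 4-letter sequences are yours to choose):

#include <fstream>
#include <string>

// Read the 4-character identifier from the front of a compressed file.
// The values compared against ("LZW1", "LZY1", "HUF1") are made-up
// examples, not required by the assignment.
std::string read_identifier(std::ifstream& in) {
    char id[4];
    in.read(id, 4);
    return std::string(id, 4);   // e.g. if (read_identifier(in) == "LZW1") ...
}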
You may not use or copy code from any source other than our text, the STL, and the standard C/C++ libraries. You should clearly cite any code you use from our text.
The following section is from the comp.compression FAQ. We suggest you at least scan the whole entry for question #70 in part 2. Much of the FAQ is actually quite interesting. This part is written by Peter Gutmann. Other links include an animation of the LZ algorithm and a more academic treatment of the LZ algorithms.
LZW starts with a 4K dictionary, of which entries 0-255 refer to individual bytes, and entries 256-4095 refer to substrings. Each time a new code is generated it means a new string has been parsed. New strings are generated by appending the current character K to the end of an existing string w. The algorithm for LZW compression is as follows:

set w = NIL
loop
    read a character K
    if wK exists in the dictionary
        w = wK
    else
        output the code for w
        add wK to the string table
        w = K
endloop

Input string: /WED/WE/WEE/WEB

Character input:  Code output:  New code value and associated string:
/W                /             256 = /W
E                 W             257 = WE
D                 E             258 = ED
/                 D             259 = D/
WE                256           260 = /WE
/                 E             261 = E/
WEE               260           262 = /WEE
/W                261           263 = E/W
EB                257           264 = WEB
*END*             B
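As a rough illustration, here is one way that loop might look in C++. This is a hedged sketch, not required code: the names are ours, codes are returned as plain ints, and packing each code into 12 bits is left to a separate step.

#include <map>
#include <string>
#include <vector>

// Sketch: LZW compression returning codes as ints (bit packing omitted).
std::vector<int> compress_lzw(const std::string& input) {
    std::map<std::string, int> table;
    for (int i = 0; i < 256; ++i)                  // entries 0-255: single bytes
        table[std::string(1, static_cast<char>(i))] = i;
    int next_code = 256;                           // entries 256-4095: substrings

    std::vector<int> codes;
    std::string w;                                 // set w = NIL
    for (char k : input) {                         // read a character K
        std::string wk = w + k;
        if (table.count(wk)) {
            w = wk;                                // wK exists: extend w
        } else {
            codes.push_back(table[w]);             // output the code for w
            if (next_code < 4096)                  // table is fixed at 4096 entries
                table[wk] = next_code++;           // add wK to the string table
            w = std::string(1, k);                 // w = K
        }
    }
    if (!w.empty())
        codes.push_back(table[w]);                 // flush the final string
    return codes;
}

Fed /WED/WE/WEE/WEB, this produces exactly the code sequence traced above.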
The difficult part now is to see how the decompressor has enough information to reconstruct the data exactly. The key is that the decompressor can build exactly the same table as the compressor built. The first 4 things output by the compressor are /, W, E, and D. Note that each of those is output as a 12-bit value rather than an 8-bit char. A key question: when the decompressor sees the 256, how can it know that it is supposed to be /W? The answer is that the decompressor knows the first 4 characters the compressor saw, so it knows that the compressor had a table that consisted of

256 = /W
257 = WE
258 = ED

So the decompressor knows exactly what to do with the 256. This observation is key to understanding all of the LZ algorithms.
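To make this concrete, here is a hedged C++ sketch of the decompression side, rebuilding the table one step behind the compressor. It also handles one subtle case the /WED example never hits: a code can refer to the entry the compressor created on that very step, in which case the string is w plus its own first character.

#include <cstddef>
#include <string>
#include <vector>

// Sketch: LZW decompression; codes arrive as ints, already unpacked.
std::string decompress_lzw(const std::vector<int>& codes) {
    if (codes.empty()) return "";
    std::vector<std::string> table(4096);
    for (int i = 0; i < 256; ++i)
        table[i] = std::string(1, static_cast<char>(i));
    int next_code = 256;

    std::string w = table[codes[0]];               // first code is a single byte
    std::string output = w;
    for (std::size_t i = 1; i < codes.size(); ++i) {
        std::string entry;
        if (codes[i] < next_code)
            entry = table[codes[i]];               // code already known
        else
            entry = w + w[0];                      // code defined this very step
        output += entry;
        if (next_code < 4096)
            table[next_code++] = w + entry[0];     // mirror the compressor's entry
        w = entry;
    }
    return output;
}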
For the -LZW option you are to use a fixed table length of 4096 entries, with codes of 12 bits each (as described above). Once you get to 4096 table entries the table becomes static and unchanging.
The basic idea of Huffman encoding is to come up with an encoding scheme that uses shorter encodings for often-occurring strings (or characters). An outstanding description of Huffman encoding is found at the University of Western Australia. Be sure to follow the link to the operation of the Huffman algorithm. The FAQ referenced in the LZ section also has some information on Huffman encoding.
One can do Huffman encoding on characters, groups of characters, words, or whatever else. For this assignment we ask that it be done on a character basis.
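For concreteness, here is one sketch of building per-character Huffman codes with a priority queue; the node layout and names are ours, and any correct construction is acceptable.

#include <map>
#include <queue>
#include <string>
#include <vector>

struct Node {
    long freq;
    int symbol;                                    // -1 for internal nodes
    Node* left;
    Node* right;
};

struct ByFreq {                                    // min-heap on frequency
    bool operator()(const Node* a, const Node* b) const { return a->freq > b->freq; }
};

static void assign_codes(const Node* n, const std::string& prefix,
                         std::map<char, std::string>& codes) {
    if (!n->left) {                                // leaf: record its code
        codes[static_cast<char>(n->symbol)] = prefix.empty() ? "0" : prefix;
        return;
    }
    assign_codes(n->left, prefix + "0", codes);
    assign_codes(n->right, prefix + "1", codes);
}

std::map<char, std::string> huffman_codes(const std::map<char, long>& freq) {
    std::map<char, std::string> codes;
    if (freq.empty()) return codes;
    std::priority_queue<Node*, std::vector<Node*>, ByFreq> pq;
    for (const auto& f : freq)
        pq.push(new Node{f.second, f.first, nullptr, nullptr});
    while (pq.size() > 1) {                        // merge the two rarest trees
        Node* a = pq.top(); pq.pop();
        Node* b = pq.top(); pq.pop();
        pq.push(new Node{a->freq + b->freq, -1, a, b});
    }
    assign_codes(pq.top(), "", codes);             // nodes leak; fine for a sketch
    return codes;
}

For AAABBBAAAC (used in the example below) the frequencies are A=6, B=3, C=1, and A = 1, B = 01, C = 00 is one valid result.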
A compressed file in this scheme has three parts: a header, a dictionary, and the data.

The data is simply the encoded message, using the encodings generated by the Huffman algorithm.

The dictionary contains the translation from the original encoding to the Huffman encoding. There are many ways to do this.

The header contains whatever other information is needed by the decompressor to be able to decompress the file. This could include the number of dictionary entries and whatever else your algorithm might need.
As an example, suppose the file to be compressed contains the string AAABBBAAAC. Our Huffman encoding scheme might generate the encodings

A = 1
B = 01
C = 00

The header might consist of a single integer indicating how many dictionary entries there are. There can't be more than 256 different characters, so this count fits in a single 8-bit character (if you allow all 256 to occur, store count - 1). Further, the operating system will tell us if we have hit the end of the file, but it does this on a character basis rather than a bit basis. It is very important to know at what _bit_ the data ends. We thus include a character which tells us what bit the last character ends on. (This is the number of bits in the data modulo 8.)
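As a tiny sketch of writing that header (the names are made up; note the caveat in the comment):

#include <fstream>

// Sketch: write the two header bytes from the example scheme.
// If all 256 characters can occur, store num_entries - 1 instead,
// since 256 itself does not fit in a byte.
void write_header(std::ofstream& out, int num_entries, long total_data_bits) {
    out.put(static_cast<char>(num_entries));         // dictionary entry count
    out.put(static_cast<char>(total_data_bits % 8)); // bit the last byte ends on
}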
We can use standard ASCII to identify each of the characters in the dictionary using 8 bits. Further, we are sure we will have very few characters in each file, so we guess that no encoding will have more than 7 bits. However we notice that, for example, the encodings for 1 and 01 would both be 0000001 when stored in 7 bits, a major problem. There are a number of ways to get around this. For this scheme we choose to mark the position just before the actual encoding with a 1 bit. So 01 would be stored (in 8 bits now) as 00000101, while 1 would be stored as 00000011.
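In code, the marker-bit trick might look like this sketch (pack_encoding is our own name, not part of the assignment):

#include <string>

// Sketch: prepend the marker 1 bit, then the code bits, in a single byte.
// pack_encoding("01") == 00000101; pack_encoding("1") == 00000011.
unsigned char pack_encoding(const std::string& bits) {
    unsigned char byte = 1;                        // the marker bit
    for (char b : bits)
        byte = static_cast<unsigned char>((byte << 1) | (b == '1' ? 1 : 0));
    return byte;                                   // valid for codes up to 7 bits
}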
So in binary we have:
0000 0011 // header indicating we have 3 dictionary entries
0000 0110 // header indicating the last character uses 6 bits
0100 0001 // ASCII for A
0000 0011 // encoding for A. Really "1"
0100 0010 // ASCII for B
0000 0101 // encoding for B. Really "01"
0100 0011 // ASCII for C
0000 0100 // encoding for C. Really "00"

And from the header we can tell that the remaining information is the data section.
1110 1010 // AAABB and 1/2 of the next B
1111 0000 // The rest of the B, AAAC, and two extra zeros.
          // We know the last two zeros aren't an extra char because of the
          // header that said the last char uses 6 bits (111100)

Notice that both the old and the new file took up 80 bits. Also notice we did not include the 4-byte identifier required in the specifications section... Still, it was a very small file and we would hope to do better on a larger file.
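Decoding the data section with such a dictionary can be as simple as growing a bit string until it matches a code; this works because Huffman codes are prefix-free. A sketch, with bits passed as a '0'/'1' string for clarity (a real decoder would pull bits from the file with shifts and masks):

#include <map>
#include <string>

// Sketch: with {"1"->'A', "01"->'B', "00"->'C'},
// decode_bits("11101010111100", table) yields "AAABBBAAAC".
std::string decode_bits(const std::string& bits,
                        const std::map<std::string, char>& code_to_char) {
    std::string out, cur;
    for (char b : bits) {
        cur += b;                                  // grow the current code
        auto it = code_to_char.find(cur);
        if (it != code_to_char.end()) {            // matched a complete code
            out += it->second;
            cur.clear();
        }
    }
    return out;
}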
You do not need to (nor should you!) use the exact scheme shown above. It won't work for larger encodings. But it should provide some ideas. (You can use a scheme as close to the above as you wish.)
The encoding scheme you choose should make a reasonable use of space. Remember the goal is to compress the file size. Some waste (especially in the dictionary and headers) is acceptable.
Your encoding and decoding schemes must have a reasonable run time. Don't worry about making it go really fast. Just don't do anything that is terribly inefficient.
In the my_lz.txt file, describe the Lempel-Ziv variant you implemented for the -LZY flag. This should include a short description of each major step taken by your algorithm (including reading and writing the file). For each of these steps provide a tight bound on the big-O complexity.
Do all of your work, with all needed files, in some directory other than your home directory. Before you turn in your code be sure that:

- your Makefile is included, and make clean actually does remove the executable and the tmp files (such as .o files)
- make will generate an executable called my_compress

Then submit your directory with:

/afs/engin.umich.edu/class/perm/eecs380/bin/submit380 3 DIRNAME

where DIRNAME is the directory containing your work.
The simple autograder will be turned on by midnight on Monday April 9th. It will be announced on the newsgroup.
The my_lz.txt file is worth 15 points.
You will get zero points for the project if you create any files other than the specified output file!
It is not our intent to grade for coding style. However, your code should be fairly readable, as we will need to review it by hand (to make sure you are implementing the right algorithms!). If we can't understand what is going on we may remove points!
Manipulating the individual bits of the characters to be written may be one of the more difficult parts of this assignment for many of you. Make good use of the shift operators, bit-wise operators and/or STL's bit-vectors.
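As a starting point, here is a minimal sketch of a bit-level output buffer built from shifts and masks; the class name and interface are ours, not a requirement:

#include <ostream>

// Sketch: accumulate bits high-to-low and emit whole bytes to a stream.
class BitWriter {
    std::ostream& out_;
    unsigned char buf_;
    int nbits_;
public:
    explicit BitWriter(std::ostream& out) : out_(out), buf_(0), nbits_(0) {}
    void put_bit(bool bit) {
        buf_ = static_cast<unsigned char>((buf_ << 1) | (bit ? 1 : 0));
        if (++nbits_ == 8) {                       // a full byte: write it out
            out_.put(static_cast<char>(buf_));
            buf_ = 0;
            nbits_ = 0;
        }
    }
    void put_bits(unsigned value, int width) {     // e.g. a 12-bit LZW code
        for (int i = width - 1; i >= 0; --i)
            put_bit(((value >> i) & 1u) != 0);
    }
    void flush() {                                 // zero-pad the final byte
        while (nbits_ != 0)
            put_bit(false);
    }
};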
Reading and writing binary files in C++ is explained in handouts. If you didn't do it for HW5, it would be a good idea to play with this a little bit before you start doing the actual program.
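If you want something small to play with, this sketch copies a file byte-for-byte using binary-mode streams (the file names are placeholders):

#include <fstream>

int main() {
    std::ifstream in("input.bin", std::ios::binary);   // binary mode: no
    std::ofstream out("output.bin", std::ios::binary); // byte translation
    char c;
    while (in.get(c))        // one byte at a time, 0x00 through 0xFF included
        out.put(c);
    return 0;
}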
As a general rule, for large files you should be within a factor of 50 of the runtime of the Unix utilities compress or gzip on the same file. The basic goal is to not take forever.