COMPSCI220S2C2012 Assignment 1
due 8 August 8:30 pm
This assignment introduces you to an example of applied algorithm analysis and shows you why,
sometimes, compromises in an algorithm are necessary to make it useful in practice. We also cover
some aspects of space complexity – a topic we are not covering in the lectures that is nevertheless quite
important.
The family of algorithms you will get to know in this assignment are known collectively as the “Lempel-
Ziv” family (or “LZ” for short), after the inventors of the fundamental algorithm, Abraham Lempel and
Jacob Ziv. The usual application of Lempel-Ziv algorithms is for lossless data compression – making files
smaller if possible without actually losing any information in the process (i.e., we want to still be able to
recover the original file from the compressed version).
You would have used an algorithm of the LZ family if you have ever downloaded, created, or opened a
“.zip” file. In fact, you may have used an LZ algorithm just by browsing the web: Many web servers
compress their pages with a Lempel-Ziv algorithm before sending them to your web browser. LZ
compression algorithms have been widely implemented on many platforms (“.z” and “.gz” files under
Unix/Linux are also LZ-compressed, for example).
Lempel and Ziv’s “original” algorithm, however, published in the IEEE Transactions on Information
Theory in 1976, was not a compression algorithm. Its purpose was the parsing of a string for repeating
patterns in an effort to estimate the string’s complexity - but that’s a story for another day. Its
importance for this assignment is that it is the “grandfather” of the LZ algorithms, and it is the first
algorithm you are to implement here. It’s also easily converted into a compression algorithm, even if this
is not our main focus here.
Searching for known patterns
Picture yourself in a restaurant. Imagine they charge not by what you eat but by how many letters there
are in your order. There are two ways of ordering your dish:
1. “I’ll have the fillet of mountain goat1 on wine-soaked oakberry compote with rice, and I’d also
like a green salad”
2. On seeing that your neighbour has the dish you want, you point your finger at her and say “I’ll
have what she’s having and I’d also like a green salad”
The first order is obviously the more expensive one, whereas the second one uses a reference to an
existing order (pattern) and is shorter and cheaper. LZ compression uses the same basic idea: express
the contents of a file as a series of references to existing patterns. Each reference also goes along with
an innovation that extends our set of known patterns – your green salad (it’s what allows your friend
who orders next just to say “I’ll have the same” rather than having to say “I’ll have the same and also the
green salad”). In an LZ algorithm, we describe a file (a sequence of bits, basically) in a series of steps,
each of which consists of a reference to a sequence of bits (pattern) that we can find in the part of the
file that we have already described via the previous steps, and an innovation – an extra bit that we
haven’t seen in combination with the referenced pattern before.

1 Note to UofA Animal Ethics Committee: For illustrative purposes only. No actual mountain goats shall be used in
this assignment or have been harmed in the preparation thereof.
Here is how it works: Imagine we have a file (of bits) with the content:
011010001001101111011100110011101001111110
Let’s further assume that the bit positions are counted 0,1,2,3,4,... from the left, i.e., bit 0 is a 0 and bit 4
is a 1 here.
The original LZ algorithm defines one variable that will accompany us through all steps: the
parsingPosition. It also defines a list that records for each step:
• the longestPatternStartPosition as the bit position to the left of the
parsingPosition where the longest bit pattern starts that also simultaneously starts at
parsingPosition
• the longestPatternLength, describing the length of that pattern
• the innovation, the first bit after the pattern to the right of the parsingPosition
The longestPatternStartPosition and longestPatternLength together form the
reference to a “known pattern” in each step.
Let’s work our example:
We start with step 0. Initially our parsingPosition is 0, and because there is nothing to the left of
the parsingPosition, longestPatternStartPosition and longestPatternLength are
also both 0. The innovation is our first bit, a 0:
• longestPatternStartPosition = 0
• longestPatternLength = 0
• innovation = 0
Now as we move to step 1, we increase our parsingPosition by longestPatternLength + 1.
So parsingPosition is 1. We now look to the left of parsingPosition, but since the only bit
there is a 0 and the bit at parsingPosition is a 1, there is no match, so
longestPatternStartPosition and longestPatternLength both remain 0.
Only our innovation is different this time:
• longestPatternStartPosition = 0
• longestPatternLength = 0
• innovation = 1
As we move to step 2, we again increase our parsingPosition by longestPatternLength + 1.
So parsingPosition is 2. We now look to the left of parsingPosition, and find one bit string
starting at longestPatternStartPosition 1 that is the same as the pattern that starts at
parsingPosition. It’s only one bit long, though, so we get:
• longestPatternStartPosition = 1
• longestPatternLength = 1
• innovation = 0
The 0 for the innovation is the bit at position 3. So far, we’ve parsed
011010001001101111011100110011101001111110 as
(0)(1)1(0)10001001101111011100110011101001111110
where the parentheses mark our innovations. Moving to step 3, we again increase our
parsingPosition by longestPatternLength + 1, so the parsingPosition moves from 2
to 4. Starting at position 4, we see the bit string 100010... ahead to the right. Looking to the left, we
determine:
• longestPatternStartPosition = 2
• longestPatternLength = 2
• innovation = 0
So far, we have parsed
011010001001101111011100110011101001111110 as
(0)(1)1(0)10(0)01001101111011100110011101001111110
Step 4 gives, from parsingPosition 7:
• longestPatternStartPosition = 3
• longestPatternLength = 4
• innovation = 1
So far, we have parsed
011010001001101111011100110011101001111110 as
(0)(1)1(0)10(0)0100(1)101111011100110011101001111110
Step 5, from parsingPosition 12:
• longestPatternStartPosition = 2
• longestPatternLength = 3
• innovation = 1
So far, we have parsed
011010001001101111011100110011101001111110 as
(0)(1)1(0)10(0)0100(1)101(1)11011100110011101001111110
Step 6 gives, from parsingPosition 16:
• longestPatternStartPosition = 11
• longestPatternLength = 6
• innovation = 0
So far, we have parsed
011010001001101111011100110011101001111110 as
(0)(1)1(0)10(0)0100(1)101(1)110111(0)0110011101001111110
Note that this is a special (but perfectly functional) case in which the pattern that starts to the left of
the parsingPosition extends up to the parsingPosition and beyond it – the known pattern
does a “nose-on-tail” overlap with the new pattern.
Exercise (not to be submitted): work out the parsing for the rest of the bit string. You should get:
(0)(1)1(0)10(0)0100(1)101(1)110111(0)0110(0)11101(0)01111(1)10
You’ll notice that there is no innovation at the end of the file, simply because the 10 there is a wholly
known pattern – there is no need for an innovation. This effect is also a characteristic of LZ parsing: We
may or may not have the algorithm terminate with a full step.
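The parsing procedure above can be sketched in Java. This is only an illustration with hypothetical names (the assignment skeleton expects your parser in the lz77() method instead, and you must not copy code into your submission without understanding it), but it reproduces the parsing of the example string:

```java
// Minimal sketch of the basic LZ parsing loop with naive string matching.
public class LZParseSketch {
    // Returns the parsing with innovations in parentheses, e.g. "(0)(1)1(0)...".
    public static String parse(String bits) {
        StringBuilder out = new StringBuilder();
        int n = bits.length();
        int parsingPosition = 0;
        while (parsingPosition < n) {
            int longestPatternStartPosition = 0;
            int longestPatternLength = 0;
            // Try every start position to the left of parsingPosition.
            for (int start = 0; start < parsingPosition; start++) {
                int len = 0;
                // Matching may run past parsingPosition ("nose-on-tail"),
                // but must stop at the end of the input.
                while (parsingPosition + len < n
                        && bits.charAt(start + len) == bits.charAt(parsingPosition + len)) {
                    len++;
                }
                if (len > longestPatternLength) {
                    longestPatternLength = len;
                    longestPatternStartPosition = start;
                }
            }
            // Copy the matched pattern, then the innovation (if any) in parentheses.
            out.append(bits, parsingPosition, parsingPosition + longestPatternLength);
            int innovationPos = parsingPosition + longestPatternLength;
            if (innovationPos < n) {
                out.append('(').append(bits.charAt(innovationPos)).append(')');
            }
            parsingPosition = innovationPos + 1;
        }
        return out.toString();
    }
}
```

Run on the example string, this sketch produces exactly the fully parsed output shown above, including the final full step without an innovation.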
Your first task (20 marks)
Your first task in this assignment will be to implement the algorithm above. It must take text files as
input in which each character is either a ‘0’ or a ‘1’, representing the bits in a binary file. Your program
must ask for the file name to be entered and must output the final parsing in the format of the string
above (or similar), so the innovations are shown in parentheses. The markers will need this for testing.
To test and debug your algorithm, consider outputting the intermediate steps as well. As test data,
download files from:
https://www.cs.auckland.ac.nz/~ulrich/COMPSCI220/fileMaker.php.
Start with short files – the long files are for later. The entropy is a measure for how “predictable” the
content of the file is: low entropy files contain mostly 1’s and very few 0’s, whereas files with entropy 1
contain more or less random looking data. For debugging, start with a file that has high entropy.
Note 1: You will later extend your program, so make sure you make a backup of this algorithm once it
works for you. Keep it in a safe place so you can revert to it if you run into trouble when you extend the
program.
Note 2: While it is strongly recommended that you attempt this task first, you may wish to attempt the
non-programming tasks of this assignment with priority if you find yourself under time pressure and
return to this task later. Where experimental work is required, you may use the example application from
the assignment web page instead of your own implementation.
In a moment, we will investigate the Big-Oh time complexity of the algorithm above. You will then
compare it with a variant of the algorithm known as LZ77.
Data Compression with the LZ algorithm
You may look at the algorithm above and ask: “Where’s the compression here?” Fair comment – so far,
we’ve parsed into steps but we haven’t actually compressed anything. To compress, we must encode
the list of steps we have derived. In doing so, we must encode for input of length n:
1. The innovation of our first step: one bit
2. The longestPatternStartPosition of our second step: one bit as the start position
could be 0 or 1.
3. The longestPatternLength of our second step: because of the “nose-on-tail” effect, this
could be up to n-1, so we need log2(n-1) bits for this.
4. The innovation of the second step: one bit
5. …
6. The longestPatternStartPosition of our i’th step: this could be any position between
0 and one bit position before the i’th parsingPosition: log2(parsingPosition) bits
are needed.
7. The longestPatternLength of our i’th step: Since the pattern cannot be longer than the
length of the sequence between the i’th parsingPosition and n, all we need here is
log2(n - parsingPosition) bits.
8. The innovation of the i’th step: one bit
9. Finally, we need one bit to indicate whether the last step has an innovation or not.
With some luck, the total number of bits we need for this will be less than the number of bits in the
sequence. We can compute the compression ratio as the number of bits required for the compressed
encoding divided by the number of bits in the original file. If this ratio is smaller than 1, then the file is
compressible with the LZ algorithm.
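The bookkeeping for this can be sketched as follows. The helper names are my own, and the convention of using ceil(log2(k)) bits to distinguish k possible values (and 0 bits when only one value is possible) is one reasonable reading of the list above, not a prescription:

```java
// Sketch: total encoded size for a parsing, per the enumeration above.
public class LZEncodingSize {
    // Minimum whole bits to distinguish valueCount possible values
    // (assumption: 0 bits when there is at most one possible value).
    static long bitsFor(long valueCount) {
        if (valueCount <= 1) return 0;
        return (long) Math.ceil(Math.log(valueCount) / Math.log(2));
    }

    // parsingPositions holds each step's parsingPosition; n is the input length.
    public static long encodedBits(int[] parsingPositions, int n) {
        long total = 1; // final flag bit: does the last step have an innovation?
        for (int p : parsingPositions) {
            if (p == 0) {
                total += 1; // first step: innovation only
            } else {
                total += bitsFor(p);     // start position is in 0 .. p-1
                total += bitsFor(n - p); // pattern length: log2(n - p) bits, per item 7
                total += 1;              // the innovation bit
            }
        }
        return total;
    }
}
```

The compression ratio is then encodedBits(...) divided by (double) n; a ratio below 1 means the file compressed.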
Your second task (10 marks)
Extend your program so it also computes and outputs the number of bits after compression. This
involves adding up the required bits across all steps using the considerations above. You do not have to
implement the compression encoding itself, but your implementation should show the compression
ratio in its output.
Theoretically, the lower the entropy of a file, the more compressible it should be. Try the algorithm with
a few longer (say 100,000 symbol) sample files of different entropies. You should notice that the
compression ratio will be greater than 1 for the higher entropies (over 0.5 bits/bit) – meaning the basic
LZ algorithm above cannot compress these files.
Note: You do not have to submit the results of these experiments (they carry no marks). Make a note of
them as they will come in handy later. Again, you can defer the implementation of the code in this task if
you are under time pressure and use the application provided for your experiment.
The time complexity of the LZ parsing algorithm
Setting aside complex data structures such as suffix trees (which are beyond the scope of 220), we can
assume that we will find longestPatternStartPosition and longestPatternLength in
each step by naïve string matching. That is, starting from each potential start position and from the
parsingPosition, we count how many identical symbols we can find, progressing symbol by
symbol until the symbols no longer match, then advancing our start position and updating
longestPatternStartPosition and longestPatternLength as we go. We will work with
the following basic observations:
• Our input size will be n bits.
• Comparing two symbols from a binary alphabet is an elementary operation, i.e., O(1).
• Setting, incrementing, and comparing length and position values is strictly speaking an O(log n)
operation. However, in practical implementations such as the one you will attempt here, this
complexity is absorbed by the computer’s architecture, and we can treat these operations as
being O(1) for our purposes (as we would do, e.g., in the context of a sorting algorithm).
• Identifying the innovation is also O(1), as is the encoding of each step.
• The algorithm must take at least O(n) time, as we apply at least one of the operations above to
each of the n symbols in the input.
Consider a parsing step where the longestPatternStartPosition is close to
parsingPosition but almost every search start position before
longestPatternStartPosition results in a long match as well. E.g. consider the following
simple file: 000000010000000000000000. The initial parsing steps are as follows:
1. (0)00000010000000000000000
2. (0)000000(1)0000000(0)00000000
The marked bit at position 8 here (the first 0 after the 1) is the longestPatternStartPosition
for the next step, and will match all our remaining bits, but we won’t know this until we’ve tried and
found out that we can match seven bits from position 0, six bits from position 1, five from position 2,
and so on. We can create arbitrarily many similar examples by inserting a ‘0’ after each of the
innovations above. The general form of the example is: [m 0’s]1[(2m+2) 0’s] for arbitrary integer values of m.
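For experimentation, such worst-case inputs are easy to generate yourself. This is a hypothetical helper (for the submitted experiments you should use your personalised files from the file maker page):

```java
// Generates the worst-case input [m 0's] 1 [(2m+2) 0's] described above.
public class WorstCaseInput {
    public static String make(int m) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < m; i++) sb.append('0');
        sb.append('1');
        for (int i = 0; i < 2 * m + 2; i++) sb.append('0');
        return sb.toString(); // total length n = 3m + 3
    }
}
```

With m = 7 this reproduces the 24-bit example file above, which is also why m grows linearly with n.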
Your third task (20 marks)
As the third task of your assignment, show that this example implies m = Ω(n) and m = O(n), and that
an implementation of the LZ parsing algorithm with naïve string matching has a time complexity that is
at least quadratic, i.e., Ω(n²). In your proof, you may want to use the fact that the sum of all integers
from 1 to k is k(k+1)/2.
Submit a PDF file with your proof.
Hint: You may find it helpful to consider that a polynomial of the form xn² + yn + z (with x > 0) is Θ(n²)
and thus O(n²), and that applying this to three different values of n constitutes a system of linear
equations. If you do not know how to solve a system of linear equations, interactive solvers can be
found on the web.
Your fourth task (10 marks)
Extend your algorithm implementation such that it counts the number of elementary operations that
the algorithm performs. Do not count the operations associated with computing the bits needed for
encoding. Count only those elementary operations associated with the actual parsing.
Note: You do not have to match the number of elementary operations output by the sample application
exactly – if the difference between your result and the sample application for the same file is small across
different file lengths, then this is not a problem. The overall trend and behaviour of the result is what matters.
Experimentally confirm that your algorithm indeed behaves like an O(n²) algorithm for files of the
format [m 0’s]1[(2m+2) 0’s] of up to 1,000,000 bytes in length. You can download such files from the file
maker URL above.
Also, show (by means of at least four files each) that this conclusion also holds for files of low (0.1
bits/bit), medium (0.6 bits/bit) and high (1 bit/bit) entropy.
Include your experimental data and reasoning in your PDF file for the 3rd task. (Again, you may initially
substitute the sample application here for your experimentation, but half of the marks for this task
depend on your final application outputting credible numbers of elementary operations as it
demonstrates your ability to identify the operations that matter!)
Reality check
An algorithm which runs in O(n²) is not necessarily impractical even for large n if one of the two
following conditions is met:
1. If inputs for which the quadratic bound is reached are not expected in practice
2. If the coefficient associated with the n² term in the polynomial that describes the upper bound
on the algorithm’s running time is so small that the running time is acceptable for all inputs of
practical length.
In the case of the “basic” LZ algorithm, your experiments should have demonstrated that neither applies
here: Even on your “randomly” generated files, you will find that the growth in elementary operations
looks uncomfortably close to being quadratic in n. The growth is also evident across our range of file
lengths, which is well within the range of file sizes you would expect on a practical computer. You
probably noticed that the “one million bit” files took several minutes to process. Now, the last time you
put a file that size into a Windows compressed folder, it didn’t take nearly that long, right?
The main problem here is that we’ve got to go back to the beginning of the file to search more and more
possible start positions for our longest patterns every time we move our parsingPosition forward,
and at the same time our possible longest patterns also grow in size.
Moreover, if you open one of the entropy files for 0.6 bits/bit, which we couldn’t compress, you’ll find
that it has lots of repeated patterns in it. So it’s not for lack of repetitions that we can’t compress them.
Our main problem is that many of our repeated patterns are short compared to the length of the file
before parsingPosition and in many cases occur more than once before parsingPosition.
Yet in each step we provide sufficient bits to encode a longestPatternStartPosition and
longestPatternLength value that could be anywhere within a huge range. In most steps, many of
these bits have the same value 0, because the pattern occurs early in the string and isn’t overly long.
So there’s got to be a better way. There is. More than one, in fact, but we’ll only look at one here: LZ77.
The LZ77 algorithm
Don’t panic. You’ve done most of your assignment. LZ77 is just the icing on the cake, and actually just a
small modification to the basic LZ algorithm. All we need to do is two things:
1) Restrict where we may search for longestPatternStartPosition: We now demand
that this must be within a fixed windowSize before parsingPosition. E.g., if
windowSize is defined as 100 and parsingPosition is 9457, then the
longestPatternStartPosition must be between 9357 and 9456.
2) Restrict the longestPatternLength, e.g., to windowSize as well.
The basic algorithm thus becomes a special case of LZ77 with a windowSize of n.
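The window restriction amounts to a single clamp on the earliest start position we are allowed to search (a sketch with a hypothetical helper name):

```java
// Sketch of the LZ77 search-window lower bound.
public class LZ77Window {
    // Earliest start position we may search from, given the window size.
    // The window is the windowSize positions immediately before
    // parsingPosition, truncated at the beginning of the file.
    public static int searchStart(int parsingPosition, int windowSize) {
        return Math.max(0, parsingPosition - windowSize);
    }
}
```

For the example above (windowSize 100, parsingPosition 9457), this gives 9357; early in the file, where the window would extend past the start, it simply gives 0.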
Your fifth task (10 marks)
Extend your algorithm so that it allows a windowSize to be input, converting it into an LZ77
implementation. Your algorithm should query the window size after the input file name and should
default to the basic LZ algorithm (windowSize = input length).
Your sixth task (20 marks)
Show that your implementation of LZ77 is O(n) for a fixed windowSize. This argument should be
based on theoretical considerations only and not involve experimental evidence. Then add suitable
experimental evidence as confirmation that your implementation does not contradict your theoretical
proof. Include this proof in your PDF submission. The experimental evidence is worth 5 marks of the 20
marks for this step and to earn the marks, the evidence must come from your own LZ77 implementation
(you cannot substitute the sample application here).
LZ77 compression
If we try to encode each step in LZ77, we now find that we can get away with using fewer bits. In
particular:
1. Encoding the longestPatternStartPosition of our i’th step: This required
log2(parsingPosition) bits in the plain LZ algorithm because the pattern could start
anywhere between the beginning of the file and parsingPosition-1. However, now it must
start within our window, and this means we can encode the
longestPatternStartPosition as an offset of between 1 and windowSize from the
parsingPosition. This requires only log2(windowSize) bits (or even fewer for small i,
where the window would start before the beginning of the file).
2. The longestPatternLength of our i’th step needed log2(n - parsingPosition) bits
to encode, as the pattern could conceivably cover the entire rest of the file at each step. Now
we’re restricted to a length of at most windowSize symbols, so log2(windowSize) bits will do.
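Under these two points, the cost of a single LZ77 step can be sketched as follows. The helper names are my own, and exploiting the truncated window for small parsingPosition (as point 1 permits) is my assumption:

```java
// Sketch: bits to encode one LZ77 step (offset + length + innovation).
public class LZ77StepBits {
    // Minimum whole bits to distinguish valueCount possible values.
    static long bitsFor(long valueCount) {
        if (valueCount <= 1) return 0;
        return (long) Math.ceil(Math.log(valueCount) / Math.log(2));
    }

    public static long stepBits(int parsingPosition, int windowSize) {
        // The offset range shrinks near the start of the file, where the
        // window is truncated; the length stays bounded by windowSize.
        int effectiveWindow = Math.min(parsingPosition, windowSize);
        return bitsFor(effectiveWindow) + bitsFor(windowSize) + 1;
    }
}
```

For windowSize 100, a step deep inside the file costs 7 + 7 + 1 = 15 bits regardless of how long the file is, which is the whole point of the window.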
In LZ77, we are of course (a) likely to miss the actually longest patterns and (b) cannot have patterns
longer than windowSize. This means we cannot have steps of the same length as before and will
generally need more steps to parse the string. This in turn means we need to encode more steps, which
costs us extra bits.
Your seventh task (10 marks)
Demonstrate (experimentally) that LZ77 is capable of compressing better than the basic LZ algorithm.
For this, you will need to modify your implementation to take the modified encoding sizes into account
if the windowSize is less than the length of your input.
Your argument could involve either showing that previously incompressible files are compressible with
LZ77, or that previously compressible files need fewer bits to encode under LZ77, or both. Comment on
suitable window sizes, whether the best compression for a file depends on the entropy of the file, and if
so, how the optimal window size relates to the file’s entropy. Full marks are available if you are able to
produce the evidence with your own implementation (half marks if you are substituting the sample
application).
All done, now all you need to do is submit!
Getting started
Read the example above, download the sample application, and familiarise yourself with the way the
basic LZ algorithm works. DO NOT START ANY OTHER TASK UNTIL YOU HAVE DONE THIS.
As mentioned above, you can use the sample application for much of the experimental work in this
assignment, however ideally you should use your own implementation. You may wish to skip your own
implementation or part of it if you are running out of time – only half the marks are for results you can
achieve by programming.
The order in which you complete the tasks is up to you; however, I recommend that you complete
them in sequence.
Getting to the programming
For your convenience, a skeleton application will be provided that allows you to input a file name and a
window size; the window size will default to the size of the input if you do not specify one. Do not
modify this part of the application; implement your LZ parser in the lz77() method. The skeleton
application consists of four classes: the application class, the program class, the Keyboard class
(which you know from COMPSCI101), and an LZ77Step class which you can use in a linked list to
store the parsing steps of your program (you can, but you don’t have to – it’s possible to implement
the program without this class).
When programming, implement the basic LZ algorithm first (first task). Once it works, make a backup of
your files before you proceed to the next step.
When computing logarithms, use the Math.log() method and remember that to get a logarithm to
base 2, you need to divide the result by Math.log(2), like so:
// Prints the logarithm of 8 to base 2 (approximately 3.0)
System.out.println("The logarithm of 8 to base 2 is " + Math.log(8)/Math.log(2));
To get whole numbers of bits, you may want to use the Math.ceil() method, which rounds its argument
up to the nearest integer (beware that floating-point rounding in the logarithm above can land just past
an exact power of two, so check your bit counts at those boundaries):
System.out.println(Math.ceil(5.43)); // prints 6.0
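Putting the two together, a helper for the minimum number of bits needed to distinguish z possible values might look like this (a sketch with hypothetical names; the exact integer version is included for comparison, because at exact powers of two the floating-point quotient can come out a hair above the true value, making Math.ceil() yield one bit too many):

```java
// Minimum whole bits needed to distinguish z possible values (z >= 1).
public class BitCount {
    // Floating-point version, as suggested above. Caution: for z an exact
    // power of two, Math.log(z)/Math.log(2) may land slightly above the
    // true value, and Math.ceil() then rounds one bit too high.
    public static int bitsFloat(long z) {
        if (z <= 1) return 0;
        return (int) Math.ceil(Math.log(z) / Math.log(2));
    }

    // Exact integer version: the number of bits in the binary form of z - 1.
    public static int bitsExact(long z) {
        int bits = 0;
        for (long v = z - 1; v > 0; v >>= 1) bits++;
        return bits;
    }
}
```

The two agree away from powers of two; if your bit counts differ from the sample application only at those boundaries, this rounding issue is the likely cause.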
When computing your number of elementary operations, include only the elementary operations
mentioned, and only in lz77(), not elsewhere – and neither those related to computing the number of
bits needed for the encoding, nor the operations involved in computing the number of elementary
operations themselves. For the purposes of this assignment, do not combine sequential elementary
operations into a single one as you would when ascertaining the computational complexity, i.e.:
// Count the following as two elementary operations
x++; // first elementary operation
y--; // second elementary operation
Example (not from my code) of how to account for elementary operations:
// Upcoming comparison of positions x and y: one elementary operation
elementaryOperations++; // this counts the ==; it is NOT an elementary operation itself
if (x == y) {
    x++; // x is a position, so this is an elementary operation, accounted for in the next line
    elementaryOperations++;
    z = x; // z is also a position, so account for this elementary operation
    elementaryOperations++; // accounts for the assignment in the previous line
    // compute minimum number of bits required to store z in binary form
    System.out.println("Bits for z: " + Math.ceil(Math.log(z)/Math.log(2))); // not an elementary op
}
To help you debug, feel free to insert additional printouts into your code while you program, or use
Keyboard.readInput(); to pause the program until you hit return. Before you submit, remove these and
test your program thoroughly. Your final submission should only include one program file,
LZ77FileInputProgram.java.
Doing the “theory” part
In this part, you will need to argue your case in your own words, using the right mathematical notation if
and when appropriate. Your grammar and spelling are less important than the clarity of your logical
argument. Any experimental examples must reflect your own personal entropy files and your own
implementation of the code (if applicable).
You can produce your PDF file in Word, Powerpoint, LaTeX or any other package you prefer. Important:
Before submission, view your file on a lab computer and ensure that all symbols are visible. Missing
symbols are especially likely if your machine has an overseas version of Windows installed that uses
fonts that our lab machines do not have. If you find symbols missing or replaced by unreadable ones, go
back to your original machine and print the PDF document with fonts embedded. To do so, look at your
printer properties and ensure that anything that says “use system fonts” or “rely on system fonts” or
“substitute unknown fonts by system fonts” or such is unticked, and that anything that says “download
as soft fonts” or similar is ticked. You can check your document in Acrobat by looking at the document
properties: Select the Fonts tab and look at the list of fonts. Scroll down and ensure that all fonts are
marked as either “Embedded” or “Embedded Subset” – if you have a font that isn’t embedded, it’ll look
funny!
Getting help
Read the Q&A below, if that doesn’t help, ask the tutor or e-mail me: [email protected]
Submission
Submit your LZ77FileInputProgram.java and your PDF file with your other results via the
assignment dropbox on Cecil. Files with other names, or names with different spelling or capitalisation
may not be marked.
Please ensure that your submission reflects your own personal entropy data (i.e., do not accidentally
submit something for another AUID). Late submission after 8.30 pm on the 19th will attract a penalty
unless Sonny or I have given you an extension beforehand – and “beforehand” means applied for no
later than the 18th, not half an hour before deadline ;-)
For completeness we are also required to point out the University’s academic honesty policy to you:
https://www.auckland.ac.nz/uoa/cs-academic-honesty/
Questions and Answers
Q: I’d like to use some Java data structures other than the ones you have provided, from a common
package. Is this OK? A: Please don’t – there is no guarantee that the marker’s environment will have
your package installed.
Q: Can I use my own Java classes? A: No. Please use the ones that are provided.
Q: Can I rename LZ77FileInputProgram? A: For your own testing, yes, but once you submit, your
class must be called LZ77FileInputProgram and must be in a file called
LZ77FileInputProgram.java.
Q: Should there be anything else in the PDF file other than the plot with the curves for the number of
steps for the various algorithms? A: No, but if you would like to include any explanations for the marker
then that is OK.
Q: Why do I need to download personalised files? A: This is so we can ensure nobody copies anyone
else’s work ;-)
Q: The numbers I get for the elementary operations are a little different from your sample solution. Is this
a problem? A: This could be brought about by me or you not accounting correctly for each and every
operation, or counting operations as part of the algorithm that are actually part of the algorithm’s
“accounting”. Possible causes could be:
• Failing to account for the identification of the innovation
• Failing to account for elementary operations in the conditions of if-statements, while- or
for-loops: every operator such as +, -, ==, >, &&, ||, etc. represents an elementary operation here!
• Using a string comparison or substring extraction method and counting it as a single elementary
operation of O(1) when in fact the complexity is O(lengthOfString).
Otherwise if the difference is really small (i.e. if you get something like 979,355 when I get 982,421) and
the overall order of the algorithm’s behaviour is the same, then this is negligible and won’t cost you any
marks.
Q: Can I submit separate applications for the basic LZ algorithm and the LZ77? A: No.
Q: What would a good proof or experimental evidence in the theoretical part look like? A: What would
convince you that a claim I make is true? What do you think will convince me that a claim you make is
true? Think about what is needed and be creative! Any sound argument will be accepted, but a simple
repetition of the claim won’t earn you marks (even if it’s repeated three times – that’s handwaving ;-)).