You will be implementing a relatively simple text-compression scheme. your code should work as follows:
1. It should run from the command line, using the command:
java Compress
where the argument is the name of a text-file found in the same directory as the executable program. (This name is passed into your Java program as an argument, and will then be accessible from position 0 in the array args that is input to the main() method.
2. It should read through the text-file, word-by-word (using white-space to delineate words, as usual, so "words" may contain punctuation, etc.). For each word, it should count occurrences of any substring of 3 or more characters. For instance, if it reads in the word "googoo" it should count 2 occurrences of the substring "goo", 1 occurrence of "oog", and 1 occurrence each of "goog", "oogo", "ogoo", "googo", and "oogoo". Finally, it should count one occurrence of 6-letter substring, "googoo".
This counting should occur over the entire file, and it should be handle efficiently using a hash-map structure to map substrings to the number of times they occur. You can use a built-in Java structure here.
3. Once the counting is complete, the hash-map will contain a collection of substrings, each with a corresponding count value (1 or greater). You should then order these substrings:
- Create a new class that can store a substring and its count value.
- Make that class properly Comparable to other things of its type. In the compareTo ordering, each object should be ordered by its impact factor, which is the product of the substring's length and the number of times it occurs. For instance, if the substring "the" occurs 100 times, it has an impact factor of (3 × 100 = 300).
Two of your class objects should be ordered so that things with higher impact factor come before things with a lower impact factor. Thus, if "the" has impact factor 300, and "goo" has impact factor 100, then "the" comes before "goo" in comparison ordering.
- Once you have completed the class, place each of your substring and count value pairs into a priority queue, so that they will be in order, with highest impact factor at the front of the queue. Again, you can use a built-in Java structure here.
4. Now, you can generate a compressed encoding of the highest-impact substrings. For each of the first 52 such substrings (or al l of them, if your input doesn't contain 52 distinct substrings of length 3 or greater), you will encode each such string with a 2-letter coding, consisting of the plus sign '+', followed by a single character (in range a-z or A-Z).
For instance, if "the" is our highest-impact substring, then its encoding would be "+a".
Note 01: when generating codes, you should ignore all substrings that either contain, or are contained by, one you have already encoded. For instance, if you have already encoded "ther", then you should ignore any future substrings like "there", which contains it, or "the", which is contained by it. You should still attempt to generate 52 distinct codings, if that is possible after discarding overlapping substrings.
Note 02: for this to work, it assumes that the text-file does not actually include the special marker character, '+'; if this is not so, another special character would have to be used.
5. Your code should now read through the file again, writing out a compressed version to a second file. To do compression, read in each word again, and replace each substring in the word with its encoding, if such an encoding exists, and write the result.
For instance, if we read in the word "teethes", and we have an encoding of "+a" for "the", then we should re-write the word as "tee+as".
Note 01: if a word contains multiple possible substrings, then you should replace them all. When there are multiple choices, the replacement should happen by order of length of substrings, so that we get maximum compression of each word. For instance, if we have an encoding of "+a" for "the", "+b" for "theo", and "+c" for "log", then the word "theology" would first be encoded as "+blogy", since "theo" is the longest substring; it would then be encoded finally as "+b+cy", and written to the output file. If a word contains multiple possible substrings of the same length, remove them in the order given by Java String sorting (this means that every word has a uniquely defined coding).
Note 02: when writing to the output file, make sure that line-breaks and other white-space are preserved from the original file.
Note 03: the output file should have a name related, but not identical, to that of the original input. In particular, if the input is named "FILE.txt", then the compressed output file should be named "FILE_comp.txt" (where "FILE" can be any file-name).
6. The code above will somewhat compress a text-file, but we will not be able to de-compress it, unless we know the encoding used. Modify your code so that when it creates the output file, it first writes information about the compression into the first part of the file (in a format of your own choosing), and then follows it with the compressed text. Then, write a second program, which can also be called from the command line:
java Decompress
When called, this should open the named file, if possible, and (assuming it is in proper format), de-compress it, producing a new version of the file that is identical to the original input to the compression program.