Programming Project: Sorting to find anagrams
See Project 4 on pg. 869 for the basic ideas of this project. We will find the longest anagrams in the words.txt provided in the Chapter 13 files on the author's website.
1. First we need to compute the canonical form of each word. The canonical form of a word contains the same letters as the original, but in sorted order. So compose a static method
public static String canonicalForm(String word)
so that canonicalForm("computer") returns "cemoprtu", and canonicalForm("poor") returns "oopr". For this first step, put this method in a little program CanonWord.java that inputs one word and outputs its canonical form:
java CanonWord
Enter word: computer
cemoprtu
To implement canonicalForm, write a loop that extracts the individual characters of the string argument and puts them in an array of char or an ArrayList. Thus canonicalForm("poor") gets an array of 4 chars 'p', 'o', 'o', 'r' or an ArrayList with the corresponding Characters. Then sort that array or ArrayList. Finally, use the sorted characters to build a new String by a loop of s = s + c, where s is a String, and c is a char or Character.
The program CanonWord needs a main method as well as the canonicalForm method. All main does is use a Scanner to get the word from the user, call canonicalForm to convert it, and print out the resulting canonical form.
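The steps above could be sketched roughly as follows. This is only one possible shape, assuming the char-array variant and Arrays.sort from java.util; the variable names (letters, console) are illustrative and not prescribed by the assignment.

```java
import java.util.Arrays;
import java.util.Scanner;

class CanonWord {
    // Returns the letters of word rearranged into sorted order.
    public static String canonicalForm(String word) {
        // Unload the characters of the word into an array of char.
        char[] letters = new char[word.length()];
        for (int i = 0; i < word.length(); i++) {
            letters[i] = word.charAt(i);
        }
        // Sort the characters.
        Arrays.sort(letters);
        // Rebuild a String from the sorted characters with s = s + c.
        String s = "";
        for (char c : letters) {
            s = s + c;
        }
        return s;
    }

    public static void main(String[] args) {
        Scanner console = new Scanner(System.in);
        System.out.print("Enter word: ");
        String word = console.next();
        System.out.println(canonicalForm(word));
    }
}
```

With this sketch, entering computer at the prompt should print cemoprtu, matching the sample run above.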
2. Now write a program FindAnagrams that (in its main method) reads words.txt and, for each word there, adds the word to an ArrayList words and its canonical form to an ArrayList codes. It obtains the canonical form by calling the same canonicalForm method you developed for CanonWord. Sort the codes ArrayList. Then hunt for duplicates in the sorted ArrayList codes. Any duplicates there indicate anagrams in the original words. As you scan codes, keep track of the code with the most duplicates found so far and how many duplicates it has. Finally, using this most duplicated code, scan the words list for words that have this canonical form and output them. Don't worry about ties for the most duplicates; just report on the first anagram group with the most duplicated code.
Test case (this is just an example, the actual words.txt you will be working with is MUCH longer):
words.txt    codes    sorted codes
rat          art      abt
hears        aehrs    abtt
share        aehrs    aehrs
tar          art      aehrs
bat          abt      aehrs
batt         abtt     art
shear        aehrs    art
The three copies of the code "aehrs" mean that the corresponding words (hears, share, and shear) are anagrams. Similarly, the two copies of the code "art" indicate that rat and tar are anagrams. We are interested in the most duplicated code, so the program should output as follows:
java FindAnagrams
hears
share
shear
We see there are two anagram groups here, one with code aehrs and the other with code art, so the most duplicated code is aehrs, corresponding to words hears, share, and shear.
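As an illustration of the final step (scanning the words list for words with a given canonical form), here is a small sketch run on the test case above. The class name AnagramGroup and the helper wordsWithCode are hypothetical, and canonicalForm is written here in a shortened form using toCharArray rather than the loop described in step 1. The sample words are held in memory; the real program would read them from words.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AnagramGroup {
    // Shortened stand-in for the canonicalForm method from step 1.
    public static String canonicalForm(String word) {
        char[] letters = word.toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

    // Hypothetical helper: returns every word whose canonical form
    // equals code, in the order the words appear in the list.
    public static List<String> wordsWithCode(List<String> words, String code) {
        List<String> group = new ArrayList<>();
        for (String word : words) {
            if (canonicalForm(word).equals(code)) {
                group.add(word);
            }
        }
        return group;
    }

    public static void main(String[] args) {
        // The seven words of the test case, in file order.
        List<String> words = Arrays.asList(
                "rat", "hears", "share", "tar", "bat", "batt", "shear");
        for (String word : wordsWithCode(words, "aehrs")) {
            System.out.println(word);
        }
    }
}
```

Run on the test case, this prints hears, share, and shear, one per line.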
For an example of comparing words next to each other in a sorted list, see line 34 of Vocabulary1.java, pg. 684, or line 30 of the program on the textbook's site.
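The neighbor comparison can be sketched as below, using the codes from the test case. This is not the textbook's Vocabulary1 code; the class name DuplicateScan and the method mostDuplicated are made up for illustration. The idea is to count how long the current run of equal neighbors is and remember the code with the longest run seen so far.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DuplicateScan {
    // Hypothetical helper: given a SORTED list of codes, returns the code
    // with the longest run of equal neighbors, or null if the list is empty.
    // On a tie, the first run to reach the top count is kept.
    public static String mostDuplicated(List<String> sortedCodes) {
        String best = null;   // code with the longest run so far
        int bestCount = 0;    // length of that run
        int count = 0;        // length of the current run
        for (int i = 0; i < sortedCodes.size(); i++) {
            if (i > 0 && sortedCodes.get(i).equals(sortedCodes.get(i - 1))) {
                count++;      // same as the previous code: run continues
            } else {
                count = 1;    // different code: a new run starts
            }
            if (count > bestCount) {
                bestCount = count;
                best = sortedCodes.get(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Codes of the test case, in the order of words.txt.
        List<String> codes = new ArrayList<>(Arrays.asList(
                "art", "aehrs", "aehrs", "art", "abt", "abtt", "aehrs"));
        Collections.sort(codes);
        System.out.println(mostDuplicated(codes));
    }
}
```

For the test case this prints aehrs. Because equal codes sit next to each other after sorting, a single pass over the list is enough.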
Note that we are assuming that all the words in words.txt are in the same case, uppercase or lowercase. The textbook's words.txt is all in lowercase.
3. Write a narrative that describes how you went about designing and implementing these programs and how you tested them. Did you use the test case during development, did you use the final words.txt from the start, or something else? Did you find it helpful that the canonicalForm computation was developed separately? Report on what you found to be the longest anagrams in the author's words.txt from the Chapter 13 area. Put your discussion in a text file called memo.txt.
Delivery:
in your it115/p4 directory:
CanonWord.java (with main and canonicalForm methods)
FindAnagrams.java (with main and the same canonicalForm method as in CanonWord)
memo.txt
in class, at the beginning of class:
Paper copies of memo.txt and p4.grade_sheet.htm, both of which should have your name on them. memo.txt should be stapled or paper-clipped together and also have the assignment name on it (p4).