Strings, Structs & Files (CSC100)
Name your C++ source code file: data.cpp
Overview of Problem: Read a text file to create an array containing one structured element for each unique word found in the file, as well as the number of times that word appears in the file regardless of case. Use this array to calculate and display the statistics described below.
A text file contains English text, such as essay.txt available as example input to your program for this assignment. Your program must read the contents of such a text file, store the unique words and count the occurrences of each unique word. When the file has been completely read and the array of unique word structs has been set, print the words in sorted (alphabetical) order and the number of occurrences of each word to an output text file. After this output file has been created and the list of words written to the file, the following statistics must also be written to the output file:
- Total number of words read
- Total number of unique words read
- Average length of a word (as a floating point value)
- Average occurrence of a word (as a floating point value)
- Most commonly occurring word(s)
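As a rough illustration of how these statistics could be derived once the array of word structs is filled, here is a minimal sketch. The struct name WordEntry, its field names, the constant MAX_WORD_LEN, and all function names are hypothetical choices for this example. It also assumes that the average word length is taken over the unique words; weighting each word by its count is another reasonable reading, so confirm which one is intended.

```cpp
#include <cstring>

const int MAX_WORD_LEN = 20;  // hypothetical constant: longest word allowed

// Hypothetical struct: one unique word and its number of occurrences.
struct WordEntry
{
    char word[MAX_WORD_LEN + 1];
    int count;
};

// Total number of words read: the sum of every count.
int totalWords(const WordEntry list[], int size)
{
    int total = 0;
    for (int i = 0; i < size; i++)
        total += list[i].count;
    return total;
}

// Average length of a word, taken over the unique words (an assumption).
double averageLength(const WordEntry list[], int size)
{
    double average = 0.0;
    int letters = 0;
    for (int i = 0; i < size; i++)
        letters += static_cast<int>(std::strlen(list[i].word));
    if (size > 0)
        average = static_cast<double>(letters) / size;
    return average;
}

// Average occurrence of a word: total words divided by unique words.
double averageOccurrence(const WordEntry list[], int size)
{
    double average = 0.0;
    if (size > 0)
        average = static_cast<double>(totalWords(list, size)) / size;
    return average;
}

// Largest count in the array; every word with this count is "most common".
int highestCount(const WordEntry list[], int size)
{
    int highest = 0;
    for (int i = 0; i < size; i++)
    {
        if (list[i].count > highest)
            highest = list[i].count;
    }
    return highest;
}
```

To report the most commonly occurring word(s), one option is to call highestCount once and then write out every entry whose count equals that value, which handles ties naturally.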
You can see an example input file and its corresponding output file produced by your code here.
NOTE: Throughout this assignment (and this course!), whenever the data type "string" is mentioned, you are expected to use a c-string, which is a null-terminated array of char. Do not use the name string as a data type directly. To make a variable a string, declare it as an array of char.
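The following minimal sketch shows what working with c-strings can look like: copying, comparing, and converting to lowercase using only arrays of char. The constant MAX_WORD_LEN and the function names are hypothetical, and truncating an over-long source string is an assumption made here to keep the copy safe, not something this assignment specifies.

```cpp
#include <cctype>
#include <cstring>

const int MAX_WORD_LEN = 20;  // hypothetical constant: longest word allowed

// Convert a c-string to lowercase in place.
void toLowerCase(char text[])
{
    for (int i = 0; text[i] != '\0'; i++)
        text[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(text[i])));
}

// Copy src into dest, which must hold MAX_WORD_LEN + 1 chars.
// Assumption: anything beyond MAX_WORD_LEN characters is cut off.
void copyWord(char dest[], const char src[])
{
    std::strncpy(dest, src, MAX_WORD_LEN);
    dest[MAX_WORD_LEN] = '\0';  // strncpy does not always terminate
}

// c-strings cannot be compared with ==; strcmp returns 0 when equal.
bool sameWord(const char a[], const char b[])
{
    return std::strcmp(a, b) == 0;
}
```

A variable declared as char word[MAX_WORD_LEN + 1] has room for 20 characters plus the terminating null character. After copyWord(word, "This") and toLowerCase(word), sameWord(word, "this") is true.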
Processing Requirements:
- You will need to prompt the user for the name of the input text file and build the name of the output text file from it. The output text file should have the same base name with the extension .out. For example, if the input file is named test.txt, then the output file should be named test.out. Do not prompt for the output file name; create it from the input file name. The names of the files are strings, i.e., arrays of char.
- Close the input file when you have completed reading and close the output file when you have completed writing.
- In addition to the title at the beginning of the program and the prompts to the user for the name of the input file, print to the screen identification of the different steps happening during the processing. As there is no other output to the screen, this is helpful in identifying the progress of the program.
- Use an array of structs to hold the words and word counts. Define a struct to hold a string (array of char) and a count (integer). Your main function will then declare an array of these structs. Assume that a word has a maximum length of 20 characters, so the string (array of char) size must be 21 to allow for the null character.
- All words stored in the array should be stored in all lowercase. So, if a word appears capitalized in one place and in lowercase in another place, these two occurrences count as two occurrences of the same word, not two different words. See the handling of the words "This" and "this" in the example output.
- Declare the array of structs in the main and pass it as a parameter to the other functions. Declaring this array as a global variable is not acceptable. No variables should be global, except for named constants.
- Break your code into meaningful functions that are short and that each perform only one well-defined, easily understood task.
- You must use the linear search algorithm to determine if a word is in the array. Remember that the array is an array of structures and that the key is a string (array of char) so the string comparison function, strcmp, must be used. The search task should be a separate function that accepts the array of structs, the number of items currently in the array, and the string value (a word) for which to search. The search function should return the position (index, subscript, an integer) of where in the array the value (word) is found or a -1 if not found. The search function must have only one return statement.
- Use the selection sort algorithm (given in the array module) to sort the array of structs. To use the given algorithm, you must change it from handling arrays of integers to handling an array of structs. The items being compared will be strings (arrays of char), which will require the use of the strcmp function.
- Use good programming style. You will, as usual, be graded on: documentation of the program and functions (each function should have its own comment prolog), names of variables, indentation of statements, correct use of loops, correct use of structs and functions and everything else that you learned about this semester!
- Functions are limited to a length of about 25 statements (not including declarations). Use good modularization criteria to divide longer functions into sub-functions.
- The example programs in this module demonstrate both reading from and writing to a file. A word is defined as one or more characters terminated by one or more spaces, the end of a line, or the end of a file. Your program should also remove any punctuation from the beginning and end of words. If punctuation is found inside a word (not at beginning or end, such as a hyphen), your program should consider that symbol as part of that word. All words must be stored in lower case since the case of words in the file must be ignored.
- You can assume a maximum of 500 UNIQUE words. If there are more than 500 UNIQUE words, print an error message (ONCE) and then continue reading the file to count words that have already been stored in the array.
- Here is an algorithm for the function to read the input file, count words, and store them into the array of structs:
    while not end of file
        read a word from file
        search for this word in the array of structs
        if the word is in the array
            increment the count in the struct at the position where the word was found
        else
            add the word to the end of the array and set the count to 1
    end while
- The text file, essay.txt, may be used to test your program. However, you will want to begin your testing with very small text files containing only one or two words at first. Increase the length of the text file used for testing as you become more confident that your solution is working.
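The search and counting requirements above can be sketched roughly as follows. This is one possible arrangement, not a required design: the names WordEntry, findWord, recordWord, countWords, MAX_WORD_LEN, and MAX_WORDS are all hypothetical. To stay short, the sketch leaves out lowercase conversion and punctuation removal, and it reads from any input stream so it can be tried without a file. Note that reading with a width limit splits a word longer than 20 characters into pieces, which the real program may want to handle differently.

```cpp
#include <cstring>
#include <iomanip>
#include <iostream>

const int MAX_WORD_LEN = 20;  // hypothetical constant: longest word allowed
const int MAX_WORDS = 500;    // hypothetical constant: most unique words

// Hypothetical struct: one unique word and its number of occurrences.
struct WordEntry
{
    char word[MAX_WORD_LEN + 1];
    int count;
};

// Linear search with a single return statement.
// Returns the index of target in list, or -1 if it is not present.
int findWord(const WordEntry list[], int size, const char target[])
{
    int position = -1;
    for (int i = 0; i < size && position == -1; i++)
    {
        if (std::strcmp(list[i].word, target) == 0)
            position = i;
    }
    return position;
}

// Count one occurrence of word. Returns false only when the word is
// new and the array is already full, so it could not be stored.
bool recordWord(WordEntry list[], int &size, const char word[])
{
    bool stored = true;
    int position = findWord(list, size, word);
    if (position != -1)
    {
        list[position].count++;
    }
    else if (size < MAX_WORDS)
    {
        std::strncpy(list[size].word, word, MAX_WORD_LEN);
        list[size].word[MAX_WORD_LEN] = '\0';
        list[size].count = 1;
        size++;
    }
    else
    {
        stored = false;
    }
    return stored;
}

// Read whitespace-separated words until the stream is exhausted.
void countWords(std::istream &in, WordEntry list[], int &size)
{
    char buffer[MAX_WORD_LEN + 1];
    bool warned = false;

    // setw keeps the extraction from writing past the end of buffer.
    while (in >> std::setw(MAX_WORD_LEN + 1) >> buffer)
    {
        // Lowercasing and punctuation stripping would go here.
        if (!recordWord(list, size, buffer) && !warned)
        {
            std::cerr << "Error: more than " << MAX_WORDS
                      << " unique words; new words are ignored." << std::endl;
            warned = true;  // print the message only once
        }
    }
}
```

Because countWords accepts a std::istream reference, the same function works with an open std::ifstream for the real input file or with a small in-memory stream while testing with one or two words.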
Turn in a single source code file named data.cpp.