Hand in: A printed report about your experiments. Electronic copies of your report, source and parameter files (online hand-in). Use Excel to create performance graphs, to be included in your report. Submit electronic copies of all data and report files using the "submit5p71" script on sandcastle. It will submit the entire directory substructure of your assignment.
System: You are free to use any GP system you want (ECJ, lilGP, Open Beagle, RobGP, etc.).The system that will be discussed in class is lilGP 1.1. This is a C-based GP system with the same functionality as Koza's GP system in his first and second books. It also has many additional enhancements (strong typing, multiple populations, etc.). However, ECJ has been the most popular choice for students, as it is more contemporary, better maintained, has more features, and is documented. See the end of the assignment for the location of these systems.
A. Symbolic regression
This question will introduce you to genetic programming. You are to take an existing symbolic regression example, compile it for the platform you are using, and execute it to get some results. The lilGP, ECJ, and Open Beagle systems all include this example. Once you get it running, modify the training initialization setup, by reading in the training points from a text file. You should have the total number of points to process and the name of the text file as parameters in the "input file" user parameter file. This will let you run your system on new data sets, without having to recompile the GP system. (Note that this modification will be useful for part B below).
Try the system a few times, on new data you have put into a new training file. Also try it on a few variations of GP languages. For example, try it on the default language, as well as a stronger language (add new functions).
Decide on two comparative experiments to do, using the same training data. You might compare the GP language variants, or selection strategy, or fitness evaluation scoring. Do 10 runs using each of these experimental variations. Your report should compare these two experiments. A summary table consisting of the average over 10 runs of all the population averages, and average "best" at the end of each run, should be computed for each experiment, and put into a table. Performance graphs showing the overall run averages of the population average and best per generations should also be plotted in Excel.
B. Auto MPG Data Set
The Auto MPG data set is a famous classification problem from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/Auto+MPG
You are to use genetic programming to evolve a rule that predicts the fuel consumption (mpg, miles per gallon) for a given var. The data set has 398 cases, all with the following numeric attributes:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
Here are three examples from the database, for three different cars:
25.0 4 113.0 95.00 2228. 14.0 71 3 "toyota corona"
25.0 4 98.00 ? 2046. 19.0 71 1 "ford pinto"
19.0 6 232.0 100.0 2634. 13.0 71 1 "amc gremlin"
Attributes 1, 3, 4, 5, and 6 are continuous (floating point); attributes 2, 7, 8 are discrete (integers). Attribute 9 is a string of the car name. You should ignore this value, as it will not be useful for GP to use. (You can load the data into Excel, and delete that column). Once this is done, then all values in the data are numeric (with an exception mentioned below). Since all the values are numeric, it is not necessary to use strongly typed GP. (However, strong typing can still be useful.)
Attribute #1 is the MPG for the car. It is the "answer value" for a car example. We want to evolve a GP program will evolve an expression that uses attributes 2 through 8, to determine attribute 1. For example, if we supply an evolved GP expression with the attribute values for the Toyota Corona, we want the GP program to return a value close to 25.0. Of course, do not give your GP tree expression access to the MPG value, because it is the solution (that would be like giving students an exam key during an exam!).
The Ford Pinto example has a missing horsepower ("?"). There are 6 examples in the database that have missing values like this. Please delete these examples the database.
Here is a recommended approach to follow.
1. Read the database into a large table, one row per example. You should randomly shuffle the rows. Then split the table into 2 independent sets of examples - a training set, and a testing set. The exact size of these sets is up to you, and may have to be experimentally determined. Be aware that machine learning is usually affected by the size of training sets. For example, sometimes smaller training sets are more effective.
2. Define a set of language primitives to be used by GP. These primitives should work sensibly on the input data. A possible set of primitives to consider are:
a. functions: arithmetic operators, min and max, relationals, if-then-else,...
b. terminals: attribute variables (7), and ephemeral random constants.
3. The top-level wrapper that executes each classifier program works as follows. The 7 attribute values, from a row in the table, are copied into a set of 7 variables accessible by the GP expression, and which will take the form of leaf terminals within the tree. The program is executed by the GP system interpreter using these values. The resulting answer is recorded. The fitness function will then compare this answer to the known quality solution from the example table, and the fitness score will be appropriately updated (see next point). So in other words, the following pseudo-code shows the general execution strategy:
float cylinders, displacement, horsepower, ..., origin; int ans;
loop for i=0 to N-1 : // N training examples from table cylinders = training[i][1];
displacement = training[i][2]; horsepower = training[i][3];
(... etc. all the way to training[i][7])
computed_mpg = execute_tree(t); // t is the current GP tree
// now the predicted quality found in ans can be compared to
// real recorded quality of that example,
// perhaps stored in integer array ans[i] (i.e. training[i][0]).
4. Fitness is based upon how well the GP answers match the "real" mpg for cars as recorded in the database. There are a few options.
(i) Compute a sum of absolute error between the actual mpg and target mpg for a car:
∑ abs( computed_mpg - ans[i] ) This summation is computed over all the training examples.
(ii) Compute the sum of errors squared:
∑ (computed_mpg - ans[i]) ^ 2 This tends to punish large differences in predicted mpg.
Note that both these formula prefer smaller sums (an exact match would have a sum of 0.0).
5. At the end of the run, you must test the best GP solution by running it on the testing set (i.e. all the examples not used for training). This gives an idea of the generality of your evolved solution when given new data it has never seen. Record the test performance of each run. This information will be included in the report.
6. Do 10 runs, each with different random number seeds and shuffled data sets. Collate the performance of your runs together. Your report should have a table summarizing all the runs, for both the training scores and testing scores. You should report the average best solution scores, as well as the scores for the overall best solution from all runs. Include a confusion matrix with each experiment. Also include performance
graphs showing the population fitness and best fitness plotted over generations (averaged over for all the runs).
7. Your report should describe all relevant aspects of your experiment. The goal of the report is to give enough information so that someone else could reproduce your experiments. Describe your problem set (and its source), and describe what your GP program is attempting to do. Be sure to describe your GP language, all GP parameters, and your fitness evaluation methodology. Include source code of evolved GP programs for your 2 best rules obtained. If they are large, they can be put into an appendix of your report. Include a discussion on your results. Tables and graphs do not speak for themselves, and so you must describe and discuss the important features in them. Also include a conclusion section that summarizes the strengths or weaknesses of your experiment.
The report format should be:
1. Introduction: problem description
2. Experiment details:
a. parameters for GP
b. fitness function
c. GP language
d. format of data
e. variations of experiments
3. Results
a. tables of results (best, avg, etc.), confusion matrices
b. graphs
4. Analytical discussion of results
5. Conclusions
6. Bibliography (data set URL, references,...)
Please write separate reports for parts A Regression and B Auto MPG.
https://drive.google.com/file/d/0B3-8Q4yDXVU8SkNDRW4xT0gzWUk/view
Attachment:- Genetic programming.zip
Attachment:- Tutorial.txt