Problem
Predicting Delayed Flights (Boosting). The file FlightDelays.csv contains information on all commercial flights departing the Washington, DC area and arriving at New York during January 2004. For each flight there is information on the departure and arrival airports, the distance of the route, the scheduled time and date of the flight, and so on. The variable that we are trying to predict is whether or not a flight is delayed. A delay is defined as an arrival that is at least 15 minutes later than scheduled.
Data Preprocessing. Transform the day-of-week variable into a categorical variable. Bin the scheduled departure time into eight bins (in R, use the function cut()). Partition the data into training and validation sets. Run a boosted classification tree for delay. Leave the default number of weak learners and select resampling. Set the maximum number of levels to display to 6 and the minimum number of records in a terminal node to 1.
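The binning step can be illustrated with a minimal sketch in plain Python. It assumes the scheduled departure time is stored as a clock value in hhmm form (for example, 1455 for 14:55) and uses one possible equal-width scheme. The function names and the sample times are hypothetical and are not taken from FlightDelays.csv. R's cut() may place values that fall exactly on a bin boundary differently.

```python
def hhmm_to_minutes(hhmm):
    """Convert a clock time such as 1455 (14:55) to minutes after midnight."""
    hours, minutes = divmod(hhmm, 100)
    return hours * 60 + minutes


def equal_width_bins(values, n_bins=8):
    """Assign each value to one of n_bins equal-width bins, numbered 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls in the first bin.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would otherwise land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical scheduled departure times in hhmm form (illustration only).
sample_times = [600, 730, 900, 1245, 1455, 1700, 1930, 2200]
minutes = [hhmm_to_minutes(t) for t in sample_times]
bins = equal_width_bins(minutes)
print(bins)
```

Converting to minutes first keeps the bins equal in elapsed time, since hhmm values skip from 1259 to 1300 and are not evenly spaced as plain integers.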
a. Compared with the single tree, how does the boosted tree behave in terms of overall accuracy?
b. Compared with the single tree, how does the boosted tree behave in terms of accuracy in identifying delayed flights?
c. Explain why this model might perform best among the models you fit.
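Parts a and b ask about two different measures. As a rough sketch, overall accuracy is the share of all validation records classified correctly, while accuracy in identifying delayed flights is the share of truly delayed flights that the model flags as delayed (often called sensitivity or recall). The function name, the class labels "delayed" and "ontime", and the sample predictions below are made up for illustration and are not results from the data.

```python
def accuracy_and_sensitivity(actual, predicted, positive="delayed"):
    """Return (overall accuracy, sensitivity for the positive class)."""
    pairs = list(zip(actual, predicted))
    correct = sum(1 for a, p in pairs if a == p)
    true_pos = sum(1 for a, p in pairs if a == positive and p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    accuracy = correct / len(pairs)
    # Sensitivity is undefined when there are no positive records.
    sensitivity = true_pos / actual_pos if actual_pos else float("nan")
    return accuracy, sensitivity


# Made-up validation labels and predictions (illustration only).
actual = ["ontime", "ontime", "ontime", "delayed", "delayed",
          "ontime", "ontime", "delayed", "ontime", "ontime"]
predicted = ["ontime", "ontime", "ontime", "ontime", "delayed",
             "ontime", "delayed", "ontime", "ontime", "ontime"]

acc, sens = accuracy_and_sensitivity(actual, predicted)
print(f"overall accuracy: {acc:.2f}, sensitivity for delayed: {sens:.2f}")
```

In this toy example the two measures diverge: most records are classified correctly, yet only one of the three delayed flights is caught. This is why the two questions are asked separately when delays are the minority class.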