Problem
1. An ε-greedy method always selects a random action on a fraction, ε, of the time steps. How about the pursuit algorithm? Will it eventually select the optimal action with probability approaching certainty?
2. For many of the problems we will encounter later in this book it is not feasible to directly update action probabilities. To use pursuit methods in these cases it is necessary to modify them to use action preferences that are not probabilities, but which determine action probabilities according to a softmax relationship such as the Gibbs distribution. How can the pursuit algorithm described above be modified to be used in this way? Specify a complete algorithm, including the equations for action-values, preferences, and probabilities at each play.
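As a starting point for Problem 2, the softmax relationship it mentions can be sketched as follows. This is only an illustration of how preferences might be turned into probabilities with a Gibbs distribution, not a solution to the exercise. The function name gibbs_probabilities and the temperature parameter are assumptions introduced here, not notation from the text.

```python
import math


def gibbs_probabilities(preferences, temperature=1.0):
    """Map real-valued action preferences to action probabilities.

    Hypothetical helper: the name and the temperature parameter are
    assumptions made for this sketch.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [p / temperature for p in preferences]
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    largest = max(scaled)
    weights = [math.exp(s - largest) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Three actions, with the third preferred over the other two.
probs = gibbs_probabilities([0.0, 0.0, 1.0])
print(probs)
```

Because the probabilities depend only on differences between preferences, adding the same constant to every preference leaves them unchanged. A preference-based pursuit method would presumably adjust the preferences at each play and recompute the probabilities this way; the update rule itself is what the problem asks for.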