Problem
Consider a simplified supervised learning problem in which there is only one situation (input pattern) and two actions. One action, say a, is correct and the other, b, is incorrect. The instruction signal is noisy: with probability p it instructs the wrong action, that is, it says that b is correct. You can think of this as a binary bandit problem if you treat agreeing with the (possibly wrong) instruction signal as success, and disagreeing with it as failure. Discuss the resulting class of binary bandit problems. Is there anything special about these problems? How does the supervised algorithm perform on this type of problem?
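A minimal simulation sketch of the setup may help in exploring the question. The function name, the trial count, and the particular supervised rule assumed here (play whichever action the most recent instruction labeled correct) are illustrative assumptions, not part of the problem statement; note that treating agreement as success gives the two actions success probabilities 1 - p and p, which always sum to 1.

import random

def run_noisy_supervised_bandit(p, n_trials=10_000, seed=0):
    """Simulate the noisy-instruction problem sketched above.

    Action 0 ("a") is correct, action 1 ("b") is incorrect.  On every
    trial the instruction signal names the wrong action with
    probability p, so agreeing with it succeeds with probability
    1 - p when playing a and with probability p when playing b.

    The assumed supervised rule simply plays whichever action the
    most recent instruction labeled correct.
    """
    rng = random.Random(seed)
    chosen = 0                      # start by playing a (arbitrary choice)
    correct_plays = 0
    for _ in range(n_trials):
        if chosen == 0:
            correct_plays += 1      # we actually played the correct action
        # Instruction signal: names b (action 1) with probability p.
        instructed = 1 if rng.random() < p else 0
        # Supervised update: next time, do what we were just told.
        chosen = instructed
    return correct_plays / n_trials

if __name__ == "__main__":
    for p in (0.1, 0.4, 0.6, 0.9):
        frac = run_noisy_supervised_bandit(p)
        print(f"p = {p:.1f}: fraction of correct plays = {frac:.3f}")

Under this assumed rule the choice on each trial just echoes the previous instruction, so the simulation is mainly useful for contrasting small and large values of p when discussing the questions above.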