Problem
Consider the problem of learning a Bayesian network structure over two random variables X and Y.
a. Show a data set (an empirical distribution together with a number of samples M) for which the optimal network structure according to the BIC scoring function differs from the optimal network structure according to the ML scoring function.
b. Assume that we continue to get more samples that exhibit precisely the same empirical distribution. (For simplicity, we restrict attention to values of M that allow that empirical distribution to be achieved; for example, an empirical distribution of 50 percent heads and 50 percent tails can be achieved only for an even number of samples.) At what value of M will the network that optimizes the BIC score be the same as the network that optimizes the likelihood score?
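A minimal sketch for experimenting with both parts numerically. It assumes the standard definitions: the likelihood score is the maximized log-likelihood of the data, and the BIC score subtracts (log M / 2) times the number of independent parameters. Only two structures need comparing, the empty graph and a single edge X -> Y, because Y -> X is equivalent to X -> Y and scores the same. The function names, the use of natural logarithms, and the example joint distribution are illustrative assumptions and not part of the problem statement.

```python
import math


def entropy(dist):
    """Entropy (in nats) of a discrete distribution given as a list."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mutual_information(joint):
    """I(X;Y) in nats for a joint table joint[x][y] of empirical probabilities."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log(p / (px[i] * py[j]))
    return mi


def bic_and_likelihood(joint, m):
    """Likelihood and BIC scores of the empty graph and of X -> Y.

    joint is the empirical distribution, m the number of samples.
    """
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    h = entropy(px) + entropy(py)
    mi = mutual_information(joint)

    # Maximized log-likelihoods: -M*(H(X)+H(Y)) and -M*H(X,Y).
    ll_empty = -m * h
    ll_edge = m * (mi - h)

    # Independent parameters of each structure.
    dim_empty = (len(px) - 1) + (len(py) - 1)
    dim_edge = len(px) * len(py) - 1

    penalty = math.log(m) / 2
    return {
        "likelihood": {"empty": ll_empty, "edge": ll_edge},
        "bic": {
            "empty": ll_empty - penalty * dim_empty,
            "edge": ll_edge - penalty * dim_edge,
        },
    }


if __name__ == "__main__":
    # Hypothetical weakly dependent empirical distribution over binary X, Y.
    example = [[0.3, 0.2], [0.2, 0.3]]
    for m in (10, 100, 200):  # each m yields integer counts for this table
        print(m, bic_and_likelihood(example, m))
```

Under these assumptions the two likelihood scores differ by M * I(X;Y), while the two BIC penalties differ by (log M / 2) times the difference in parameter counts, so comparing those two quantities as M grows is one way to approach part b.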