Problem
1. Describe how the actor-critic control method could be combined with gradient-descent function approximation.
2. Look up the paper by Baird (1995) on the internet and obtain his counterexample for Q-learning. Implement it and demonstrate the divergence.
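
As a possible starting point for the second problem, here is a minimal sketch of a closely related setup. It is not Baird's original Q-learning formulation. It assumes the seven-state, eight-weight state-value variant of the counterexample as it is often presented in textbooks: all rewards are zero, the target policy always moves to the seventh state, and the weights are updated by an expected (sweep over all states) semi-gradient step. The feature layout, the initial weights, the discount factor 0.99, the step size 0.01, and the names build_features and run_sweeps are all assumptions made for illustration.

```python
import numpy as np

GAMMA = 0.99  # discount factor (assumed value)
ALPHA = 0.01  # step size (assumed value)


def build_features():
    """Assumed feature matrix: one row per state, one column per weight.

    States 0..5 have value 2*w[i] + w[7]; state 6 has value w[6] + 2*w[7].
    """
    X = np.zeros((7, 8))
    for i in range(6):
        X[i, i] = 2.0
        X[i, 7] = 1.0
    X[6, 6] = 1.0
    X[6, 7] = 2.0
    return X


def run_sweeps(n_sweeps=1000):
    """Apply expected semi-gradient updates and record the weight norm."""
    X = build_features()
    w = np.array([1.0] * 6 + [10.0, 1.0])  # assumed initial weights
    norms = [np.linalg.norm(w)]
    for _ in range(n_sweeps):
        v = X @ w
        # Reward is 0 and the target policy always leads to state index 6,
        # so the expected TD error in every state is GAMMA * v[6] - v[s].
        td_errors = GAMMA * v[6] - v
        # Average the semi-gradient update uniformly over the 7 states.
        w = w + (ALPHA / 7.0) * (X.T @ td_errors)
        norms.append(np.linalg.norm(w))
    return w, norms


if __name__ == "__main__":
    w, norms = run_sweeps()
    for k in (0, 250, 500, 750, 1000):
        print(k, norms[k])
```

Under these assumptions the printed weight norm keeps growing with the number of sweeps, even though the true values are all zero and are exactly representable with w = 0. Adapting the sketch to the problem as stated would mean replacing the state-value features with the action-value features from Baird's paper and using the Q-learning target.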