Assignment #15
Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; you should discuss at least some of these, but
you don't have to address them all.
Markov Games
Questions
- Why is it that in games such as checkers, backgammon, and Go, "the
minimax operator (in minimax Q) can be implemented extremely
efficiently"?
- What are some real-life examples of Markov Games?
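For the first question, the following sketch (my own illustration, not code from Littman's paper) may help frame why the minimax operator is cheap in alternating-move games: when only one player acts at a state, the "matrix game" has a single row or column, so no linear program over mixed strategies is needed.

```python
# Sketch (illustrative, not from the paper): in a general Markov game,
#   V(s) = max over mixed strategies pi, min over opponent actions o of
#          sum_a pi(a) * Q(s, a, o),
# which requires solving a linear program. In checkers, backgammon, and Go
# only one player moves at each state, so the operator collapses to a
# plain max (or min) over that player's actions.

def minimax_value_alternating(q_row, mover):
    """Value of a state where only `mover` acts.
    q_row: Q-values, one per available action (hypothetical numbers).
    mover: 'max' for the maximizing player, 'min' for the minimizing one.
    """
    return max(q_row) if mover == 'max' else min(q_row)

# Contrast with a simultaneous-move game like matching pennies,
# Q = [[+1, -1], [-1, +1]], whose value (0) is achieved only by the
# mixed strategy (1/2, 1/2) -- that case needs the LP.

print(minimax_value_alternating([0.2, -0.5, 0.7], 'max'))  # -> 0.7
print(minimax_value_alternating([0.2, -0.5, 0.7], 'min'))  # -> -0.5
```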
TD-Gammon
Questions
- Compare TD(0), that is TD(lambda) for lambda = 0, to Deep Q-learning (without experience replay). It might be useful to skim Chapter 6 of Sutton and Barto's book (2nd edition, available on-line).
- What is the role of non-zero lambda? Is there a connection to experience replay?
- Why can TD-Gammon get away with estimating a value function (called "equity" here) instead of a Q-value function?
- What arguments does Tesauro give for why self-play works in this domain? Discuss this in light of Littman's discussion in the Markov Games paper.
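For the TD(0) vs. Q-learning comparison, the tabular forms of the two updates (a standard textbook sketch, following Sutton and Barto; the tiny example values are made up) put the difference side by side:

```python
# Tabular TD(0) vs. tabular Q-learning updates. Both bootstrap from a
# one-step target; TD(0) learns a state-value function V (as TD-Gammon
# does, in gradient form), while Q-learning learns state-action values
# with a max over next actions (off-policy).

ALPHA, GAMMA = 0.1, 1.0  # illustrative step size and discount

def td0_update(V, s, r, s_next):
    # Move V(s) toward the bootstrapped target r + gamma * V(s').
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

def q_learning_update(Q, s, a, r, s_next):
    # Same idea over (state, action) pairs, with a max over next actions.
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

V = {0: 0.0, 1: 0.0}
td0_update(V, 0, 1.0, 1)
print(V[0])  # -> 0.1

Q = {0: {'a': 0.0, 'b': 0.0}, 1: {'a': 1.0, 'b': 0.0}}
q_learning_update(Q, 0, 'a', 0.0, 1)
print(Q[0]['a'])  # -> 0.1
```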
AlphaZero
You may find that reading the
AlphaGo Zero paper gives you a better view of what is actually going on (this
cheatsheet is also useful). Note the changes between AlphaGo Zero and AlphaZero outlined at the start of the AlphaZero paper.
Questions
- Discuss the following quote from the AlphaGo Zero paper:
"Monte-Carlo tree search (MCTS) may also be viewed as a form of
self-play reinforcement learning. The nodes of the search tree contain
the value function for the positions encountered during search; these
values are updated to predict the winner of simulated games of self-play."
- Compare and contrast Alpha Zero and TD-Gammon.
- The authors claim AlphaZero learns "tabula rasa". To what
extent is this true? What is baked in?
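When thinking about the quoted MCTS passage, it may help to see the backup step it describes in miniature. The sketch below (my own illustration; the node fields are not AlphaGo Zero's actual data structure) shows each tree node keeping a running mean value that is updated toward the outcomes of simulated self-play games, i.e. a prediction of the winner.

```python
# Illustrative MCTS backup: each node tracks a visit count and a running
# mean of simulation outcomes from that position.

class Node:
    def __init__(self):
        self.visits = 0
        self.value = 0.0  # running mean of simulated-game outcomes

    def backup(self, outcome):
        # Incremental mean: the stored value drifts toward the average
        # result of simulated self-play games through this node.
        self.visits += 1
        self.value += (outcome - self.value) / self.visits

n = Node()
for outcome in [1, 1, 0, 1]:  # results of four simulated games (win=1, loss=0)
    n.backup(outcome)
print(round(n.value, 6))  # -> 0.75
```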
Upload a single PDF file through Canvas by
April 9th at 1 pm.