Assignment #15
Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; you should discuss at least some of these, but
you don't have to address them all.
Markov Games
Questions
- Why is it that in games such as checkers, backgammon, and Go, "the
minimax operator (in minimax Q) can be implemented extremely
efficiently"?
- What are some real-life examples of Markov Games?
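For the first question, the following sketch (my own illustration, not code from Littman's paper) may help frame why the minimax operator is cheap in alternating-move games: when only one player acts at a state, the "matrix game" has a single row or column, so no linear program over mixed strategies is needed.

```python
# Sketch (illustrative, not from the paper): in a general Markov game,
#   V(s) = max over mixed strategies pi, min over opponent actions o of
#          sum_a pi(a) * Q(s, a, o),
# which requires solving a linear program. In checkers, backgammon, and Go
# only one player moves at each state, so the operator collapses to a
# plain max (or min) over that player's actions.

def minimax_value_alternating(q_row, mover):
    """Value of a state where only `mover` acts.
    q_row: Q-values, one per available action (hypothetical numbers).
    mover: 'max' for the maximizing player, 'min' for the minimizing one.
    """
    return max(q_row) if mover == 'max' else min(q_row)

# Contrast with a simultaneous-move game like matching pennies,
# Q = [[+1, -1], [-1, +1]], whose value (0) is achieved only by the
# mixed strategy (1/2, 1/2) -- that case needs the LP.

print(minimax_value_alternating([0.2, -0.5, 0.7], 'max'))  # -> 0.7
print(minimax_value_alternating([0.2, -0.5, 0.7], 'min'))  # -> -0.5
```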
TD-Gammon
Questions
- Compare TD(0), that is TD(lambda) for lambda = 0, to Deep Q-learning (without experience replay). It might be useful to skim Chapter 6 of Sutton and Barto's book (2nd edition, available on-line).
- What is the role of non-zero lambda? Is there a connection to experience replay?
- Why can TD-Gammon get away with estimating a value function (called "equity" here) instead of a Q-value function?
- What arguments does Tesauro give for why self-play works in this domain? Discuss this in light of Littman's discussion in the Markov Games paper.
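For the TD(0) vs. Q-learning comparison, the tabular forms of the two updates (a standard textbook sketch, following Sutton and Barto; the tiny example values are made up) put the difference side by side:

```python
# Tabular TD(0) vs. tabular Q-learning updates. Both bootstrap from a
# one-step target; TD(0) learns a state-value function V (as TD-Gammon
# does, in gradient form), while Q-learning learns state-action values
# with a max over next actions (off-policy).

ALPHA, GAMMA = 0.1, 1.0  # illustrative step size and discount

def td0_update(V, s, r, s_next):
    # Move V(s) toward the bootstrapped target r + gamma * V(s').
    V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

def q_learning_update(Q, s, a, r, s_next):
    # Same idea over (state, action) pairs, with a max over next actions.
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] += ALPHA * (target - Q[s][a])

V = {0: 0.0, 1: 0.0}
td0_update(V, 0, 1.0, 1)
print(V[0])  # -> 0.1

Q = {0: {'a': 0.0, 'b': 0.0}, 1: {'a': 1.0, 'b': 0.0}}
q_learning_update(Q, 0, 'a', 0.0, 1)
print(Q[0]['a'])  # -> 0.1
```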
AlphaZero
You may find that reading the
AlphaGo Zero paper gives you a better view of what is actually going on (this
cheatsheet is also useful). Note the changes between AlphaGo Zero and AlphaZero outlined at the start of the AlphaZero paper.
Questions
- Discuss the following quote from the AlphaGo Zero paper:
"Monte-Carlo tree search (MCTS) may also be viewed as a form of
self-play reinforcement learning. The nodes of the search tree contain
the value function for the positions encountered during search; these
values are updated to predict the winner of simulated games of self-play."
- Compare and contrast Alpha Zero and TD-Gammon.
- The authors claim AlphaZero learns "tabula rasa". To what
extent is this true? What is baked in?
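When thinking about the quoted MCTS passage, it may help to see the backup step it describes in miniature. The sketch below (my own illustration; the node fields are not AlphaGo Zero's actual data structure) shows each tree node keeping a running mean value that is updated toward the outcomes of simulated self-play games, i.e. a prediction of the winner.

```python
# Illustrative MCTS backup: each node tracks a visit count and a running
# mean of simulation outcomes from that position.

class Node:
    def __init__(self):
        self.visits = 0
        self.value = 0.0  # running mean of simulated-game outcomes

    def backup(self, outcome):
        # Incremental mean: the stored value drifts toward the average
        # result of simulated self-play games through this node.
        self.visits += 1
        self.value += (outcome - self.value) / self.visits

n = Node()
for outcome in [1, 1, 0, 1]:  # results of four simulated games (win=1, loss=0)
    n.backup(outcome)
print(round(n.value, 6))  # -> 0.75
```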
Upload a single PDF file through Canvas by
April 9th at 1 pm.