Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; you should discuss at least some of them, but you don't have to address them all.
Bandit-based Monte-Carlo Planning
Optional sections: 2.4
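As background for the discussion, the core of UCT is applying the UCB1 bandit rule at each node of the search tree. Below is a minimal, hedged sketch of UCB1 selection on a standalone bandit; the two Bernoulli arms and the exploration constant C are illustrative choices, not values from the paper.

```python
# Minimal sketch of UCB1 arm selection (the rule UCT applies at tree nodes):
# pick the arm maximizing empirical mean reward plus an exploration bonus.
import math
import random

random.seed(0)

def ucb1_pick(counts, sums, C=1.4):
    """Pick the arm maximizing s_i/n_i + C*sqrt(ln(N)/n_i)."""
    total = sum(counts)
    best, best_val = None, -float("inf")
    for i, (n, s) in enumerate(zip(counts, sums)):
        if n == 0:
            return i  # try every arm once before comparing bonuses
        val = s / n + C * math.sqrt(math.log(total) / n)
        if val > best_val:
            best, best_val = i, val
    return best

# Two Bernoulli arms with success probabilities 0.3 and 0.7 (toy example).
probs = [0.3, 0.7]
counts, sums = [0, 0], [0.0, 0.0]
for _ in range(2000):
    i = ucb1_pick(counts, sums)
    r = 1.0 if random.random() < probs[i] else 0.0
    counts[i] += 1
    sums[i] += r
print(counts)  # the better arm (index 1) should receive most of the pulls
```

In UCT the same rule is applied recursively: each tree node treats its child actions as bandit arms, with rewards backed up from rollouts below.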
FF-Replan
The all-outcomes determinization as described in this paper is very
weak: it just guarantees you'll take an action that has a non-zero
chance of leading to the goal. The algorithm also seems to require
constructing multi-step actions, which leads to an exponential
blow-up in the number of actions.
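To make the one-outcome-per-action structure concrete, here is a hedged sketch of the all-outcomes determinization itself: each probabilistic outcome becomes a separate deterministic action, so a classical planner can chain whichever outcome it likes toward the goal. The `ProbAction`/`Outcome` representation is an illustrative stand-in, not the paper's PPDDL machinery.

```python
# Sketch: all-outcomes determinization of one probabilistic action.
# Each outcome is promoted to its own deterministic action (illustrative names).
from dataclasses import dataclass

@dataclass
class Outcome:
    prob: float
    effects: frozenset   # facts made true by this outcome

@dataclass
class ProbAction:
    name: str
    precond: frozenset
    outcomes: list       # list of Outcome

def all_outcomes_determinize(action):
    """Split a probabilistic action into one deterministic action per outcome."""
    det = []
    for i, o in enumerate(action.outcomes):
        # Probabilities are dropped: the planner only sees (name, precond, effects).
        det.append((f"{action.name}_o{i}", action.precond, o.effects))
    return det

# Example: "move" succeeds with prob 0.8, stays put with prob 0.2.
move = ProbAction("move", frozenset({"at_a"}),
                  [Outcome(0.8, frozenset({"at_b"})),
                   Outcome(0.2, frozenset({"at_a"}))])
print(all_outcomes_determinize(move))
```

Note how the probabilities disappear entirely in the determinized actions, which is exactly why a plan through them only guarantees a non-zero chance of reaching the goal.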
One way to improve it is to:
Neural Fitted Q Iteration
This algorithm can be used to "solve" a known MDP with a large or continuous state space, producing a value function. It can also be used in the classical "online" reinforcement learning setting, with no known model or simulator, only interaction with an unknown environment.
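As a concrete point of reference for the discussion, here is a hedged sketch of the batch fitted Q iteration pattern behind NFQ: regress a function approximator onto Bellman targets computed from a fixed batch of transitions. A linear least-squares regressor stands in for the paper's neural network, and the tiny chain MDP is an illustrative toy, not from the paper.

```python
# Sketch of fitted Q iteration (the batch pattern behind NFQ), with a
# linear regressor substituted for the paper's neural network.
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP: states 0..4 on a line, actions -1/+1, reward 1 on reaching state 4.
def step(s, a):
    s2 = min(max(s + a, 0), 4)
    r = 1.0 if s2 == 4 else 0.0
    return s2, r, s2 == 4

# Collect a batch of random transitions (the fixed dataset NFQ fits on).
batch = []
for _ in range(500):
    s = int(rng.integers(0, 4))
    a = int(rng.choice([-1, 1]))
    s2, r, done = step(s, a)
    batch.append((s, a, r, s2, done))

def features(s, a):
    # One-hot over (state, action) pairs; a neural net would replace this.
    x = np.zeros(10)
    x[s * 2 + (0 if a == -1 else 1)] = 1.0
    return x

w = np.zeros(10)   # Q(s, a) = w . features(s, a)
gamma = 0.9
for _ in range(50):
    # Regress Q onto Bellman targets r + gamma * max_a' Q(s', a').
    X, y = [], []
    for (s, a, r, s2, done) in batch:
        target = r if done else r + gamma * max(w @ features(s2, -1),
                                                w @ features(s2, 1))
        X.append(features(s, a))
        y.append(target)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)

# The greedy policy should move right toward the rewarding state 4.
policy = [(-1 if w @ features(s, -1) > w @ features(s, 1) else 1)
          for s in range(4)]
print(policy)  # expect [1, 1, 1, 1]
```

The loop is the key thing to notice: each iteration refits the approximator on the whole batch rather than taking incremental gradient steps per sample, which is what distinguishes NFQ-style batch RL from online Q-learning.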
Upload a single PDF file through Canvas by Feb 15 at 1 pm.