Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are some questions to think about:
Bandit-based Monte-Carlo Planning
Optional sections: 2.4
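As background for the discussion, here is a minimal sketch of the UCB1-style action selection that UCT performs at each tree node. The Node structure and the exploration constant c are illustrative assumptions for this sketch, not the paper's exact pseudocode or its bias term.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical search-tree node holding UCT's per-node statistics.
    visits: int = 0
    total_reward: float = 0.0
    children: dict = field(default_factory=dict)   # action -> child Node

def uct_select(node, c=1.4):
    """Pick an action by the UCB1 rule UCT applies inside the tree:
    empirical mean return plus an exploration bonus c * sqrt(ln N / n_a)."""
    untried = [a for a, ch in node.children.items() if ch.visits == 0]
    if untried:                              # sample every action at least once
        return random.choice(untried)
    log_n = math.log(node.visits)
    return max(node.children, key=lambda a:
               node.children[a].total_reward / node.children[a].visits
               + c * math.sqrt(log_n / node.children[a].visits))
```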
FF-Replan
The all-outcomes determinization as described in this paper is very weak: it just guarantees you'll take an action that has a non-zero chance of leading to the goal. The algorithm also seems to require constructing multi-step actions, which leads to an exponential blow-up in the number of actions.
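To make the object of that critique concrete, here is a minimal sketch of an all-outcomes determinization over a simple STRIPS-like action representation. The ProbAction/DetAction structures are assumptions for illustration, not the paper's PPDDL compilation.

```python
from dataclasses import dataclass

@dataclass
class ProbAction:
    # Illustrative probabilistic action: a precondition plus a list of
    # (probability, add effects, delete effects) outcomes.
    name: str
    precond: frozenset
    outcomes: list

@dataclass
class DetAction:
    name: str
    precond: frozenset
    add: frozenset
    delete: frozenset

def all_outcomes_determinize(actions):
    """Make one deterministic action per probabilistic outcome, discarding
    the probabilities; a plan in the resulting problem is therefore only
    guaranteed to have non-zero probability of reaching the goal."""
    det = []
    for a in actions:
        for i, (p, add, dele) in enumerate(a.outcomes):
            if p > 0:
                det.append(DetAction(f"{a.name}_o{i}", a.precond,
                                     frozenset(add), frozenset(dele)))
    return det
```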
Can you think of a way to improve it?
Neural Fitted Q Iteration
This algorithm can be used to "solve" a known MDP with a large or continuous state space, producing an approximate value function. It can also be used in a classical "online" reinforcement learning setting, with no known model or simulator, only the ability to interact with an unknown world.
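Below is a minimal sketch of the batch fitted Q iteration loop, with a generic scikit-learn MLP regressor standing in for the Rprop-trained network in the paper; the transition tuple format, network size, and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_q_iteration(transitions, actions, n_iters=50, gamma=0.95):
    """Batch fitted Q iteration on a fixed set of (s, a, r, s_next, done)
    tuples: refit a Q-function from scratch each iteration on targets
    bootstrapped from the previous iteration's fit."""
    n = len(transitions)
    S  = np.array([t[0] for t in transitions], dtype=float).reshape(n, -1)
    A  = np.array([t[1] for t in transitions], dtype=float).reshape(n, 1)
    R  = np.array([t[2] for t in transitions], dtype=float)
    S2 = np.array([t[3] for t in transitions], dtype=float).reshape(n, -1)
    D  = np.array([t[4] for t in transitions], dtype=float)
    X = np.hstack([S, A])                      # regressor input is (state, action)

    q = None
    for _ in range(n_iters):
        if q is None:
            targets = R                        # first pass: immediate rewards only
        else:
            # max over the finite action set at each successor state
            q_next = np.max(np.stack(
                [q.predict(np.hstack([S2, np.full((n, 1), a)]))
                 for a in actions], axis=1), axis=1)
            targets = R + gamma * (1.0 - D) * q_next
        q = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=500)
        q.fit(X, targets)
    return q
```

In the online setting mentioned above, the same loop would simply be rerun after appending freshly collected transitions to the batch.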
Upload a single PDF file through Stellar by Feb 13 at 10 am.