Assignment #3

Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are some questions to think about:

Bandit-based Monte-Carlo Planning
Optional sections: 2.4

The all-outcomes determinization as described in this paper is very weak: it just guarantees you'll take an action that has a non-zero chance of leading to the goal. The algorithm also seems to require constructing multi-step actions, which leads to an exponential blow-up in the number of actions.

One way to improve it is to:

  1. For each action a with outcomes (s1, p1), ..., (sn, pn), generate n deterministic actions a1, ..., an, where ai leads to si and has a cost of -log pi.
  2. Search for the "shortest" path using these new actions.
  3. Take the first action on that path.
Given this approach:
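The steps above can be sketched as follows. The toy MDP, the state names, and the helper functions are illustrative stand-ins, not taken from the paper; the key point is that with edge costs of -log pi, a shortest path is a most-probable path to the goal.

```python
import heapq
import math

# Hypothetical toy MDP: each (state, action) maps to a list of
# (next_state, probability) outcomes. Purely illustrative.
MDP = {
    ("s0", "a"): [("s1", 0.9), ("dead", 0.1)],
    ("s0", "b"): [("goal", 0.2), ("s0", 0.8)],
    ("s1", "a"): [("goal", 1.0)],
}

def determinize(mdp):
    """All-outcomes determinization: each probabilistic outcome becomes
    its own deterministic action, with cost -log(p)."""
    det = []  # entries are (state, next_state, cost, action_label)
    for (s, a), outcomes in mdp.items():
        for i, (s2, p) in enumerate(outcomes):
            det.append((s, s2, -math.log(p), f"{a}#{i}"))
    return det

def shortest_path(det_actions, start, goal):
    """Dijkstra over the determinized graph. Summing -log(p) along a
    path equals -log of the path's probability, so the shortest path
    is the most probable one."""
    adj = {}
    for s, s2, c, label in det_actions:
        adj.setdefault(s, []).append((s2, c, label))
    frontier = [(0.0, start, [])]
    seen = set()
    while frontier:
        cost, s, path = heapq.heappop(frontier)
        if s == goal:
            return cost, path
        if s in seen:
            continue
        seen.add(s)
        for s2, c, label in adj.get(s, []):
            if s2 not in seen:
                heapq.heappush(frontier, (cost + c, s2, path + [label]))
    return None

cost, path = shortest_path(determinize(MDP), "s0", "goal")
print(path[0])          # the first action on the most-probable path
print(math.exp(-cost))  # that path's probability
```

On this toy instance the planner prefers the two-step route through s1 (probability 0.9) over the direct but unlikely jump to the goal (probability 0.2), which the plain all-outcomes determinization could not distinguish.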

Neural Fitted Q Iteration
This algorithm can be used to "solve" a known MDP with large or continuous state spaces, to get a value function. It can also be used in a classical "online" reinforcement learning setting, with no known model or simulator, only access to an unknown world.
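The core loop can be sketched as follows. This is a minimal fitted-Q-iteration sketch in the spirit of NFQ, not the paper's implementation: NFQ fits a neural network with Rprop on the batch targets, whereas here a plain dictionary stands in for the network so the target-then-refit structure stays visible. The toy transition batch is invented for illustration.

```python
# Fitted Q iteration over a fixed batch of experience:
# repeat (1) compute targets r + gamma * max_a' Q(s', a') with Q frozen,
# then (2) refit the approximator to those targets.
GAMMA = 0.9

# Toy batch of (s, a, r, s', terminal) transitions from a 3-state chain;
# reaching state 2 yields reward 1. Illustrative, not from the paper.
batch = [
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),
    (1, "left",  0.0, 0, False),
    (0, "left",  0.0, 0, False),
]
actions = ["left", "right"]

Q = {}  # stand-in for the neural net: maps (state, action) -> value
for _ in range(50):  # outer loop: targets from frozen Q, then "refit"
    targets = {}
    for s, a, r, s2, terminal in batch:
        future = 0.0 if terminal else max(Q.get((s2, b), 0.0) for b in actions)
        targets[(s, a)] = r + GAMMA * future
    Q = targets  # refitting a table is exact; a net would regress to these

print(round(Q[(0, "right")], 3))  # value of heading toward the reward
```

Because the batch is fixed and every refit uses targets from the frozen previous Q, the same loop works whether the transitions came from a known model, a simulator, or purely online interaction, which is what makes the offline/online distinction in the paragraph above possible.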

Upload a single PDF file through Stellar by Feb 14 at 10 am.