### Assignment #3

Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are some questions to think about:

**Bandit-based Monte-Carlo Planning**

Optional sections: 2.4

- Would UCT be useful in a problem with deterministic dynamics?
- If your domain is a tree (there are never multiple paths to the same state), is there still an advantage to using UCT?
- Is UCT guaranteed to find the optimal strategy if you give it enough computation?
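As you think about these questions, it may help to recall the core of UCT: at each tree node, actions are chosen like bandit arms using the UCB1 rule (mean value plus an exploration bonus). A minimal sketch of that selection rule, with illustrative variable names of my own choosing:

```python
import math

def ucb1_choice(counts, values, c=math.sqrt(2)):
    """Pick the arm maximizing mean value + exploration bonus (UCB1).

    counts[i]: times arm i was tried; values[i]: its mean reward so far.
    Untried arms are selected first so every arm gets at least one pull.
    """
    total = sum(counts)
    best, best_score = None, float("-inf")
    for i, (n, v) in enumerate(zip(counts, values)):
        if n == 0:
            return i  # try each arm at least once
        score = v + c * math.sqrt(math.log(total) / n)
        if score > best_score:
            best, best_score = i, score
    return best
```

Note how the bonus term shrinks as an arm is tried more often: an arm with a low observed mean is still revisited eventually, which is the mechanism behind UCT's asymptotic guarantees.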

**FF-Replan**

The all-outcomes determinization as described in this paper is very
weak: it just guarantees you'll take an action that has a non-zero
chance of leading to the goal. The algorithm also seems to require
constructing multi-step actions, which leads to an exponential
blow-up in the number of actions.

One way to improve it is to:

- For each action a with outcomes (s1, p1), ..., (sn, pn), generate n deterministic actions a1, ..., an, where action ai leads deterministically to state si and has cost -log pi.
- Search for the cheapest ("shortest") path to the goal using these new actions and costs.
- Take the first action on that path.
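The steps above can be sketched in a few lines. This is a toy illustration under assumptions of my own (states as strings, a hand-built action list), not the FF-Replan implementation: because costs are -log pi, a minimum-cost path is a maximum-probability path, assuming outcome probabilities along the path multiply independently.

```python
import heapq
import math

def determinize(actions):
    """actions: list of (state, action_name, [(next_state, prob), ...]).
    Returns edges: state -> list of (cost, next_state, action_name),
    one deterministic edge per outcome, with cost -log(prob)."""
    edges = {}
    for state, name, outcomes in actions:
        for nxt, p in outcomes:
            edges.setdefault(state, []).append((-math.log(p), nxt, name))
    return edges

def first_action_on_cheapest_path(edges, start, goal):
    """Dijkstra over -log p costs; return the first action of the best plan."""
    frontier = [(0.0, start, None)]  # (cost so far, state, first action taken)
    seen = set()
    while frontier:
        cost, state, first = heapq.heappop(frontier)
        if state == goal:
            return first
        if state in seen:
            continue
        seen.add(state)
        for c, nxt, name in edges.get(state, []):
            if nxt not in seen:
                heapq.heappush(frontier, (cost + c, nxt, first or name))
    return None  # goal unreachable
```

For example, if action "a" reaches the goal directly with probability 0.9 but a two-step plan "b" then "c" reaches it with probability 1.0, the search prefers the two-step plan (total cost 0 versus -log 0.9).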

Given this approach:

- What can you say about the paths selected in this way?
- How could you construct a domain that this method would perform poorly on?

**Neural Fitted Q Iteration**

This algorithm can be used to "solve" a known MDP with a large or continuous state space, yielding a value function. It can also be used in the classical "online" reinforcement learning setting, with no known model or simulator, only access to an unknown world.

- Are there limitations to its application in real environments?
- What algorithmic choices might we make differently in these two settings?
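It may help to see the skeleton of fitted Q iteration: repeatedly build regression targets from a fixed batch of transitions, then refit the function approximator from scratch. This is a sketch under my own simplifications: NFQ fits a neural network (trained with Rprop) each iteration, whereas here `fit` is a pluggable stand-in, shown with a trivial tabular averager.

```python
def fitted_q_iteration(transitions, actions, fit, gamma=0.99, iters=50):
    """Generic fitted Q iteration over a fixed batch of transitions.

    transitions: list of (s, a, r, s_next, done).
    fit: takes [((s, a), target), ...] and returns a callable q(s, a).
         NFQ's choice of fit is a neural network trained to convergence
         on the whole pattern set each iteration.
    """
    q = lambda s, a: 0.0
    for _ in range(iters):
        data = []
        for s, a, r, s2, done in transitions:
            target = r if done else r + gamma * max(q(s2, b) for b in actions)
            data.append(((s, a), target))
        q = fit(data)  # refit the approximator on fresh targets
    return q

def tabular_fit(data):
    """Stand-in regressor: average the targets for each (s, a) pair."""
    sums, counts = {}, {}
    for key, t in data:
        sums[key] = sums.get(key, 0.0) + t
        counts[key] = counts.get(key, 0) + 1
    table = {k: sums[k] / counts[k] for k in sums}
    return lambda s, a: table.get((s, a), 0.0)
```

Notice that the batch of transitions never changes inside the loop; in the online setting you would instead interleave acting (to collect new transitions) with refitting, which is one of the algorithmic choices the second question asks about.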

Upload a single PDF file through Stellar by **Feb 14 at 10 am**.