6.s953 Embodied Intelligence, Spring 2024

Assignment #3

Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about (you should discuss at least some of these but you don't have to address them all):

Bandit-based Monte-Carlo Planning
Optional sections: 2.4

Would UCT be useful in a problem with deterministic dynamics?
If your domain is a tree (there are never multiple paths to the same state) is there an advantage from UCT?
Is UCT guaranteed to find the optimal strategy if you give it enough computation?

FF-Replan
The all-outcomes determinization as described in this paper is very weak: it just guarantees you'll take an action that has a non-zero chance of leading to the goal. The algorithm also seems to require constructing multi-step actions, which leads to an exponential blow-up in the number of actions.

One way to improve it is to:

For each action a, with outcomes (s₁, p₁), .... (s_n, p_n), generate n deterministic actions a₁, ..., a_n where action i leads to state i, but has a cost of -log p_i.
Search for "shortest" path using these new actions.
Take the first action on that path.

Given this approach:

What can you say about the paths selected in this way?
How could you construct a domain that this method would perform poorly on?

Neural Fitted Q Iteration
This algorithm can be used to "solve" a known MDP with large or continuous action spaces, to get a value function. It can also be used in a classical "online" reinforcement learning setting, with no known model or simulator and simply access to an unknown world.

Are there limitations to its application in real environments?
What algorithmic choices might we make differently in these two settings?

Upload a single PDF file through Canvas by Feb 15 at 1 pm.