Assignment #8
Background for all three papers
- You can think of reinforcement learning as the problem of behaving in a POMDP: the world state is observable, but the transition and reward models are not, so the unknown models play the role of hidden state. There's a whole literature on this view, called Bayesian RL. Near-optimal Bayesian RL is computationally very difficult.
- Another view is not to worry too much about performance while learning (that's what the formulation above does) but to consider the case where the state is not observable. We could approach the problem using model-based RL, in which we first try to identify the POMDP model and then solve it using methods from last class. The problem is that learning POMDP models is like learning HMMs, which is quite difficult in general.
- An alternative view that makes model-learning easier is that of predictive state representations (PSR). See this or this for example.
- We are going to pursue, in more detail, an approach that doesn't involve constructing or solving POMDP models, but focuses on finding a good policy.
Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; you should discuss at least some of them, but you don't have to address them all:
RL in Robotics
Just read sections 2.2.2 and 2.3. This reading is for background on policy-gradient methods; focus on finite-difference and REINFORCE methods.
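As a concrete reference point for the finite-difference question below, here is a minimal sketch of finite-difference gradient estimation. Everything in it is a hypothetical stand-in: `evaluate_policy` abstracts a (noisy, in practice) estimate of a policy's return, and the quadratic toy objective `J` is only there so the sketch runs end to end:

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate grad J(theta) by perturbing each parameter in turn.

    Central differences require 2*d policy evaluations for a
    d-dimensional parameter vector (one-sided differences need d+1).
    """
    d = theta.size
    grad = np.zeros(d)
    for i in range(d):
        delta = np.zeros(d)
        delta[i] = eps
        grad[i] = (evaluate_policy(theta + delta)
                   - evaluate_policy(theta - delta)) / (2 * eps)
    return grad

# Toy stand-in for a policy-return estimate: J(theta) = -||theta - 1||^2.
J = lambda th: -np.sum((th - 1.0) ** 2)
g = finite_difference_gradient(J, np.zeros(3))
```

Note that the loop makes the evaluation count explicit, which is the quantity the first question below asks about.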
Questions
- For the finite-difference gradients method, if you have a d-dimensional parameterization of your policy space, how many policies do you have to evaluate in order to perform one gradient step?
- Does REINFORCE require full observability?
- Does REINFORCE allow the choice of action to depend on the history of previous observations and actions?
- Can finite-difference or REINFORCE be guaranteed to find an optimal policy?
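To make the REINFORCE questions concrete, here is a minimal sketch on the degenerate one-step case (a two-armed bandit with a softmax policy), where the return is just the immediate reward. The arm payoffs, step size, and iteration count are all illustrative, not from the reading; the key line is the score-function update `theta += alpha * R * grad_log_pi`, which needs only sampled actions and observed returns:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical two-armed bandit: arm 1 pays 1.0, arm 0 pays 0.2.
rewards = np.array([0.2, 1.0])
theta = np.zeros(2)   # policy parameters (action preferences)
alpha = 0.1           # step size

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    R = rewards[a]
    # Score function for a softmax policy: grad log pi(a) = one_hot(a) - p.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    theta += alpha * R * grad_log_pi
```

Nothing in the update touches a state estimate, which bears on the full-observability question; in the episodic case the same update is applied with the sampled return in place of `R`.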
Solving deep memory POMDPs with recurrent policy gradients
Questions
- Why might adding randomness improve a policy that can only depend on the current observation?
- Is it ever *necessary* to add randomness to improve the value of a POMDP policy?
- What is the "sufficient statistic" in a classical POMDP solution?
- Chris suggests using a finite history of the last 10 actions and observations as the input to a feed-forward neural network that computes the action probabilities. How might Chris's approach be better or worse than the recurrent approach in this paper?
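A sketch of the finite-history input in the last question, to make the comparison with a recurrent policy concrete. The window length, padding scheme, and encoding (observation concatenated with a one-hot action) are all illustrative assumptions:

```python
from collections import deque
import numpy as np

HISTORY_LEN = 10  # fixed window size, as in the question

def make_window(obs_dim, act_dim):
    """Fixed-length history of (observation, one-hot action) pairs,
    zero-padded at the start of an episode."""
    pad = np.zeros(obs_dim + act_dim)
    return deque([pad.copy() for _ in range(HISTORY_LEN)],
                 maxlen=HISTORY_LEN)

def push(window, obs, action, act_dim):
    one_hot = np.zeros(act_dim)
    one_hot[action] = 1.0
    window.append(np.concatenate([obs, one_hot]))  # oldest entry drops out

def as_input(window):
    # Flat vector fed to a feed-forward policy network.
    return np.concatenate(list(window))

w = make_window(obs_dim=3, act_dim=2)
push(w, np.array([0.5, -1.0, 2.0]), action=1, act_dim=2)
x = as_input(w)  # shape: (HISTORY_LEN * (obs_dim + act_dim),)
```

The `maxlen` deque makes the trade-off visible: anything older than 10 steps is discarded, whereas a recurrent policy's hidden state can, in principle, summarize the entire history.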
Bridging State and History Representations: Understanding Self-Predictive RL
Questions
- Consider some of the other representation learning algorithms we have seen in this class. Can you cast them as self-prediction?
- How does this paper explain the stop-gradient technique?
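For the stop-gradient question, here is a deliberately tiny linear sketch of a latent self-prediction objective of the kind the paper analyzes. All names and shapes are illustrative: an online encoder `W` produces latents, `P` predicts the next latent, and the stop-gradient is modeled by treating the target latent as a constant when computing gradients by hand:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear self-predictive setup (shapes are illustrative):
obs_dim, latent_dim = 4, 2
W = rng.normal(size=(latent_dim, obs_dim))    # online encoder
P = rng.normal(size=(latent_dim, latent_dim)) # latent transition predictor

def loss_and_grads(o, o_next):
    z, z_next = W @ o, W @ o_next
    target = z_next.copy()      # stop-gradient: target treated as a constant
    err = P @ z - target
    loss = 0.5 * err @ err
    grad_P = np.outer(err, z)
    # The gradient reaches W only through the *predicted* branch P @ (W @ o);
    # without the stop-gradient there would be an extra term
    # -np.outer(err, o_next) from the target branch.
    grad_W = P.T @ np.outer(err, o)
    return loss, grad_P, grad_W

o, o_next = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
loss, grad_P, grad_W = loss_and_grads(o, o_next)
```

The commented-out extra term is the crux: blocking the gradient on the target side changes which fixed points the learning dynamics have, which is what the paper's analysis of the stop-gradient technique is about.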
Upload a single PDF file through Canvas by March 7th at 1 pm.