- You can think of reinforcement learning as the problem of behaving in a POMDP: the world state is observable but the transition and reward models are not. There's a whole literature on this view, called Bayesian RL. It's computationally very difficult to do near-optimal Bayesian RL.
- Another view is to not worry too much about performance while learning (that's what the formulation above does), but to consider the case where the state is not observable. We could approach the problem using model-based RL: first try to identify the POMDP model, and then solve it using the methods from last week. The problem is that learning POMDP models is like learning HMMs, which is quite difficult in general.
- An alternative view that makes model learning easier is that of predictive state representations (PSRs), but that's beyond the scope of this course. See this or this.
- We are going to pursue, in more detail, an approach that doesn't involve constructing or solving POMDP models, but focuses on finding a good policy.

Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are some questions to think about.

**RL in Robotics**

Just read sections 2.2.2 and 2.3. This reading is for background on policy-gradient methods; focus on finite-difference and REINFORCE methods.
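As a concrete reference point for the finite-difference questions below, here is a minimal sketch of finite-difference policy-gradient estimation. The function name `evaluate_policy` and the toy objective are illustrative, not from the paper; in practice `evaluate_policy` would average the return over several rollouts.

```python
import numpy as np

def finite_difference_gradient(theta, evaluate_policy, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter in turn.

    evaluate_policy(theta) -> estimated return of the policy with
    parameters theta (e.g., averaged over several rollouts).
    """
    d = len(theta)
    grad = np.zeros(d)
    base = evaluate_policy(theta)
    for i in range(d):
        perturbed = theta.copy()
        perturbed[i] += eps
        # One extra policy evaluation per dimension, plus the base
        # evaluation: d + 1 evaluations per gradient step.
        grad[i] = (evaluate_policy(perturbed) - base) / eps
    return grad

# Toy objective standing in for expected return: J(theta) = -||theta - 1||^2.
J = lambda th: -np.sum((th - 1.0) ** 2)
theta = np.zeros(3)
for _ in range(100):
    theta += 0.1 * finite_difference_gradient(theta, J)
# theta ends up close to the optimum at 1.
```

Note that the count of policy evaluations per step is exactly what the first question below asks about.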

*Questions*

- For the finite-difference gradients method, if you have a d-dimensional parameterization of your policy space, how many policies do you have to evaluate in order to perform one gradient step?
- Does REINFORCE require full observability?
- Does REINFORCE allow the choice of action to depend on the history of previous observations and actions?
- Can finite-difference or REINFORCE be guaranteed to find an optimal policy?
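For contrast with the finite-difference method, here is a minimal REINFORCE sketch on a toy two-armed bandit (the bandit, the reward values, and all names are illustrative). The key point is that the gradient estimate uses only the log-probability of the sampled action times the observed reward, with no model of the environment.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

# Toy one-step problem: two actions with expected rewards 0.2 and 0.8.
true_rewards = np.array([0.2, 0.8])
theta = np.zeros(2)  # one logit per action

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = true_rewards[a] + 0.1 * rng.normal()  # noisy observed reward
    # REINFORCE update: grad_theta log pi(a) * r.
    # For a softmax policy, grad_theta log pi(a) = one_hot(a) - probs.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta += 0.1 * r * grad_log_pi

# The policy concentrates on the higher-reward action.
```

Because the update only ever touches the probability of the action actually taken, the same estimator applies whether or not the underlying state is fully observable, which bears on the questions above.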

- Why might adding randomness improve a policy that can only depend on the current observation?
- Is it ever *necessary* to add randomness to improve the value of a POMDP policy?
- What is the "sufficient statistic" in a classical POMDP solution? In DESPOT?
- Chris suggests using a finite history of the last 10 actions and observations as the input to a feed-forward neural network that computes the action probabilities. How might Chris's approach be better or worse than this one?
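To make the finite-history idea concrete, here is a sketch of the input representation such a network might use: the last K observation–action pairs are flattened into one vector and fed to a small feed-forward network with a softmax output. All dimensions, names, and the network shape are illustrative assumptions, not a specific implementation from the readings.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10        # length of the finite history window
OBS_DIM = 4   # size of one observation vector (illustrative)
N_ACTIONS = 3

# One hidden layer mapping the flattened history to action logits.
IN_DIM = K * (OBS_DIM + N_ACTIONS)  # each step: observation + one-hot action
W1 = rng.normal(scale=0.1, size=(IN_DIM, 32))
W2 = rng.normal(scale=0.1, size=(32, N_ACTIONS))

def action_probs(history):
    """history: list of K (observation, action) pairs, most recent last."""
    x = np.concatenate([
        np.concatenate([obs, np.eye(N_ACTIONS)[a]]) for obs, a in history
    ])
    h = np.tanh(x @ W1)
    logits = h @ W2
    z = np.exp(logits - logits.max())
    return z / z.sum()

history = [(rng.normal(size=OBS_DIM), rng.integers(N_ACTIONS))
           for _ in range(K)]
p = action_probs(history)  # a distribution over the 3 actions
```

The fixed window makes the input size constant, but anything that happened more than K steps ago is invisible to the policy; that trade-off is the crux of the question above.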

- How is this idea related to the fact that a POMDP is an MDP in belief space?
- What is the ideal order to do the Q-value updates after a long episode that results in a win?
- What makes the implementation of experience replay in this architecture tricky?
- Would this solution method have worked well on the problems of the previous paper? Would the previous paper's method have worked well on these problems?
- How long a window of previous inputs (as opposed to the LSTM) do you think would have been necessary to solve flickering Atari problems?
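As background for the experience-replay question, here is a minimal sketch of a uniform, transition-level replay buffer (all names are illustrative and independent of any particular paper's implementation). The tricky part alluded to above arises when the agent is recurrent: a uniform sample of isolated transitions, as below, destroys the temporal structure an LSTM needs, so the buffer must instead return contiguous sub-sequences.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest entry evicted when full
        self.rng = random.Random(seed)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks correlations between consecutive
        # transitions; a recurrent agent would need contiguous windows
        # of the episode instead.
        return self.rng.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.push((t, 0, 0.0, t + 1, False))
# Only the 3 most recent transitions remain in the buffer.
```

Thinking about why this simple design fails for an LSTM-based agent is a good way into the "tricky implementation" question above.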

Upload a single PDF file through Stellar by **Mar 7 at 10 am**.