6.s953 Embodied Intelligence, Spring 2024

Assignment #6

Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; discuss at least some of them, but you do not have to address them all:

Contrastive Predictive Coding

CPC's stated objective is to learn a representation of the present that is maximally informative about the future. The raw pixels themselves are one such representation: by the data processing inequality, no representation computed from the pixels can carry more information about the future than the pixels do. Why, then, does CPC learn a better representation than the raw pixels?
Suppose we apply CPC to video frames. In that case, CPC would learn to classify between the real future frame, x_{t+k}, and a mismatched future frame, x_{j}, where x_{j} is sampled from a different video than x_{t}. What do you think will happen if each video in our dataset has a uniquely colored border around each of its frames? What features will be learned?
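
For reference while thinking about these questions, here is a minimal sketch of a CPC-style contrastive (InfoNCE) loss. It is an illustration under stated assumptions, not the paper's implementation: the function name info_nce_loss, the single bilinear matrix W, and the use of the other items in the batch as mismatched futures are all choices made for this example.

```python
import numpy as np

def info_nce_loss(context, future, W):
    """Sketch of an InfoNCE-style loss (hypothetical helper, not from the paper's code).

    context: (N, d_c) array, one context vector c_t per sequence in the batch.
    future:  (N, d_z) array, the encoded true future z_{t+k} for each sequence.
    W:       (d_c, d_z) array, an assumed bilinear predictor.
    Row i's positive is future[i]; the other rows act as mismatched futures.
    """
    # scores[i, j] = c_i^T W z_j
    scores = context @ W @ future.T
    # Numerically stable log-softmax over the candidates in each row.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The correct (context, future) pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))
```

With W set to all zeros every candidate scores the same, so the loss equals log(N); the loss falls toward zero only when each context scores its own future well above the mismatched ones.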

Learning image representations tied to ego-motion

The kitten carousel is an amusing motivation, but there's at least one big difference between the kitten and the method in this paper: the kitten chose the actions! In contrast, the presented method is just given a bunch of actions, from a third party, and learns a representation from them. Do you think this is an important difference? Could an agent learn a better representation if it got to decide on its own actions?
Is equivariance always better than invariance? What are some cases where we might prefer to learn invariant representations?
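
To make the two terms concrete, here is a toy sketch using circular shifts of a 1-D signal as the transformation. The features below are hypothetical examples chosen for illustration and are not the learned features from the paper: an invariant feature gives the same output after the shift, while an equivariant feature changes in a predictable way (here, it shifts by the same amount).

```python
import numpy as np

def shift(x, k):
    # The "ego-motion" in this toy example: a circular shift by k positions.
    return np.roll(x, k)

def invariant_feature(x):
    # Sorting discards position, so the output ignores any shift.
    return np.sort(x)

def equivariant_feature(x):
    # Circular first difference: shifting the input shifts the output.
    return x - np.roll(x, 1)

x = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
k = 2

# Invariance: f(g x) == f(x)
is_invariant = np.allclose(invariant_feature(shift(x, k)), invariant_feature(x))
# Equivariance: f(g x) == g f(x)
is_equivariant = np.allclose(equivariant_feature(shift(x, k)),
                             shift(equivariant_feature(x), k))
print(is_invariant, is_equivariant)
```

Note that the invariant feature cannot tell you how far the signal was shifted, whereas the equivariant one retains that information.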

Quasimetric Reinforcement Learning

QRL learns state/goal embeddings that relate to V* (i.e., the cost of the optimal plan from state to goal). Do you think these embeddings would be good state representations for other purposes? For example, would they be useful as general-purpose features to condition a policy on? What might be some limitations of this representation?
Imagine you are a giant. You pick up the Kendall T station and lift it high in the air. As you lift, the red line subway tracks get pulled into the air, and then other T stops get uprooted, and the whole network rises. Now suppose you hold Kendall many miles high and all the other stations dangle below it, held up by the various tracks.

What is the height of each station above the ground? Give an answer in terms of H, the height of Kendall, and V(s), the time it takes to travel from Kendall to stop s along the shortest route.
Our claim: you, the giant, just ran (metric, single-goal) QRL over the MBTA graph, just by picking up Kendall. Explain this.
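
If it helps to experiment, V(s) as defined above can be computed on a small graph with Dijkstra's algorithm. This is a minimal sketch: the station names other than Kendall and all travel times are made-up placeholders, not real MBTA data, and shortest_times is a hypothetical helper written for this example.

```python
import heapq

def shortest_times(graph, source):
    """Dijkstra's algorithm: shortest travel time from source to every reachable stop."""
    times = {source: 0.0}
    frontier = [(0.0, source)]
    while frontier:
        t, node = heapq.heappop(frontier)
        if t > times.get(node, float("inf")):
            continue  # stale queue entry; a shorter route was already found
        for neighbor, dt in graph[node].items():
            new_t = t + dt
            if new_t < times.get(neighbor, float("inf")):
                times[neighbor] = new_t
                heapq.heappush(frontier, (new_t, neighbor))
    return times

# Toy network with hypothetical stops A, B, C and made-up travel times (minutes).
toy_graph = {
    "Kendall": {"A": 2.0, "B": 5.0},
    "A": {"Kendall": 2.0, "B": 1.0, "C": 6.0},
    "B": {"Kendall": 5.0, "A": 1.0, "C": 2.0},
    "C": {"A": 6.0, "B": 2.0},
}

V = shortest_times(toy_graph, "Kendall")
print(V)
```

Here the direct Kendall-to-B track takes 5 minutes, but the route through A takes 3, so V("B") is 3.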

Upload a single PDF file through Canvas by Feb 29 at 1 pm.