Assignment #6
Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are some questions to think about; you don't have to answer all of them, and the general questions may be the most interesting.
Large Scale Unsupervised Learning
- This paper suggests that compressed representations are meaningful representations -- they align with English words, like "cat". Why do you think this is? Can you imagine a universe in which this would not be the case?
- The paper mentions that previous methods involved "a certain degree of supervision" since training images were "aligned, homogeneous and belong to one selected category." Is this a substantial limitation? Does the current paper get around it?
- Fig. 3 shows two visualizations of what a trained neuron responds to. What are the advantages and disadvantages of each approach? Can you think of a scenario where one or the other would lead to a misleading visualization?
Learning Image Representations Tied to Egomotion
- The kitten carousel is an amusing motivation, but there's at least one big difference between the kitten and the method in this paper: the kitten chose the actions! In contrast, the presented method is just given a bunch of actions, from a third party, and learns a representation from them. Do you think this is an important difference? Could an agent learn a better representation if it got to decide on its own actions?
- Is equivariance always better than invariance? What are some transformations for which we would prefer to learn invariant representations?
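To make the equivariance-versus-invariance distinction concrete, here is a toy sketch (our own illustration, not code from the paper): an equivariant representation transforms predictably when the input is transformed, while an invariant one discards the transformation entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((8, 8))  # a toy "image"

def shift_right(x):
    # Circular shift by one pixel, standing in for an ego-motion.
    return np.roll(x, 1, axis=1)

# Equivariant representation: identity features. Shifting the input
# shifts the features by the same, predictable transformation.
feat_equi = lambda x: x
assert np.allclose(feat_equi(shift_right(img)), shift_right(feat_equi(img)))

# Invariant representation: global average pooling. Shifting the input
# leaves the features unchanged; the motion information is discarded.
feat_inv = lambda x: x.mean()
assert np.isclose(feat_inv(shift_right(img)), feat_inv(img))
```

The question is which behavior we want for a given transformation: equivariance preserves information about the motion, invariance throws it away.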
Contrastive Predictive Coding
- CPC's objective is to learn a representation of the present that is maximally informative about the future. The raw pixels themselves are one such representation (by the data processing inequality, they are maximally informative). Why does CPC learn a better representation than just using the raw pixels?
- Suppose we apply CPC to video frames. In that case, CPC would learn to classify between the real future frame, x_{t+k}, and a mismatched future frame, x_{j}, where x_{j} is sampled from a different video than x_{t}. What do you think will happen if each video in our dataset has a uniquely colored border around each of its frames? What features will be learned?
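The contrastive classification described above can be sketched with a minimal InfoNCE-style toy (our own construction under assumed embeddings, not the paper's code): a context vector scores the true future embedding against mismatched negatives from other videos, and the loss is the negative log-probability of the true one.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_neg = 16, 7

c_t = rng.normal(size=d)                 # context embedding of the present
z_pos = c_t + 0.1 * rng.normal(size=d)   # embedding of the real future frame
z_neg = rng.normal(size=(n_neg, d))      # embeddings sampled from other videos

candidates = np.vstack([z_pos, z_neg])   # index 0 is the true future
logits = candidates @ c_t                # dot-product similarity scores
probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()

# InfoNCE loss: negative log-probability assigned to the true future.
loss = -np.log(probs[0])
```

Note how a spurious but discriminative feature (like a uniquely colored border) would let the model separate positives from negatives without modeling any dynamics at all.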
General questions
This week we have seen two approaches to perceptual representation. Last class was "supervised" (we learned to imitate human-designed representations: segments, physics simulators, the notion of grasp). This class is "unsupervised" (natural representations emerge from very abstract modeling objectives).
- What are the advantages and disadvantages of supervised versus unsupervised learning?
- Do you think any of the unsupervised methods, scaled way up, will be enough to learn human-like mental representations? Is anything missing?
- Do you see a practical way forward for solving representations in a purely supervised mode, that is, just by imitating human knowledge?
Upload a single PDF file through Stellar by Feb 28 at 10 am.