Assignment #11
Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; you should discuss at least some of these,
but you don't have to address them all.
World Models
Questions
For the following three questions, please see the interactive demos
here:
- Try playing with the z sliders. What are these sliders doing?
- Try changing the Tau parameter (second set of demos). What happens and why? The paper says that high Tau makes the model harder to exploit. Why is that?
- In the VizDoom interactive demo, you can play the game entirely inside the network's hallucination. Can you beat the policy reported in Table 2 of the paper? What's your highest score?
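To build intuition for the Tau question, here is a small, self-contained sketch of temperature-scaled sampling from a Gaussian mixture, in the style of the MDN-RNN used in World Models. The function name and the specific numbers are hypothetical; the scaling scheme (divide the mixture logits by tau, widen each Gaussian's standard deviation by sqrt(tau)) follows the one commonly used with World Models, but treat it as an illustration rather than the paper's exact code.

```python
import numpy as np

def sample_mdn(logits, mu, sigma, tau, rng):
    """Sample from a 1-D Gaussian mixture with temperature tau.

    Higher tau flattens the mixture weights and widens each Gaussian,
    so samples are more diverse and the rare modes appear more often;
    tau -> 0 collapses sampling toward the single most likely mode.
    (Hypothetical helper for illustration.)
    """
    # Temperature-scale the categorical distribution over components.
    scaled = logits / tau
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    k = rng.choice(len(probs), p=probs)
    # Widen the chosen Gaussian by sqrt(tau).
    return rng.normal(mu[k], sigma[k] * np.sqrt(tau))

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.0])   # component 0 strongly preferred
mu = np.array([0.0, 5.0])       # two well-separated modes
sigma = np.array([0.1, 0.1])

low_tau = [sample_mdn(logits, mu, sigma, 0.1, rng) for _ in range(1000)]
high_tau = [sample_mdn(logits, mu, sigma, 2.0, rng) for _ in range(1000)]
```

With low tau, essentially every sample comes from the preferred mode near 0; with high tau, the rarer mode near 5 shows up a meaningful fraction of the time. This is one way to see the "harder to exploit" claim: a higher-temperature dream environment is more stochastic, so a policy cannot rely on the model always producing its single most convenient continuation.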
Building Machines That Learn and Think Like People
Read Section 4, feel free to skim the rest.
Questions
- This paper argues that human learning is often more sample efficient (i.e. requires less data) than 2016-era machine learning. Did you find this argument compelling?
A possible complaint is that the comparison is not apples to apples: human adults have extensive prior experience, while most machine learning algorithms do not. How might we make the
comparison more fair? The footnote on page 8 provides one attempt. Is it convincing? Given equivalent priors, do you think the visual concept learning algorithm in our brain is more sample efficient than SGD?
(note: I don't think anyone knows for sure, this is an open question)
- Human babies seem to come with some knowledge baked in (see Section 4.1 in particular). If we, as engineers, want to build a generalist robot, what knowledge should we bake in? What should we let emerge?
UniSim
Questions
- UniSim trains a world model on a large collection of 1) internet videos and 2) screen recordings from simulation engines, among other data sources. For which tasks do you think a world model trained on #1 is likely to work well? For which tasks or scenarios is it likely to fail? What about for #2? What about for #1+#2?
- For case #2, we are learning a simulation of a simulation. What's the point of doing that? In what ways is UniSim's simulation of Habitat better than Habitat itself? In what ways is it worse?
- Consider a mobile robot that is tasked with navigating to an apple, which is out of view. What do you think would happen if we rolled out a video autoregressively given the initial observation and the "navigate to an apple" prompt? Given that the robot can actively collect observations of the environment based on its own actions, could we condition UniSim differently? Would using video prediction models be a good idea in the first place?
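As a warm-up for the last question, the toy below (not UniSim itself; the dynamics and the bias term `eps` are invented for illustration) contrasts an open-loop autoregressive rollout, where the model's own predictions are fed back in, with a rollout that is re-conditioned on fresh observations at every step.

```python
import numpy as np

def true_step(x):
    """Hypothetical environment dynamics."""
    return 0.99 * x + 1.0

def model_step(x, eps=0.05):
    """Learned model: the true dynamics plus a small systematic error."""
    return 0.99 * x + 1.0 + eps

x_env = x_open = 0.0
open_err, closed_err = [], []
for t in range(50):
    x_closed = model_step(x_env)   # closed loop: predict from the real state
    x_open = model_step(x_open)    # open loop: feed back own prediction
    x_env = true_step(x_env)       # environment advances
    open_err.append(abs(x_open - x_env))
    closed_err.append(abs(x_closed - x_env))
```

The one-step (re-conditioned) error stays at |eps|, while the open-loop error compounds over the horizon. This is the kind of drift you might expect when rolling out a long video from only the initial observation and a text prompt, and it suggests why letting the robot inject real observations back into the conditioning could help.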
Upload a single PDF file through Canvas by March 19th at 1 pm.