Assignment #5
Provide a short discussion of each of the assigned papers (listed under Course Materials). Below are questions to think about; you should discuss at least some of these, but you don't have to address them all:
Dense Object Nets
- This paper learns descriptors that map corresponding points to the same feature vector, and non-corresponding points to different feature vectors. They identify corresponding points (the "data association problem") via the training procedure in Section 3.1, where they take multiple photos of an object on a table and get correspondence between the photos from an explicit 3D reconstruction. This procedure is quite expensive, as it requires a calibrated lab setup and robot arm. What might be some cheaper ways of solving the data association problem? How might an agent solve this problem in the wild?
- Figure 2 shows false color images where similar colors represent similar descriptors. What are some desirable properties of the descriptors that are apparent in these visualizations? Do you notice any undesirable properties?
- Section 5.3 discusses class-specific versus instance-specific descriptors. What are the relative advantages and disadvantages of these two kinds of descriptors?
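To make the first question concrete, here is a minimal numpy sketch of a pixelwise contrastive objective of the kind the question describes: matched pixel pairs are pulled toward the same descriptor, and non-matched pairs are pushed apart up to a margin. The function name, array shapes, and margin value are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def contrastive_descriptor_loss(desc_a, desc_b, match_idx, non_match_idx, margin=0.5):
    """Pixelwise contrastive loss over two descriptor images (a sketch).

    desc_a, desc_b: (H*W, D) arrays of per-pixel descriptors.
    match_idx: (N, 2) index pairs (i, j) of corresponding pixels.
    non_match_idx: (M, 2) index pairs of non-corresponding pixels.
    """
    # Pull matched descriptors together (squared distance).
    da = desc_a[match_idx[:, 0]]
    db = desc_b[match_idx[:, 1]]
    match_loss = np.mean(np.sum((da - db) ** 2, axis=1))

    # Push non-matches apart, penalizing only pairs closer than the margin.
    na = desc_a[non_match_idx[:, 0]]
    nb = desc_b[non_match_idx[:, 1]]
    dist = np.linalg.norm(na - nb, axis=1)
    non_match_loss = np.mean(np.maximum(0.0, margin - dist) ** 2)

    return match_loss + non_match_loss
```

Note that the loss is zero exactly when every matched pair coincides and every non-matched pair is at least `margin` apart, which is the property the false-color visualizations in Figure 2 are meant to show.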
Reasoning About Physical Interactions with Object-Oriented Prediction and Planning
- This paper argues for a factored, "object-oriented" representation of visual state. How does this factorization make prediction and planning easier? Think about the way goals are defined and the relationship to ideas in the papers we read on planning (e.g., "Planning As Heuristic Search"). What might be some other useful factored representations of visual scenes?
- The paper argues against direct supervision of state (object attributes). Instead, object representations emerge somewhat indirectly, in service of a visual prediction task. What are some advantages and disadvantages of this indirect approach over the alternatives presented in Figure 2?
- Algorithm 1 presents a simple planning algorithm. Can you think of a goal tower configuration for which this planner would fail? How could the planner or representation be improved to handle your example?
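As a reference point for the last question, a greedy sampling planner of the kind Algorithm 1 presents can be sketched as follows: at each step, sample candidate actions, roll each through the learned model, and commit to the one whose predicted next state scores best against the goal. Here `predict`, `score`, and `sample_action` are hypothetical stand-ins for the learned predictor and goal metric, not the paper's actual components.

```python
import random

def greedy_plan(state, goal, predict, score, sample_action,
                n_samples=100, horizon=3):
    """Greedy sampling planner (a sketch).

    At each of `horizon` steps, draw `n_samples` candidate actions,
    simulate each with `predict`, and commit to the action whose
    predicted next state has the lowest `score` against the goal.
    Greedy commitment is exactly what makes it fail on goals that
    require temporarily moving away from the goal configuration.
    """
    plan = []
    for _ in range(horizon):
        candidates = [sample_action() for _ in range(n_samples)]
        best = min(candidates, key=lambda a: score(predict(state, a), goal))
        plan.append(best)
        state = predict(state, best)  # commit and continue from the prediction
    return plan
```

The comment in the docstring is a hint for the question: any tower that requires an intermediate state scoring worse than the current one (e.g., removing a correctly placed block to reach one beneath it) defeats a purely greedy, one-step-lookahead scheme.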
F3RM: Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
This paper proposes to lift pretrained 2D features into a 3D representation, called a Distilled Feature Field, using volumetric rendering.
- How does F3RM’s approach to obtaining descriptors fundamentally differ from that in Dense Object Nets? Which features do you expect to be more general-purpose and generalizable? Which might offer greater accuracy in the single-object case, such as for the caterpillar or Baymax toy?
- The paper considers using few-shot imitation learning to predict single 6-DOF grasp or place poses. Can all tasks be represented with a single pose? Could we combine Diffusion Policy with feature fields to predict trajectories of poses? How could we condition on a feature field?
- Why does F3RM need so many images of the scene (50 are used) in order to construct the feature field? How could we reduce this requirement?
- What _is_ the right object representation for robotic perception? Does it differ if the task is, e.g., navigation rather than manipulation?
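For the questions above, it helps to recall how a feature field is rendered: like a NeRF, per-sample features along a camera ray are alpha-composited using the density field, which is also why many posed views are needed to constrain the reconstruction. A minimal numpy sketch of that compositing for a single ray, assuming per-sample densities and features are already given (sampling and the networks themselves are omitted):

```python
import numpy as np

def render_feature_along_ray(densities, features, deltas):
    """Alpha-composite per-sample features along one ray, NeRF-style (a sketch).

    densities: (N,) volume densities sigma_i at samples along the ray.
    features: (N, D) feature vectors at those samples.
    deltas: (N,) distances between adjacent samples.
    Returns the rendered (D,) feature for the ray.
    """
    alphas = 1.0 - np.exp(-densities * deltas)  # per-sample opacity
    # Transmittance: probability the ray reaches each sample unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                    # compositing weights
    return (weights[:, None] * features).sum(axis=0)
```

During distillation, the rendered feature for each ray is regressed against the pretrained 2D feature at the corresponding pixel; the 3D field is only constrained along rays that are actually observed, which is one way to think about why so many views are used.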
Upload a single PDF file through Stellar by
Feb 27 at 1 pm.