The Platonic Representation Hypothesis
Minyoung Huh* Brian Cheung* Tongzhou Wang* Phillip Isola*
MIT
Position Paper in ICML 2024
Paper · Code
Outline

Our hypothesis

How to measure convergence?

Evidence of convergence

What is driving convergence?

What are we converging to?
The world (Z) can be viewed in many different ways: in images (X), in text (Y), etc. We conjecture that representations learned on each modality on its own will converge to similar representations of Z.
Conventionally, different AI systems represent the world in different ways: a vision system might represent shapes and colors, while a language model might focus on syntax and semantics. In recent years, however, the architectures and objectives used to model images, text, and many other signals have become remarkably alike. Are the internal representations in these systems also converging?

We argue that they are, and put forth the following hypothesis:

Neural networks, trained with different objectives on different data and modalities, are converging to a shared statistical model of reality in their representation spaces.


The intuition behind our hypothesis is that all the data we consume -- images, text, sounds, etc. -- are projections of some underlying reality. A concept like "apple" 🍎 can be viewed in many different ways, but the meaning, what is represented, is roughly* the same. Representation learning algorithms might recover this shared meaning.
* Not exactly the same. The text "apple" does not say whether the fruit is red or green, but an image can; for the meanings to match, the text must be sufficiently descriptive. See the limitations section of our paper for discussion of this point.

How to measure if representations are converging?

We characterize representations in terms of their kernels, i.e. how they measure distance/similarity between inputs. Two representations are considered the same if their kernels are the same for corresponding inputs. We then say the representations are aligned. For example, if a text encoder \(f_{\text{text}}\) is aligned with an image encoder \(f_{\text{img}}\), then we would have relationships like:

\[ \text{sim}\big(f_{\text{text}}(\text{“apple”}),\, f_{\text{text}}(\text{“orange”})\big) \quad\approx\quad \text{sim}\big(f_{\text{img}}(\text{🍎}),\, f_{\text{img}}(\text{🍊})\big) \]

Kernel alignment metrics quantify the degree to which statements like the above are true, and we use these metrics to analyze if representations in different models are converging. Check out our code for implementations of such metrics, including several new ones we introduce.
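As a rough illustration of such a metric, here is a minimal NumPy sketch of one option, mutual k-nearest-neighbor overlap: for each input, compute its k nearest neighbors under each representation's kernel and average the overlap of the two neighbor sets. This is a simplified sketch for intuition only; see the linked code for the metrics actually used in the paper.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of the k nearest neighbors of each row under cosine similarity."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                         # the representation's kernel
    np.fill_diagonal(sim, -np.inf)                # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]        # top-k neighbors per input

def mutual_knn_alignment(feats_a, feats_b, k=10):
    """Average overlap of k-NN sets computed in two representation spaces.

    Rows of feats_a and feats_b must correspond to the same underlying inputs
    (e.g. an image and its caption). Returns a score in [0, 1].
    """
    nn_a, nn_b = knn_indices(feats_a, k), knn_indices(feats_b, k)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_a, nn_b)]))

# Toy usage: two random projections of a shared latent are far more aligned
# than two unrelated feature sets.
rng = np.random.default_rng(0)
z = rng.normal(size=(512, 16))                    # shared "underlying reality"
feats_text = z @ rng.normal(size=(16, 64))        # one model's view of it
feats_img = z @ rng.normal(size=(16, 128))        # another model's view of it
print(mutual_knn_alignment(feats_text, feats_img))                    # well above chance
print(mutual_knn_alignment(feats_text, rng.normal(size=(512, 128))))  # roughly chance (~k/n)
```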

Evidence of convergence

We survey many examples of convergence in the literature: over time and across multiple domains, the ways in which different neural networks represent data are becoming more aligned. We then demonstrate convergence across data modalities: as vision models and language models get larger, they measure distances between datapoints in increasingly similar ways:
As LLMs get better at language modeling, they learn representations that are more and more aligned with vision models (and conversely, bigger vision models are also better aligned with LLM embeddings).

What is driving convergence?

We argue that task and data pressures, combined with increasing model capacity, can lead to convergence. One such pressure is visualized below: as we train models on more tasks, fewer and fewer representations can satisfy all of our demands. As models become more general-purpose, they become more alike:
The more tasks we must solve, the fewer functions satisfy them all. Cao & Yamins term this the "Contravariance principle."
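The intuition can be made concrete with a toy counting exercise (a made-up illustration, not an experiment from the paper): among all Boolean functions on 2-bit inputs, count how many remain consistent as task constraints, i.e. required input-output pairs, accumulate.

```python
from itertools import product

# All Boolean functions on 2-bit inputs, written as truth tables:
# a tuple of outputs for the inputs 00, 01, 10, 11 (2^4 = 16 functions total).
inputs = list(product([0, 1], repeat=2))
functions = list(product([0, 1], repeat=len(inputs)))

# "Tasks" are required input-output pairs; more tasks means more constraints.
tasks = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # together: XOR

satisfying = functions
for i, (x, y) in enumerate(tasks, start=1):
    satisfying = [f for f in satisfying if f[inputs.index(x)] == y]
    print(f"after {i} task(s): {len(satisfying)} of {len(functions)} functions remain")
# Prints 8, 4, 2, 1: each added task shrinks the set until only XOR is left.
```

The hypothesis class here is trivially small, but the same counting pressure is what the figure above depicts: every additional task rules out candidate representations.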

What representation are we converging to?

In a particular idealized world, we show that a certain family of learners will converge to a representation whose kernel is equal to the pointwise mutual information (PMI) function over the underlying events (Z) that cause our observations, regardless of modality. For example, in a world of colors, where events \(z_{\text{red}}\) and \(z_{\text{orange}}\) generate visual and textual observations, we would have:

\[ \text{sim}\big(f_{\text{text}}(\text{“red”}),\, f_{\text{text}}(\text{“orange”})\big) \quad=\quad \text{PMI}(z_{\text{red}}, z_{\text{orange}}) + \text{const} \]
\[ \text{sim}\big(f_{\text{img}}(\color{red}{\blacksquare}\color{black}),\, f_{\text{img}}(\color{orange}{\blacksquare}\color{black})\big) \quad=\quad \text{PMI}(z_{\text{red}}, z_{\text{orange}}) + \text{const} \]

This analysis makes various assumptions and should be read as a starting point for a fuller theory. Nonetheless, empirically, we do find that PMI over pixel colors recovers a kernel similar to human perceptual similarity over colors, and that this kernel is also similar to the one LLMs recover:
This analysis suggests that certain representation learning algorithms may boil down to a simple rule: find an embedding in which similarity equals pointwise mutual information.
Kernels visualized with multidimensional scaling (i.e. a visualization where nearby points are similar according to the kernel, and far apart points are dissimilar). The language experiment here is a replication of Abdou et al. 2021.
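To make the "similarity equals PMI" rule concrete, here is a small self-contained sketch with made-up co-occurrence counts (not the data or code used in the paper): estimate a PMI kernel from co-occurrence statistics, then lay the events out in 2D with multidimensional scaling via scikit-learn.

```python
import numpy as np
from sklearn.manifold import MDS

# Hypothetical co-occurrence counts over a few events (e.g. colors observed
# together); the numbers below are made up purely for illustration.
events = ["red", "orange", "yellow", "green", "blue"]
counts = np.array([
    [50, 30, 10,  5,  2],
    [30, 45, 25,  6,  3],
    [10, 25, 40, 20,  5],
    [ 5,  6, 20, 42, 22],
    [ 2,  3,  5, 22, 48],
], dtype=float)

joint = counts / counts.sum()                  # P(z_i, z_j)
marg = joint.sum(axis=1)                       # P(z_i)
pmi = np.log(joint / np.outer(marg, marg))     # PMI(z_i, z_j) = log P(i,j)/(P(i)P(j))

# Under the idealized analysis, sim(f(i), f(j)) = PMI(z_i, z_j) + const, so the
# PMI matrix *is* the kernel (up to a constant). Turn similarities into
# dissimilarities and embed with metric MDS, as in the figure above.
dissim = pmi.max() - pmi
np.fill_diagonal(dissim, 0.0)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
for name, (x, y) in zip(events, coords):
    print(f"{name:>7s}: ({x:+.2f}, {y:+.2f})")  # events that often co-occur end up nearby
```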

Implications and limitations

The final sections of our paper discuss implications and limitations of the hypothesis. Perhaps the primary implication is this: if there is indeed a platonic representation, then finding it, and fully characterizing it, is a research program worth pursuing.

However, like any good hypothesis, ours invites counterarguments: what about the knowledge that is unique to each model and modality? What about specialist systems that don't require general-purpose world representations? We hope this work sparks vigorous debate.

Other works that have made similar arguments:

[1] Allegory of the Cave, Plato, c. 375 BC

[2] Three Kinds of Scientific Realism, Putnam, The Philosophical Quarterly, 1982

[3] Contrastive Learning Inverts the Data Generating Process, Zimmermann, Sharma, Schneider, Bethge, Brendel, ICML 2021

[4] Revisiting Model Stitching to Compare Neural Representations, Bansal, Nakkiran, Barak, NeurIPS 2021

[5] Can Language Models Encode Perceptual Structure Without Grounding? A Case Study in Color, Abdou, Kulmizev, Hershcovich, Frank, Pavlick, Søgaard, CoNLL 2021

[6] Explanatory models in neuroscience: Part 2 -- Constraint-based intelligibility, Cao, Yamins, Cognitive Systems Research, 2024

[7] Robust agents learn causal world models, Richens, Everitt, ICLR 2024



Plato imagined an "ideal" reality of which our observations are mere shadows. Putnam and others developed the idea of "convergent realism": scientists, via observation, converge on truth; our position is that deep nets work similarly. Zimmermann et al., Richens and Everitt, and many others have argued that certain representation learners recover statistical models of the latent causes of our observations. Bansal et al. hypothesized an "Anna Karenina scenario," in which all well-performing neural nets are alike. Abdou et al. showed that LLMs learn visual similarities from text alone (an experiment we have replicated). Cao and Yamins argue for a "Contravariance Principle," by which models and minds become aligned when tasked to solve hard problems. This is a curated list of closely related work; please see our paper for more.
