“In projecting language back as the model for thought, we lose sight of the tacit embodied understanding that undergirds our intelligence.” –Terry Winograd
The recent successes of generative AI models have convinced some that AGI is imminent. While these models appear to capture the essence of human intelligence, they defy even our most basic intuitions about it. They have emerged not because they are thoughtful solutions to the problem of intelligence, but because they scaled effectively on hardware we already had. Seduced by the fruits of scale, some have come to believe that it provides a clear pathway to AGI. The most emblematic case of this is the multimodal approach, in which massive modular networks are optimized for an array of modalities that, taken together, appear general. However, I argue that this strategy is bound to fail in the near term; it will not lead to human-level AGI that can, e.g., perform sensorimotor reasoning, motion planning, and social coordination. Instead of attempting to connect modalities together into a patchwork AGI, we should pursue approaches to intelligence that treat embodiment and interaction with the environment as primary, and see modality-centered processing as emergent phenomena.
Preface: Disembodied definitions of Artificial General Intelligence — emphasis on general — exclude important problem areas that we should expect AGI to be able to solve. A true AGI must be general across all domains. Any complete definition must at least include the ability to solve problems that originate in physical reality, e.g. repairing a car, untying a knot, preparing food, etc. As I will discuss in the next section, what is needed for these problems is a form of intelligence that is fundamentally situated in something like a physical world model. For more discussion on this, look out for Designing an Intelligence, edited by George Konidaris, MIT Press, forthcoming.
Why We Need the World, and How LLMs Pretend to Understand It
TL;DR: I first argue that true AGI needs a physical understanding of the world, as many problems cannot be converted into a problem of symbol manipulation. It has been suggested by some that LLMs are learning a model of the world through next-token prediction, but it is more likely that LLMs are learning bags of heuristics to predict tokens. This leaves them with a superficial understanding of reality and contributes to false impressions of their intelligence.
The most surprising result of the predict-next-token objective is that it yields AI models that replicate a deeply human-like understanding of the world, despite never having observed it as we have. This result has led to confusion about what it means to understand language and even to understand the world — something we have long believed to be a prerequisite for language understanding. One explanation for the capabilities of LLMs comes from an emerging theory suggesting that they induce models of the world through next-token prediction. Proponents of this theory cite the prowess of SOTA LLMs on various benchmarks, the convergence of large models to similar internal representations, and their favorite rendition of the idea that “language mirrors the structure of reality,” a notion that has been espoused at least by Plato, Wittgenstein, Foucault, and Eco. While I am generally in support of digging up esoteric texts for research inspiration, I worry that this metaphor has been taken too literally. Do LLMs really learn implicit models of the world? How could they otherwise be so proficient at language?
One source of evidence in favor of the LLM world-modeling hypothesis is the Othello paper, in which researchers were able to predict the board of an Othello game from the hidden states of a transformer model trained on sequences of legal moves. However, there are plenty of issues with generalizing these results to models of natural language. For one, while Othello moves can provably be used to infer the full state of an Othello board, we have no reason to believe that a full picture of the physical world can be inferred from a linguistic description. What sets the game of Othello apart from many tasks in the physical world is that Othello fundamentally resides in the land of symbols, and is merely implemented using physical tokens to make it easier for humans to play. A full game of Othello can be played with just pen and paper, but one cannot, e.g., sweep a floor, do dishes, or drive a car with just pen and paper. To solve such tasks, you need some physical conception of the world beyond what humans can simply say about it. Whether that conception of the world is encoded in a formal world model or, e.g., a value function is up for debate, but it is clear that there are many problems in the physical world that cannot be fully represented by a system of symbols and solved with mere symbol manipulation.
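To make the setup concrete, here is a minimal sketch of the probing methodology the Othello work popularized: a small linear probe is trained to read board state out of the transformer's hidden activations. This is an illustration of the technique under assumed shapes and placeholder data, not the authors' code.

```python
# Minimal probing sketch (hypothetical dimensions, placeholder data): can a linear
# map decode the full Othello board from a move-sequence model's hidden states?
import torch
import torch.nn as nn

HIDDEN_DIM = 512    # assumed width of the transformer's residual stream
BOARD_CELLS = 64    # 8x8 Othello board
CELL_STATES = 3     # empty / black / white

probe = nn.Linear(HIDDEN_DIM, BOARD_CELLS * CELL_STATES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def probe_step(hidden_states: torch.Tensor, board_labels: torch.Tensor) -> float:
    """hidden_states: (batch, HIDDEN_DIM) activations taken after a move token.
    board_labels: (batch, BOARD_CELLS) integer cell states of the true board."""
    logits = probe(hidden_states).view(-1, BOARD_CELLS, CELL_STATES)
    loss = loss_fn(logits.permute(0, 2, 1), board_labels)  # (batch, classes, cells)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# High probe accuracy on held-out games is the evidence cited for an "emergent
# world model"; the debate is over what such decodability actually implies.
```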
Another issue, stated in Melanie Mitchell’s recent piece and supported by this paper, is that there is evidence that generative models can score remarkably well on sequence prediction tasks while failing to learn models of the worlds that created such sequence data, e.g. by learning entire sets of idiosyncratic heuristics. E.g., it was pointed out in this blog post that OthelloGPT learned sequence prediction rules that don’t actually hold for all possible Othello games, like “if the token for B4 doesn’t appear before A4 in the input string, then B4 is empty.” While one can argue that it doesn’t matter how a world model predicts the next state of the world, it should raise suspicion when that prediction reflects a better understanding of the training data than of the underlying world that led to such data. This, unfortunately, is the central fault of the predict-next-token objective, which seeks only to retain information relevant to the prediction of the next token. If that can be accomplished with something easier to learn than a world model, it likely will be.
To assert without caveat that predicting the effects of earlier symbols on later symbols requires a model of the world like the ones humans generate from perception would be to abuse the “world model” notion. Unless we disagree on what the world is, it should be clear that a true world model can be used to predict the next state of the physical world given a history of states. Related world models, which predict high-fidelity observations of the physical world, are leveraged in many subfields of AI including model-based reinforcement learning, task and motion planning in robotics, causal world modeling, and areas of computer vision to solve problems instantiated in physical reality. LLMs are simply not running physics simulations in their latent next-token calculus when they ask you whether your person, place, or thing is bigger than a breadbox. In fact, I conjecture that the behavior of LLMs is due not to a learned world model, but to brute-force memorization of incomprehensibly abstract rules governing the behavior of symbols, i.e. a model of syntax.
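For contrast, here is a rough sketch of the interface such a world model exposes in, e.g., model-based RL or task and motion planning. The types and names are hypothetical; the point is the shape of the problem: physical state in, predicted physical state out, with a planner rolling candidate action sequences forward.

```python
# Illustrative world-model interface (hypothetical types, not any specific system):
# predict the next physical state from the current state and an action.
from dataclasses import dataclass
from typing import Protocol

@dataclass
class State:
    """A physical state, e.g. object poses and velocities."""
    positions: list[tuple[float, float, float]]
    velocities: list[tuple[float, float, float]]

class WorldModel(Protocol):
    def predict(self, state: State, action: list[float]) -> State:
        """Return the next physical state given the current state and an action."""
        ...

def rollout(model: WorldModel, state: State, plan: list[list[float]]) -> list[State]:
    """Simulate a candidate plan forward; a planner then scores the trajectory."""
    trajectory = [state]
    for action in plan:
        state = model.predict(state, action)
        trajectory.append(state)
    return trajectory
```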
Quick primer:
- Syntax is a subfield of linguistics that studies how words of various grammatical categories (e.g. parts of speech) are arranged together into sentences, which can be parsed into syntax trees. Syntax studies the structure of sentences and the atomic parts of speech that compose them.
- Semantics is another subfield concerned with the literal meaning of sentences, e.g., compiling “I’m feeling cold” into the idea that you are experiencing cold. Semantics boils language down to literal meaning, which is information about the world or human experience.
- Pragmatics studies the interplay of physical and conversational context on speech interactions, like when someone knows to close an ajar window when you tell them “I’m feeling cold.” Pragmatics involves interpreting speech while reasoning about the environment and the intentions and hidden knowledge of other agents.
Without getting too technical, there is intuitive evidence that somewhat separate systems of cognition are responsible for each of these linguistic faculties. Look no further than the ability of humans to generate syntactically well-formed sentences that have no semantic meaning, e.g. Chomsky’s famous sentence “Colorless green ideas sleep furiously,” or sentences with well-formed semantics that make no pragmatic sense, e.g. responding merely with “Yes, I can” when asked, “Can you pass the salt?” Crucially, it is the fusion of the disparate cognitive abilities underpinning them that coalesces into human language understanding. For example, there isn’t anything syntactically wrong with the sentence, “The fridge is in the apple,” as a syntactic account of “the fridge” and “the apple” would categorize them as noun phrases that can be used to produce a sentence with the production rule, S → (NP “is in” NP). However, humans recognize an obvious semantic failure in the sentence that becomes apparent after attempting to reconcile its meaning with our understanding of reality: we know that fridges are larger than apples, and could not fit inside them.
But what if you had never perceived the real world, yet were still trying to determine whether the sentence was ill-formed? One solution could be to embed semantic information at the level of syntax, e.g., by inventing new syntactic categories, NP_{the fridge} and NP_{the apple}, and a single new production rule that prevents semantic misuse: S → (NP_{the apple} “is in” NP_{the fridge}). While this strategy would no longer require grounded world knowledge about fridges and apples, e.g., it would require special grammar rules for every semantically well-formed construction… which is actually possible to learn given an enormous corpus of natural language. Crucially, this would not be the same thing as grasping semantics, which in my opinion is fundamentally about understanding the nature of the world.
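A toy sketch of this move (hypothetical categories, not a real grammar of English): a purely syntactic rule happily accepts “the fridge is in the apple,” while the “semantics baked into syntax” strategy blocks it only by memorizing an allowed rule for each acceptable pair — a rule set that grows with the vocabulary rather than with understanding.

```python
# Toy illustration: syntactic acceptance vs. lexicalized "semantics as syntax".
NOUN_PHRASES = {"the apple", "the fridge", "the kitchen", "the house"}

def syntactic_ok(sentence: str) -> bool:
    # S -> NP "is in" NP : any noun phrases are allowed by pure syntax
    left, sep, right = sentence.partition(" is in ")
    return bool(sep) and left in NOUN_PHRASES and right in NOUN_PHRASES

# "Semantics as syntax": enumerate which containments are allowed, pair by pair.
CONTAINMENT_RULES = {
    ("the apple", "the fridge"),
    ("the fridge", "the kitchen"),
    ("the kitchen", "the house"),
}

def lexicalized_ok(sentence: str) -> bool:
    left, sep, right = sentence.partition(" is in ")
    return bool(sep) and (left, right) in CONTAINMENT_RULES

print(syntactic_ok("the fridge is in the apple"))    # True: syntax alone cannot object
print(lexicalized_ok("the fridge is in the apple"))  # False: but only via a memorized rule
```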
Discovering that LLMs have reduced problems of semantics and pragmatics into syntax would have profound implications for how we should view their intelligence. People often treat language proficiency as a proxy for general intelligence by, e.g., strongly associating pragmatic and semantic understanding with the cognitive abilities that undergird them in humans. For example, someone who appears well-read and graceful in navigating social interactions is likely to score high on traits like sustained attention and theory of mind, which lie closer to measures of raw cognitive ability. In general, these proxies are reasonable for assessing a particular person’s general intelligence, but not an LLM’s, since the apparent linguistic skills of LLMs may come from entirely separate mechanisms of cognition.
The Bitter Lesson Revisited
TL;DR: Sutton’s Bitter Lesson has often been interpreted as meaning that making any assumptions about the structure of AI is a mistake. This is both unproductive and a misinterpretation; it is precisely when humans think deeply about the structure of intelligence that major advancements occur. Despite this, scale maximalists have implicitly suggested that multimodal models can be a structure-agnostic framework for AGI. Ironically, today’s multimodal models contradict Sutton’s Bitter Lesson by making implicit assumptions about the structure of individual modalities and how they should be sewn together. In order to build AGI, we must either think deeply about how to unite existing modalities, or dispense with them altogether in favor of an interactive and embodied cognitive process.

The paradigm that led to the success of LLMs is marked primarily by scale, not efficiency. We have effectively trained a pile of one trillion ants for one billion years to mimic the form and function of a Formula 1 race car; eventually it gets there, but wow was the process inefficient. This analogy nicely captures a debate between structuralists, who want to build things like “wheels” and “axles” into AI systems, and scale maximalists, who want more ants, years, and F1 races to train on. Despite many decades of structuralist study in linguistics, the unstructured approaches of scale maximalism have yielded far better ant-racecars in recent years. This was most notably articulated by Rich Sutton — a recent recipient of the Turing Award together with Andy Barto for their work in Reinforcement Learning — in his piece “The Bitter Lesson.”
[W]e should build in only the meta-methods that can find and capture this arbitrary complexity… Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. – Rich Sutton
Sutton’s argument is that methods that leverage computational resources will outpace methods that do not, and that any structure for problem-solving built into AI as an inductive bias will hinder it from learning better solutions. This is a compelling argument that I believe has been severely misinterpreted by some as implying that making any assumptions about structure is a misstep. It is, in fact, human intuition that was responsible for many important advancements in the development of SOTA neural network architectures. For example, Convolutional Neural Networks made an assumption about translation invariance for pattern recognition in images and kickstarted the modern field of deep learning for computer vision; the attention mechanism of Transformers made an assumption about the long-distance relationships between symbols in a sentence that made ChatGPT possible and had nearly everyone drop their RNNs; and 3D Gaussian Splatting made an assumption about the solidity of physical objects that made it more performant than NeRFs. Likely none of these methodological assumptions apply to the entire space of possible scenes, images, or token streams, but they do for the specific ones that humans have curated and formed structural intuitions about. Let’s not forget that humans have co-evolved with the environments that these datasets are drawn from.
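To make “an assumption about structure” concrete, here is a minimal PyTorch sketch (arbitrary sizes, my own illustration) of what the convolutional assumption buys: a shared kernel with a tiny parameter count and translation equivariance that a dense layer would have to learn from data at every location.

```python
# Minimal illustration of an architectural inductive bias: convolution vs. dense layer.
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1, bias=False)
dense = nn.Linear(32 * 32, 8 * 32 * 32, bias=False)

print(sum(p.numel() for p in conv.parameters()))   # 72 parameters (one shared kernel)
print(sum(p.numel() for p in dense.parameters()))  # 8,388,608 parameters

# Shifting the input shifts the conv output by the same amount (away from borders);
# a dense layer carries no such guarantee and must relearn the pattern everywhere.
image = torch.zeros(1, 1, 32, 32)
image[0, 0, 10, 10] = 1.0
shifted = torch.roll(image, shifts=(5, 5), dims=(2, 3))
out, out_shifted = conv(image), conv(shifted)
print(torch.allclose(torch.roll(out, shifts=(5, 5), dims=(2, 3)), out_shifted, atol=1e-6))
```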
The real question is how we might heed Sutton’s Bitter Lesson in our development of AGI. The scale maximalist approach worked for LLMs and LVMs (large vision models) because we had natural deposits of text and image data, but an analogous application of scale maximalism to AGI would require forms of embodiment data that we simply don’t have. One solution to this data scarcity issue extends the generative modeling paradigm to multimodal modeling — encompassing language, vision, and action — with the hope that a general intelligence can be built by summing together general models of narrow modalities.
There are several issues with this approach. First, there are deep connections between modalities that are unnaturally severed in the multimodal setting, making the problem of concept synthesis ever harder. In practice, uniting modalities often involves pre-training dedicated neural modules for each modality and joining them together into a joint embedding space. In the early days, this was achieved by nudging the embeddings of, e.g. (language, vision, action) tuples to converge to similar latent vectors of meaning, an enormous oversimplification of the kinds of relationships that may exist between modalities. One can imagine, e.g., captioning an image at various levels of abstraction, or implementing the same linguistic instruction with different sets of physical actions. Such one-to-many relationships suggest that a contrastive embedding objective is not appropriate.
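For readers unfamiliar with that objective, here is a simplified CLIP-style sketch (hypothetical encoders and dimensions, not any specific system); note how the loss hard-codes a one-to-one correspondence between modalities, which is exactly what one-to-many relationships violate.

```python
# Simplified contrastive (CLIP-style) objective: paired percepts from two modalities
# are pulled toward the same latent vector, non-pairs pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, image_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """text_emb, image_emb: (batch, dim), where row i of each is a matched pair."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(text_emb))           # pair i should match only pair i
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Built-in assumption: each caption has exactly one correct image and vice versa.
# A caption with many valid images, or an instruction with many valid action
# sequences, breaks this one-to-one structure.
```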
While modern approaches don’t make such stringent assumptions about how modalities should be united, they still universally encode percepts from all modalities (e.g. text, images) into the same latent space. Intuitively, it might seem that such latent spaces could serve as common conceptual ground across modalities, analogous to a space of human concepts. However, these latent spaces do not cogently capture all information relevant to a concept, and instead rely on modality-specific decoders to flesh out important details. The “meaning” of a percept is not in the vector it is encoded as, but in the way associated decoders process this vector into meaningful outputs. As long as the various encoders and decoders are subject to modality-specific training objectives, “meaning” will be decentralized and potentially inconsistent across modalities, especially as a result of pre-training. This is not a recipe for the formation of coherent concepts.
Furthermore, it is not clear that today’s modalities are a suitable partitioning of the observation and action spaces for an embodied agent. It is not obvious that, e.g., images and text should be represented as separate observation streams, nor text production and motion planning as separate action capabilities. The human capacities for reading, seeing, speaking, and moving are ultimately mediated by overlapping cognitive structures. Making structural assumptions about how modalities should be processed is likely to hinder the discovery of more fundamental cognition that is responsible for processing data in all modalities. One solution would be to consolidate unnaturally partitioned modalities into a unified data representation. This would encourage networks to learn intelligent processes that generalize across modalities. Intuitively, a model that can understand the visual world as well as humans can — including everything from human writing to traffic signs to visual art — should not make a serious architectural distinction between images and text. Part of the reason why VLMs can’t, e.g., count the number of letters in a word is that they can’t see what they are writing.
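As one hedged illustration of what a “unified data representation” could mean — my sketch, not a proposal drawn from the article’s references — text can be rasterized into pixels and handed to the same perception system as any other image, so that written words and photographs share one observation stream.

```python
# Toy sketch: render text as pixels so a single (placeholder) perception system
# handles both written words and photographs.
from PIL import Image, ImageDraw
import numpy as np

def render_text_as_image(text: str, width: int = 256, height: int = 32) -> np.ndarray:
    """Rasterize a string into a grayscale image array in [0, 1]."""
    canvas = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(canvas).text((4, 8), text, fill=0)  # default bitmap font
    return np.asarray(canvas, dtype=np.float32) / 255.0

def perceive(pixels: np.ndarray) -> np.ndarray:
    """Stand-in for one shared vision encoder applied to *all* percepts."""
    return pixels.flatten()  # a real system would use a learned network here

text_percept = perceive(render_text_as_image("strawberry"))
photo_percept = perceive(np.random.rand(32, 256).astype(np.float32))  # placeholder photo
# Both inputs now live in the same observation stream; counting the letters in
# "strawberry" becomes a visual task rather than a tokenization artifact.
```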
Finally, the learn-from-scale approach trains models to copy the conceptual structure of humans instead of learning the general capability to form novel concepts on their own. Humans have spent hundreds of thousands of years refining concepts and passing them on memetically through culture and language. Today’s models are trained only on the end result of this process: the present-day conceptual structures that make it into the corpus. By optimizing for the ultimate products of our intelligence, we have ignored the question of how those products were invented and discovered. Humans have a novel capacity to form robust concepts from few examples and ascribe names to them, reason about them analogically, etc. While the in-context capabilities of today’s models can be impressive, they grow increasingly limited as tasks become more complex and stray farther from the training data. The flexibility to form new concepts from experience is a foundational attribute of general intelligence, and we should think carefully about how it arises.
While structure-agnostic scale maximalism has succeeded in producing LLMs and LVMs that pass Turing tests, a multimodal scale maximalist approach to AGI will not bear similar fruit. Instead of pre-supposing structure in individual modalities, we should design a setting in which modality-specific processing emerges naturally. For example, my recent paper on visual theory of mind saw abstract symbols naturally emerge from communication between image-classifying agents, blurring the lines between text and image processing. Ultimately, we should hope to reintegrate as many features of intelligence as possible under the same umbrella. However, it is not clear whether there is real commercial viability in such an approach so long as scaling and fine-tuning narrow intelligence models solves commercial use cases.
Conclusion
The overall promise of scale maximalism is that a Frankenstein AGI can be sewn together using general models of narrow domains. I argue that this is extremely unlikely to yield an AGI that feels complete in its intelligence. If we intend to continue reaping the streamlined efficiency of modality-specific processing, we need to be intentional about how modalities are united — ideally drawing from human intuition and classical fields of study, e.g. this work from MIT. Alternatively, we can re-formulate learning as an embodied and interactive process where disparate modalities naturally fuse together. We might do this by, e.g., processing images, text, and video using the same perception system and producing actions for generating text, manipulating objects, and navigating environments using the same action system. What we lose in efficiency we will gain in flexible cognitive ability.
In a sense, the most challenging mathematical piece of the AGI puzzle has already been solved: the invention of general function approximators. What is left is to inventory the functions we need and determine how they should be organized into a coherent whole. This is a conceptual problem, not a mathematical one.
Acknowledgements
I would like to thank Lucas Gelfond, Daniel Bashir, George Konidaris, and my father, Joseph Spiegel, for their thoughtful and thorough feedback on this work. Thanks to Alina Pringle for the wonderful illustration made for this piece.
Author Bio
Benjamin is a PhD candidate in Computer Science at Brown University. He is interested in models of language understanding that ground meaning to elements of structured decision-making. For more information, see his personal website.
Citation
For attribution in academic contexts or books, please cite this work as
Benjamin A. Spiegel, "AGI Is Not Multimodal", The Gradient, 2025.
@article{spiegel2025agi,
  author = {Benjamin A. Spiegel},
  title = {AGI Is Not Multimodal},
  journal = {The Gradient},
  year = {2025},
  howpublished = {\url{https://thegradient.pub/agi-is-not-multimodal}},
}
References
Andreas, Jacob. “Language Models, World Models, and Human Model-Building.” Mit.edu, 2024, lingo.csail.mit.edu/blog/world_models/.
Belkin, Mikhail, et al. “Reconciling modern machine-learning practice and the classical bias–variance trade-off.” Proceedings of the National Academy of Sciences 116.32 (2019): 15849-15854.
Kerbl, Bernhard, et al. “3D Gaussian Splatting for Real-Time Radiance Field Rendering.” ACM Transactions on Graphics, vol. 42, no. 4, 26 July 2023, pp. 1–14, https://doi.org/10.1145/3592433.
Chomsky, Noam. 1965. Aspects of the Theory of Syntax. Cambridge, Massachusetts: MIT Press.
Designing an Intelligence. Edited by George Konidaris, MIT Press, 2026.
Bender, Emily M., and Alexander Koller. 2020. “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5185–5198, Online. Association for Computational Linguistics.
Eye on AI. “The Mastermind behind GPT-4 and the Future of AI | Ilya Sutskever.” YouTube, 15 Mar. 2023, www.youtube.com/watch?v=SjhIlw3Iffs&list=PLpdlTIkm0-jJ4gJyeLvH1PJCEHp3NAYf4&index=64. Accessed 18 May 2025.
Frank, Michael C. “Bridging the data gap between children and large language models.” Trends in Cognitive Sciences vol. 27,11 (2023): 990-992. doi:10.1016/j.tics.2023.08.007
Garrett, Caelan Reed, et al. “Integrated task and motion planning.” Annual Review of Control, Robotics, and Autonomous Systems 4.1 (2021): 265-293.
Goodhart, C.A.E. (1984). Problems of Monetary Management: The UK Experience. In: Monetary Theory and Practice. Palgrave, London. https://doi.org/10.1007/978-1-349-17295-5_4
Hooker, Sara. “The hardware lottery.” Communications of the ACM 64, 12 (December 2021), 58–65. https://doi.org/10.1145/3467017
Huh, Minyoung, et al. “The Platonic Representation Hypothesis.” Forty-first International Conference on Machine Learning. 2024.
Kaplan, Jared, et al. “Scaling laws for neural language models.” arXiv preprint arXiv:2001.08361 (2020).
Lake, Brenden M., et al. “Building Machines That Learn and Think like People.” Behavioral and Brain Sciences 40 (2017): e253. Web.
Li, Kenneth, et al. “Emergent world representations: Exploring a sequence model trained on a synthetic task.” ICLR (2023).
Luiten, Jonathon, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. “Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis.” 3DV. 2024.
Mao, Jiayuan, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, and Jiajun Wu. “The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision.” International Conference on Learning Representations. 2019.
Mitchell, Melanie. “LLMs and World Models, Part 1.” Substack.com, AI: A Guide for Thinking Humans, 13 Feb. 2025, aiguide.substack.com/p/llms-and-world-models-part-1. Accessed 18 May 2025.
Mu, Norman. “Norman Mu | The Myth of Data Inefficiency in Large Language Models.” Normanmu.com, 14 Feb. 2025, www.normanmu.com/2025/02/14/data-inefficiency-llms.html. Accessed 18 May 2025.
Newell, Allen, and Herbert A. Simon. “Computer Science as Empirical Inquiry: Symbols and Search.” Communications of the ACM, vol. 19, no. 3, 1 Mar. 1976, pp. 113–126, https://doi.org/10.1145/360018.360022.
Peng, Hao, et al. “When Does In-Context Learning Fall Short and Why? A Study on Specification-Heavy Tasks.” ArXiv.org, 2023, arxiv.org/abs/2311.08993.
Spiegel, Benjamin, et al. “Visual Theory of Mind Enables the Invention of Early Writing Systems.” CogSci, 2025, arxiv.org/abs/2502.01568.
Sutton, Richard S. Introduction to Reinforcement Learning. Cambridge, Mass., MIT Press, 1998.
Vafa, Keyon, et al. “Evaluating the world model implicit in a generative model.” Advances in Neural Information Processing Systems 37 (2024): 26941-26975.
Vaswani, Ashish, et al. “Attention is All you Need.” In I. Guyon et al. (eds.), 31st Conference on Neural Information Processing Systems (NIPS). Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, Inc., 2017. arXiv:1706.03762.
Winograd, Terry. “Thinking Machines: Can There Be? Are We?” The Boundaries of Humanity: Humans, Animals, Machines, edited by James Sheehan and Morton Sosna, Berkeley: University of California Press, 1991, pp. 198–223.
Wu, Shangda, et al. “Beyond language models: Byte models are digital world simulators.” arXiv preprint arXiv:2402.19155 (2024).