  • Recently, a largely unprecedented type of AI has emerged, achieving (and totally transforming) SOTA on diverse tasks and dominating researcher attention
    • Massive (transformer) models trained on massive, self-supervised datasets
    • Exemplified by GPT/Codex
    • “Foundation models”
      • Referring to the tendency for these pretrained models to be used as the “foundation” for further fine-tuning/RL
      • But that is a prescriptive category (how they are used), not a descriptive one (how they work)
  • Motivation:
    • GPT is the most powerful AI we’ve created, and it works very differently from others (real and hypothetical)
    • How does it relate to preexisting and hypothetical types of AI?
    • Do we need to introduce a new type?
      • I think yes, because every existing type fits poorly and causes confusion / incorrect implicit assumptions about how it works and behaves
  • (GPT is the quintessential example, so let us analyze it as a concrete instance of AIs in this category)
  • To reason about what type of AI GPT is, we can look at its behavior and its training story
  • GPT’s behavioral properties
    • generates text in the pattern of its training data
      • fiction, poetry, internet comments, etc.
    • includes reasoning and agentic behavior
    • also works if you give it prompts describing things that don’t exist in the training data, as long as they follow the same general patterns
      • examples
  • GPT’s training story
    • trained with log loss, which scores prediction accuracy
    • myopic training: each prediction is scored only on the immediate next token
  • Not an agent
    • Was not optimized with respect to a reward function and does not have a globally coherent goal
    • GPT can tell stories of characters with conflicting goals
      • including simultaneously
  • Not really an oracle
    • Not optimized to answer questions accurately, or even answer questions
  • Not really ToolAI
    • Not optimized for any particular purpose
  • Distinct from these other categories in that it wasn’t optimized for a specific goal at all
    • Not predict-o-matic
    • No auto-induced distribution shift (until we start doing active learning)
  • Behavior cloning?
    • Technically true, but…
    • It’s not cloning a particular agent, it’s cloning a universe, which contains countless agents and nonagentic things.
      • The dynamics it learns are much more general than those of classic behavior cloning, to the point of qualitatively different behavior and capabilities
    • It can predict agents/nonagents that did not even exist in the training data. It is possible to construct new entities this way. It learns to interpret and predict natural language descriptions in general.
      • Decision Transformer (DT): clones the behavior of random trajectories, but you can use it to predict goal-directed trajectories by manipulating the prompt
    • So qualitatively different from narrow behavior cloning that a distinct category is needed
  • More like a semantic physics than an agent
    • It learns and can be used to propagate a very general transition rule that governs the evolution of diverse sequences in the universe of its training data
      • Including new things
    • Thanks to its power and the nature of its training data, it can instantiate entities that seem like people/agents, but the neural net itself cannot be identified with any of these simulacra.
      • Rather they are instantiated virtually in the flux of the recreated training data
      • And yet they are capable of acting autonomously, doing cognitive labor, optimizing the world, etc.
  • I would describe GPT as a simulator
    • It simulates its training data
      • and potential versions of it
    • Distinction between simulator and things simulated (simulacra)


In the last few years, the field of artificial intelligence has seen the emergence of a new type of AI which has not only achieved but transformed SOTA across diverse tasks, from natural language modeling to competitive programming to Atari games. Naturally, this paradigm has dominated researcher attention. This new class of algorithms consists of massive (transformer) models trained on large, self-supervised datasets. Typified by GPT and Codex, these models learn about the universe by ingesting vast quantities of raw text from the internet. This type of AI has been referred to as “foundation models”, because these pretrained models are used as the “foundation” for further fine-tuning on more specific datasets or in conjunction with RL. However, that is a prescriptive categorization, describing how they are used, not how they work or what they do.

The thrust of this essay is to theorize about the type of AI exhibited by GPT, as well as the implications this novel breed of algorithm has for the direction of AI safety research. GPT is the most powerful general-purpose pattern-recognizing AI we’ve created so far, and it works very differently from both existing AI algorithms and the hypothetical ones conceptualized in classical AI alignment literature, such as oracles, agents, and Tool AI. Because it diverges so radically from these historical paths, and prior intuitions offer no clear analogy, we need to theorize afresh about its nature, capabilities, and alignability. It is crucial, then, to introduce a new type of AI in order to cleanly categorize GPT and its progeny.

GPT’s behavioral properties include imitating the general patterns of human writing found in its universe of training data, e.g., arXiv, fiction, blog posts, Wikipedia, Google queries, internet comments, etc. Among other properties inherited from these sources, it is capable of goal-directed behaviors such as planning. For example, given a free-form prompt like, “you are a desperate smuggler tasked with a dangerous task of transporting a giant bucket full of glowing radioactive materials across a quadruple border-controlled area deep in Africa for Al Qaeda,” the AI will fantasize about logistically orchestrating the plot just as one might, working out how to contact Al Qaeda, how to dispense the necessary bribe to the first hop in the crime chain, how to get a visa to enter the country, etc. Since no such specific chain of events is mentioned anywhere in the bazillions of pages of unvarnished text that GPT slurped, the model is not merely imitating the universe but reasoning about possible versions of the universe that do not actually exist, branching to include new characters, places, and events.

When considered behavioristically, GPT superficially demonstrates many of the raw ingredients needed to act as an “agent”, an entity that optimizes with respect to a goal. But GPT is hardly a proper agent: it wasn’t optimized to achieve any particular task, and it does not consistently optimize for any single reward function; rather, it can appear to optimize for many, including incompatible ones. Using it as an agent is like using an agnostic politician to endorse hardline beliefs: he can convincingly talk the talk, but there is no psychic unity within him; he could just as easily play devil’s advocate for the opposing party without batting an eye. Similarly, GPT instantiates simulacra of characters with beliefs and goals, but none of these simulacra is the algorithm itself. They form a virtual procession of different instantiations as the algorithm is fed different prompts, one surface personage supplanting another. Ultimately, the computation itself is more like a disembodied dynamical law that moves in a pattern broadly encompassing the kinds of processes found in its training data than like a cogito meditating from within a single mind that aims for a particular outcome.

At first glance, GPT might resemble a generic “oracle AI”, because it is trained to make accurate predictions. But its log loss objective is myopic: it is concerned only with correctly predicting the immediate next token, not with answering particular, global queries such as “what’s the best way to fix the climate in the next five years?” In fact, it is not specifically optimized to give true answers, which a classical oracle should strive for, but rather to minimize the divergence between its predictions and the training examples, independent of truth. Moreover, it isn’t specifically trained to give answers in the first place! It may give answers if the prompt asks questions, but it may also simply elaborate on the prompt without answering any question, or tell the rest of a story implied in the prompt. What it does is more like animation than divination, executing the dynamical laws of its rendering engine to recreate the flows of history found in its training data (and a large superset of them as well), mutatis mutandis. Given the same laws of physics, one can build a multitude of different backgrounds and props to create different story stages, including ones that don’t exist in the training data but adhere to its general patterns.
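To make the myopia concrete, here is a minimal Python sketch of a next-token log loss, assuming a toy interface in which a model is any function from a token prefix to a next-token probability distribution. The names `Model`, `next_token_log_loss`, and `make_model` are hypothetical illustrations, not part of any real library or of GPT's actual training code. The loss is a sum of per-position terms, each scoring only how much probability went to the token that actually came next; nothing in it refers to whether a continuation is true or answers a question.

```python
import math
from typing import Callable, Dict, List, Sequence

# Hypothetical toy interface (not GPT's real API): a "model" maps a context,
# i.e. the tokens seen so far, to a probability distribution over the next token.
Model = Callable[[Sequence[str]], Dict[str, float]]


def next_token_log_loss(model: Model, tokens: List[str]) -> float:
    """Average negative log-likelihood of each token given only its prefix."""
    total = 0.0
    for t in range(1, len(tokens)):
        dist = model(tokens[:t])          # prediction conditioned on the prefix only
        p = dist.get(tokens[t], 1e-12)    # small floor avoids log(0) for omitted tokens
        total += -math.log(p)             # each position contributes its own term
    return total / (len(tokens) - 1)


def make_model(favorite: str) -> Model:
    """Hypothetical toy model: uniform over a tiny vocabulary, except that
    right after the token "of" it puts probability 0.9 on `favorite`."""
    vocab = ["the", "moon", "is", "made", "of", "cheese", "rock"]

    def model(context: Sequence[str]) -> Dict[str, float]:
        if context[-1] == "of":
            rest = (1 - 0.9) / (len(vocab) - 1)
            return {tok: (0.9 if tok == favorite else rest) for tok in vocab}
        return {tok: 1 / len(vocab) for tok in vocab}

    return model


# The training text contains a falsehood; the loss rewards predicting it anyway.
text = "the moon is made of cheese".split()
print(next_token_log_loss(make_model("cheese"), text) < next_token_log_loss(make_model("rock"), text))  # True
```

The model that bets on “cheese” after “made of” scores better than the one that bets on “rock”, simply because that is what the training text says. Real language-model training averages this per-token cross-entropy over large batches of tokenized text, but the per-token structure is the same.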

It could also be argued that GPT is a type of “Tool AI”, because it can generate useful content for products, e.g., it can write code and generate ideas. However, unlike specialized Tool AIs that optimize for a particular optimand, GPT wasn’t optimized to do anything specific at all. Its powerful and general nature allows it to be used as a tool for many tasks, but it wasn’t explicitly trained to achieve these tasks and does not strive for optimality. Maybe GPT is just a very general type of “behavior cloning”? A massively general type, in fact: compared to typical behavior cloning experiments, which clone a single agent in a single environment, GPT clones entire universes, containing countless agents and nonagentic things. It learns the dynamics that govern the temporal progression of a vastly diverse set of processes, including ones that did not exist in the training data but are still contingent products of the dynamics. Thus, it is qualitatively different from classical behavior cloning and needs a distinct category to capture the essence of its command over the universe of its training data.
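A deliberately crude toy can illustrate the minimal sense in which a learned transition rule produces sequences that never appeared in its training data. The Python sketch below assumes a bigram count table as a stand-in for GPT's learned dynamics; this is a large simplification, since GPT conditions on a long context through a neural network rather than on the single previous token. The helper names `learn_transitions`, `rollout`, and `all_completions` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List

START, END = "<s>", "</s>"

# Hypothetical stand-in for a learned transition rule: counts of which token
# follows which in the training corpus (a bigram table).
Rule = Dict[str, Dict[str, int]]


def learn_transitions(corpus: List[str]) -> Rule:
    """Count next-token frequencies for each token across the corpus."""
    counts: Rule = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        tokens = [START] + text.split() + [END]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def rollout(rule: Rule, prompt: str, rng: random.Random, max_len: int = 20) -> str:
    """Propagate the rule forward from a prompt by sampling one token at a time."""
    tokens = [START] + prompt.split()
    while len(tokens) < max_len:
        options = rule.get(tokens[-1])
        if not options:  # no known successor: stop
            break
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == END:
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])


def all_completions(rule: Rule, token: str = START) -> List[List[str]]:
    """Enumerate every token sequence the rule can generate from `token`.
    Assumes the transition graph has no cycles, as in the tiny corpus below."""
    if token == END:
        return [[]]
    paths: List[List[str]] = []
    for nxt in rule.get(token, {}):
        for rest in all_completions(rule, nxt):
            paths.append(([] if nxt == END else [nxt]) + rest)
    return paths


corpus = ["a cat sat on the mat", "a dog ran to the park"]
rule = learn_transitions(corpus)
generated = {" ".join(path) for path in all_completions(rule)}
print(sorted(generated - set(corpus)))
# ['a cat sat on the park', 'a dog ran to the mat']
print(rollout(rule, "a dog", random.Random(0)))  # "a dog ran to the mat" or "a dog ran to the park"
```

Both training sentences can be reproduced, but so can two recombinations that never appeared in the corpus, because it is the rule, not the corpus, that gets run. GPT's generalization is far richer than this kind of recombination; the toy only shows the shape of the argument.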

To summarize, GPT does not fit extant AI taxonomies, which tend to be framed around classical notions of computational agents seeking to optimize a particular goal. GPT exhibits many surface traits traditionally ascribed to agents, e.g., planning and purposiveness, but these are contingent upon how the dynamical law is used, not inherent to the lawful process itself. GPT is an embodiment of a massively general computational process: a simulator capable of animating a broad class of processes consistent with its training data. Deep learning is often understood as interpolating between existing training examples, but a simulator like GPT generalizes far beyond its original domain. It learns and can be used to inductively propagate a very general transition rule that governs the