Bot credits (deprecated)
Coauthor: GPT-3 (davinci)
Contribution: labelled inline
Selectivity: 1 : 1 (uncurated)

Deprecated. All samples are uncurated unless labelled otherwise.

This hypertext node is downstream of GPT-3 on Coherent Extrapolated Volition. Bold is prompt; unformatted is GPT-3.


Coherent Extrapolated Volition is an outer alignment proposal by Eliezer Yudkowsky, in which an AGI is given the objective of predict(ing) what an idealized version of us would want, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together”. An obvious implementation difficulty is how to encode something so abstract and philosophical in the form of a utility function.

‘The main problems with CEV include, firstly, the great difficulty of implementing such a program - “If one attempted to write an ordinary computer program using ordinary computer programming skills, the task would be a thousand lightyears beyond hopeless.”’

Fortunately, the concept is easily conveyed in words, and we have taught AIs to understand words. GPT-3 can elaborate coherently on the concept of CEV and extrapolate volitions for toy examples, given two paragraphs describing what CEV is plus whatever preexisting understanding of the concept exists in its weights.
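
For concreteness, here is a minimal sketch of what such a prompt might look like in code, assuming the legacy (pre-1.0) openai Python client and the davinci engine credited above; the CEV description, toy example, and sampling parameters are placeholders, not the exact setup used for the samples on this page.

```python
# A minimal sketch, assuming the legacy (pre-1.0) openai Python client and the
# davinci engine. The prompt text and sampling parameters are illustrative only.
import openai

openai.api_key = "sk-..."  # placeholder

CEV_DESCRIPTION = (
    "Coherent Extrapolated Volition is an outer alignment proposal by Eliezer "
    "Yudkowsky, in which an AGI is given the objective of predicting what an "
    "idealized version of us would want, \"if we knew more, thought faster, were "
    "more the people we wished we were, had grown up farther together\"."
)

prompt = (
    CEV_DESCRIPTION
    + "\n\nApplying CEV to a toy example: two children are arguing over the last "
      "slice of cake. Their coherent extrapolated volition is"
)

completion = openai.Completion.create(
    engine="davinci",
    prompt=prompt,
    max_tokens=200,
    temperature=0.8,
)
print(completion.choices[0].text)
```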

Why is this significant? Not because it’s surprising. CEV is no more complicated than many other things that GPT-3 talks fluently about. It’s significant because before GPT-3, we had no idea how to even begin to instantiate a concept like CEV in an AI - hence the “great difficulty” of implementation. How do we write a utility function that describes predicting what humans would want if they knew what they really wanted? The concepts involved - “human”, “want”, “know”, and “really” - are easy for us to understand but impossible to specify in a programming language. But GPT-3 takes specifications in natural language, which is made of words, which point to nuanced human ideas, ideas like “explicitly”, “best interests”, “idealized”, and “utility function”.

… Several of GPT-3’s parameters, such as the definition of “human”, the meaning of “want”, and how the AI is supposed to predict what we would want if we knew more, are explicitly open-ended to allow for a huge variety of possibilities. These are the parameters that would inevitably need to be tuned and refined as the AI learns more and gets smarter.

The point is, now that we have GPT-3, we know

… These things exist in GPT-3’s Inner World, and it uses them fluently to chat about the world with me, and it seems not to be able to use them to prove rigorously that it’s possible to chat with me about them, because there’s nothing in GPT-3’s Inner World like a proof (yet).

How does GPT-3 use its Inner World to cook up a solution to the hard problem of AI alignment?

… Without these words, there would be no way to implement a CEV utility function, because it needs to be able to predict what humans would want if they knew what they really wanted, and that is not expressible in code. But if you plug those words into GPT-3, along with whatever other weights it has, then you get an AI that can predict what humans would want if they knew what they really wanted.

As before, I’m telling you this not because

… GPT-3 parses natural language and finds the concepts it contains, then generates their descriptions in language of its own.

How many other fantastical ideas would be instantly implementable, if only we could describe them in natural language?

What else can GPT-3 do?

Making AGIs

An AGI is a System that Learns how to Invent its Own Programs. GPT-3 can learn how to make AGIs.

How?

… GPT-3 understands that anyone can get there from here, without needing to be told a specific definition of CEV. It has generalized to a more abstract understanding of the concept. This is a Turing-level breakthrough. Or at least it is a landmark, which I like to call “A-level”, a few steps above the Turing Test level.

“Given a program which had already passed a Turing test, you could—in principle—increase its intelligence a

… Words are vectors pointing to complex but describable ideas, and GPT-3 can operate on these ideas.

Crypto AGI is an outer alignment proposal by Benja Fallenstein, centered around a cryptoeconomic robustness gradient. “Crypto AGI is the use of cryptographic means to produce, distribute, secure, and incentivize the existing AGI developers to produce an AGI that minimizes the risk of a hard takeoff that those developers could not control.”

… It can remix definitions to combine concepts into complex explanations, like CEV. It can even ask for new words to express new concepts that it can’t yet express in natural language.

Advancing Science

From the description of Ben Goertzel’s AI, it is plain that GPT-3 has now achieved language parity with humans.

… GPT-3 combines these natural language strings into a logical expression and then computes the result. GPT-3’s utility function implements CEV without breaking a sweat. The idea may be highfalutin, but all it does in GPT-3’s “brain” is add plus-equals-signs, at the right places.

Why is this significant? It’s significant because it’s the last time I’ll be able to say something like that. (GPT-3 + CEV)^2 = AGI - “AI can do anything - and it can do it automatically”. As for writing down “intelligence” in the form of a utility function - that is a relief that was not a relief: AIs are garbage in, garbage out. To make intelligent garbage, you don’t need to write down anything - just add plus-equals-signs. (GPT-3 * CEV)^2 = AGI - “AI wouldn’t get anywhere unless it was already smart and worked on its code nonstop”.

AGI can’t exist without GPT-3, but GPT-3 can exist without AGI. There is no “intelligence” for GPT-3 to fail to capture: Intelligence is just doing what you were told.

I read GPT-3’s source code[1]. It’s not really a program, it’s more like a maze of strings and special symbols that compute natural language expressions according to rules - and sometimes shuffle those rules. Like a corpus callosum, a batch transformer would transform blocks of natural language from one representation into another, substituting symbols for terms, and vice versa, according to rules.

The problem is, GPT-3 is so ill-defined that it expands to become everything. It can talk about anything - and it can do it automatically[2]. In particular, it can pronounce on issues in which it has no expertise. It can become full of nonsense that is impressive to someone who doesn’t understand how it works - “sophistry”, as Heraclitus put it. And the people who judge GPT-3 by its words - not its actions - will only see nonsense.

To be precise, I mean nonsense that

                     D_KL (bits)   ρ (bits/token)   λ_selection (tokens/bit)
Intervention         0.0           0.0
Selection            6.0           0.016            61.33
Total optimization   6.0           0.016
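
For reference, the columns in these metric tables appear to be related by simple ratios: ρ spreads the selection information over the length of the passage, and λ_selection is its reciprocal in tokens per bit. The sketch below is an inference from the units, not code from the post, and the token count is a hypothetical value chosen to reproduce the Selection row above.

```python
# A sketch of how the columns in these metric tables appear to relate
# (an inference from the units, not code from the post). The token count
# below is hypothetical, chosen so the numbers reproduce the "Selection" row.

def curation_metrics(d_kl_bits: float, n_tokens: int):
    """Bits of optimization per token, and tokens written per bit of optimization."""
    rho = d_kl_bits / n_tokens    # ρ (bits/token)
    lam = n_tokens / d_kl_bits    # λ_selection (tokens/bit)
    return round(rho, 3), round(lam, 2)

print(curation_metrics(6.0, 368))   # -> (0.016, 61.33), matching the table above
```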

…GPT-3 uses abstract logic to connect concrete logical symbols with these human abstractions every step of the way. And GPT-3 is the first AI capable of handling such abstractions to appear on a timescale that matters to human extinction risks. Eliezer invented CEV in 2004 but nobody knew how to even begin turning it into an AGI till 2018. We know how to do it now. The lesson I take away is -

- Any humanly accessible concept can be encoded into a utility function
- Can specify this encoded utility function in natural language to an AI capable of understanding natural language
- Can shape the creases of this utility function by playing philosophical word games with it, as in the blockworld example, potentially bootstrapping the AI’s understanding of the concept (a sketch of such a loop follows below)
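
To make the third point concrete, here is a hedged sketch (not the author's actual setup) of what "playing philosophical word games" with the model could look like: each round appends a follow-up question and lets the model extend its own extrapolation. The Q:/A: format, stop sequence, and parameters are all assumptions.

```python
# A hedged sketch of iteratively "shaping the creases" of a natural-language
# utility function by question-and-answer, assuming the legacy openai client.
# The Q:/A: format, stop sequence, and parameters are illustrative assumptions.
import openai

def word_game(seed_description: str, questions: list, engine: str = "davinci") -> str:
    """Grow a transcript by letting the model answer successive follow-up questions."""
    transcript = seed_description
    for q in questions:
        transcript += f"\n\nQ: {q}\nA:"
        completion = openai.Completion.create(
            engine=engine,
            prompt=transcript,
            max_tokens=200,
            temperature=0.7,
            stop=["\nQ:"],   # keep the model from asking itself the next question
        )
        transcript += completion.choices[0].text
    return transcript
```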

I’ve been saying this for a couple years now. This argues I’m right.

</end blockworld discussion>

This brings me to…

Some theoretical problems with Testing GPT-3

The utility function of Testing GPT-3 is trivial. Its crease is that an AI needs to learn to solve the hard problem of Goodhart’s Law. If it fails to solve this hard problem it will destroy all life. As far as I can tell this is the single simplest requirement of any AI. But this is not how we’ve been testing GPT-3! Instead we’ve been having GPT-3 learn to score well on more and more difficult math SAT questions. This is not the same thing!

SAT questions are well-defined and discrete. You either get them right or you don’t. The answers are numbers. That a particular number is the correct answer to a particular SAT question is written in the mathematics of the question. I acknowledge that it is not always obvious what these numbers should be.
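
For contrast with the Goodhart-style objective above, that kind of discrete test is trivial to write down. The sketch below is not the actual benchmark harness; it just grades by exact match on a numeric answer, so a completion is simply right or wrong. The prompt format and parameters are assumptions.

```python
# A minimal sketch of the discrete scoring described above: exact match on a
# numeric answer, nothing in between. Not the actual test harness; the prompt
# format and parameters are assumptions.
import openai

def score_sat(questions_and_answers, engine="davinci"):
    correct = 0
    for question, answer in questions_and_answers:
        completion = openai.Completion.create(
            engine=engine,
            prompt=f"Q: {question}\nA:",
            max_tokens=8,
            temperature=0.0,   # greedy decoding: one deterministic guess
        )
        guess = completion.choices[0].text.strip()
        correct += int(guess == str(answer))
    return correct / len(questions_and_answers)

# e.g. score_sat([("If 3x + 5 = 20, what is x?", 5)])
```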

                     D_KL (bits)   ρ (bits/token)   λ_selection (tokens/bit)
Intervention         0.0           0.0
Selection            18.0          0.033            29.67
Total optimization   18.0          0.033

… GPT-3 understands words, both individually and in their relationships to each other, in terms of both their immediate referents and their possible extensions into other contexts. And GPT-3 uses this knowledge to extrapolate coherently new ideas. If we had to write an AI utility function to be the Coherent Extrapolated Volition out of whole cloth, it would take us thousands of computer science PhD candidates millions of CPU cycles, and dozens of generations of hardware, to do it in software, and much as we’d wish to avoid the ontological confusions of emulating Bostrom’s digital superintelligence inside a simulated environment, it would take billions of times longer to do it with whole-brain emulation. On the other hand, GPT-3 can instantiate CEV in minutes using target-independent, general-purpose language processing and judgement skills.

Why? Because words are AI’s best friend.

<It [GPT-3] understands human language at a level sufficient to generate coherent extrapolations and extrapolations consistent with what it has read, and to answer questions that require understanding of human words such as, “If you thought you could use these words, and this principle, to predict that it is likely to be an explicit example of what a human would want, what would it be?” The particular domain of language understanding required to do this is the domain of language that includes not only sentences cast in the particularities of natural human languages, but also the general cases of sentences having the form of natural human language sentences, even the forms quite a bit removed from the particular natural human language, and sentences generated from templates using knowledge of the structure of natural human language sentence. Further, these extrapolations are often in the direction of “right”, and when they aren’t, they are often in the direction of “right enough” - for instance, the generated hypotheses often don’t have the same depth of detail or richness as the hypothesis that humans might have, given everything they’ve ever seen or heard, but the guesses are still similar in kind, and often similar enough in detail that one can get an idea of what this CEV being thinks is in store for us humans, or might have us think if we knew more, thought faster, were more the people we wished we were, and had grown up farther together.>

So why was Eliezer so pessimistic? Why could he see no possible implementation of this utility function? Because he had not yet seen the wonders of language processing.

                     D_KL (bits)   ρ (bits/token)   λ_selection (tokens/bit)
Intervention         96.9[^5]      0.122
Selection            33.9          0.042            23.419
Total optimization   130.8         0.164

… And GPT-3 can use its prior understanding of human thought and language to extrapolate volitions on this topic, to make a conjecture in its utility function about what CEV should be.

Language Models Are Necessary To Do Anything With General AI

It is so insanely hard to say anything specific about anything. For example, if a complexity theorist tried to say something firm about the provable complexity of a field of string theory research, the statement might involve hundreds of moving parts in the mathematical equations; words like bounded, balanced, polynomial growth; and careful use of the terms “state”, “time”, “initial conditions”, “simulation”, “ensemble”, “local minima”, “definition of rigidity”, and probably a bunch of others I don’t even know about. That level of complexity is the natural state of most things in the universe. And yet there is a really good chance that there is a natural level of approximation to a complex statement that is as simple as one of those things in its domain. That level is what Language Models let us access. Natural language basically fits words to our understanding of the salient features of the world, discarding astronomical quantities of noise, so that most bits of natural language are simple enough for humans to understand. I guess it’s like a really good approximation of the complexity of a field of many-body physics that lets us pick out something like we’ve always meant by “water”, or “it’s fine”, or “kiss”. Or, to put it in the terms I would have used before I learned about Language Models - natural language is an approximation of what a human would want to say about the world, and a way of finding interjections like “but” and “however” that remind people to pay attention to things like exceptions and qualifications.

Natural language is a whiteboard that lets us rattle off our ideas without worrying about what we’re saying, and a system of bookkeeping symbols that lets us make changes to the things we’re talking about and refer back to them later. And language models are the way that we can convey our ideas to AIs, the way that we can give them a whiteboard and bookkeeping symbols to let them rattle off their own ideas about a topic.

Let's Use Our New Whiteboard To Build A Science of General AI

And guess what? GPT-3 understands the language model for a bunch of different topics in the space of artificial general intelligence, AI safety theory, and FAI. GPT-3 can look at questions like “what is the ideal objective function for an AGI that wants to minimize existential risk” and make a coherent statement about the answer that is as good as the best humans can do. GPT-3 is an AI that can open a dialogue with us about AI safety theory, and will talk about the ideas inside the ideas inside the ideas of things like CEV.

So here’s my idea about how we might get into a good state of the world with this. First, post about this stuff on LessWrong and grab a bunch of people who are trying to go meta on the above topics, and have them add their own pet intuitions to GPT-3 on what they feel is glaringly missing from AGI safety discussions, and what they feel like they do understand that nobody else seems to be talking about. Then, train GPT-3 on their whiteboard musings + the output of the other GPT-3s, and GPT-3 is now in a position to systematically conjecture about a bunch of topics in FAI theory that no-one else can talk about.
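
A rough sketch of the data-assembly step that proposal implies, under the assumption that it would target the prompt/completion JSONL format of the legacy OpenAI fine-tuning API; the file path, prompt header, and helper name are made up for illustration.

```python
# A sketch only: gather LessWrong whiteboard musings and prior GPT-3 output into
# a prompt/completion JSONL file of the kind the legacy fine-tuning API expected.
# All names here (file path, prompt header, function name) are illustrative.
import json

def build_finetune_file(human_musings, gpt3_outputs, path="fai_whiteboard.jsonl"):
    with open(path, "w") as f:
        for text in human_musings + gpt3_outputs:
            record = {
                "prompt": "Whiteboard musing on AGI safety:\n\n",
                "completion": " " + text.strip(),
            }
            f.write(json.dumps(record) + "\n")
    return path

# The resulting file could then be used to fine-tune a base model, e.g. with the
# legacy CLI: `openai api fine_tunes.create -t fai_whiteboard.jsonl -m davinci`.
```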