Coherent Extrapolated Volition is a proposal by Eliezer Yudkowsky for an ideal objective function, in which an AGI is given the objective of predicting what an idealized version of us would want, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together”. An obvious implementation difficulty is how to encode something so abstract and philosophical in the form of a utility function.
The main problems with CEV include, firstly, the great difficulty of implementing such a program - “If one attempted to write an ordinary computer program using ordinary computer programming skills, the task would be a thousand lightyears beyond hopeless.”
But the concept is easily conveyed in words, and we have taught AIs to understand words. GPT-3 can elaborate coherently on the concept of CEV and extrapolate volitions for toy examples, given only two paragraphs describing what CEV is plus whatever preexisting understanding of the concept exists in its weights.
Why is this significant? Not because it’s surprising. CEV is no more complicated than many other things that GPT-3 talks fluently about. It’s significant because before GPT-3, we had no idea how to even begin to instantiate a concept like CEV in an AI - it seemed “a thousand lightyears beyond hopeless”, as Eliezer put it. How do we write a utility function that describes predicting what humans would want if they knew what they really wanted? The concepts involved - “human”, “want”, “know”, and “really” - are easy for us to understand but impossible to specify in a programming language. But GPT-3 takes specifications in natural language, which is made of words, which point to nuanced human ideas, ideas like “explicitly”, “best interests”, “idealized”, and “utility function”.
Of course, GPT-3 does not understand CEV nearly well enough to be entrusted with computing or implementing it (neither does any human), but the fact that it can begin to competently discuss the idea and its ramifications indicates an opportunity for prosaic alignment.
Currently, the following subtree is uncurated: I have included every response I generated from GPT-3, without modification. This may change as I add more content.
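As a rough illustration of what an uncurated subtree amounts to, here is a minimal Python sketch. It is not the tooling actually used to produce the outputs in this post. The `Completer` type, `branch`, and `echo_completer` are hypothetical names introduced only for this example, and `echo_completer` is a toy stand-in for a language model call, not a real API. The idea is simply that a fixed natural-language description is followed by a short human-written stem, several continuations are sampled for each stem, and every sample is kept.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language-model call: takes the text so far and
# returns one sampled continuation. This is an assumption, not a real API.
Completer = Callable[[str], str]


def branch(prompt: str, stems: List[str], complete: Completer,
           samples_per_stem: int = 2) -> Dict[str, List[str]]:
    """Sample several continuations for each human-written stem.

    Every sample is kept, mirroring an uncurated tree of responses.
    """
    tree: Dict[str, List[str]] = {}
    for stem in stems:
        # The model sees the fixed description followed by the stem.
        # An empty stem means "continue directly from the description".
        context = f"{prompt}\n\n{stem}".rstrip()
        tree[stem] = [complete(context) for _ in range(samples_per_stem)]
    return tree


def echo_completer(context: str) -> str:
    """Toy completer for demonstration only; it just reports context size."""
    return f"[continuation of {len(context)} characters of context]"


if __name__ == "__main__":
    # Shortened description for the demo; the full prompt would go here.
    description = ("In calculating CEV, an AI would predict what an "
                   "idealized version of us would want.")
    stems = ["", "For example,"]
    for stem, outputs in branch(description, stems, echo_completer).items():
        print(repr(stem), outputs)
```

Swapping `echo_completer` for a function that actually queries a model would be the only change needed to turn this into a real sampling loop. Nothing here filters or ranks the samples, which is the sense in which such a tree is uncurated.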
From here on out, bold is human-written text; unformatted is GPT-3. The initial prompt is from this LessWrong post.
Coherent Extrapolated Volition was a term developed by Eliezer Yudkowsky while discussing Friendly AI development. It’s meant as an argument that it would not be sufficient to explicitly program what we think our desires and motivations are into an AI, instead, we should find a way to program it in a way that it would act in our best interests – what we want it to do and not what we tell it to.
Related: Friendly AI, Metaethics Sequence, Complexity of Value
In calculating CEV, an AI would predict what an idealized version of us would want, “if we knew more, thought faster, were more the people we wished we were, had grown up farther together”. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge. This initial dynamic would be used to generate the AI’s utility function.
… For example,
… The result may be something very different than current human values, as Yudkowsky argues that our present values are too inconsistent to be ideal. In addition, the CEV value would include possible posthuman values, which may be significantly divergent from current human values.
Other arguments augment CEV to produce a larger picture of ideal decisions. If limited, as humans are, then CEV would not include all complements and limits of lived experience. Therefore, attempting to ground the goals of
… This anticipatory figuring of what a maximally-developed-human version of oneself would want would be the basis for making decisions in the future. This anticipatory dynamic would only be restricted by anthropic rules and limits such that the AI does not preclude its own existence and the existence of humanity-at-large There is a potential abuse of the dynamic, with the AI’s utility function not ending up converging on what we would want – converging on some desire to destroy the world – after making
… The choice of initial dynamic is crucial to Friendly AI, because the AI would want to be powerful enough to implement that dynamic.
Yudkowsky initialised that for his rough takeoff scenario in an input for a seed AI, one would ask: “What do you think would be a good way for us to share the universe? What do you think [I, the AI] should do in order to share the universe with you as effectively as possible?”
… Since the AI cannot be expected to fully predict what humans would want, the resulting utility function would be based on a probabilistic model, which could then be tweaked by the AI to map to actual human values over time. If the probabilistic model was able to predict the human preference accurately, then the model itself becomes the CEV, and the AI is happy to follow it.
… It would then recursively refine the output, by working backwards from the idealized predicted values.
A lot of people that talk about CEV seem to think that using a CEV-derived utility function would, conceptually, solve the AI control problem. The argument usually goes that if the AI was fed this utility function then it would be able to act in accordance to our wishes because it would know what we want it to, despite us not explicitly knowing ourselves.
… Utilitarianism as ethics of CEV (the AI’s morality) is the logical extreme of coherent extrapolated volition, with present desires heavily influencing future desires. Yudkowsky states that the creation of this metaphorical idealized human constitutes the emergence of a new person and has been argued to approach personhood.
For example, if you then ask the AI about what would be the best way to distribute a stack of cash between yourself and a stranger, the AI would do the extrapolation
… Doing this, CEV would be able to “encompass the intrinsic values of all human beings” in contrast to traditional decision theories, such as classical utilitarianism, which assume “a single, common value for all of humanity, which can then be optimized”. This single, common value would be called human preferences, “without connecting them to any particular human being’s implementation or execution”.