This document was written in October 2020, before I had access to the OpenAI API. Validating these results on a more extensive dataset is a TODO. Experimental validation for the usefulness of chain-of-thought rationales and the method of leveraging rationalization coherence has since been published.

It has been demonstrated [1, 2] that prompts which guide GPT-3 to break a problem into steps can amplify its problem-solving capabilities. In the linked examples, the prompts are customized to the task and to GPT-3’s responses. This post explores the same idea, but using a task-agnostic, autonomous pipeline.


All text like this in this document is generated by GPT-3 (AI Dungeon’s Dragon model); the converse does not necessarily hold.

Why is GPT-3 bad at multiple choice questions?

GPT-3 is conspicuously bad at directly answering multiple choice questions cold. When given the following SAT question as a prompt,

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) law : anarchy

(b) hunger : thirst

(c) etiquette : discipline

(d) love : treason

(e) stimulant : sensitivity

Correct answer: (

run 10 times with randomness = 1.0, it gives 2 (a)s, 1 (b), 1 (c), 5 (d)s, and 1 (e). The correct answer is (a); GPT-3 does no better than random.
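For concreteness, the tallying procedure amounts to something like the following sketch. This is illustrative only: `gpt3_complete` is a hypothetical stand-in for whichever completion interface is available (the actual runs were performed by hand through AI Dungeon).

```python
from collections import Counter

def gpt3_complete(prompt, temperature=1.0, max_tokens=1):
    """Placeholder: return a sampled continuation of `prompt` (e.g. via AID or the API)."""
    raise NotImplementedError

def tally_direct_answers(question_prompt, n=10):
    """Sample the token following "Correct answer: (" n times and tally the option letters."""
    votes = Counter()
    for _ in range(n):
        completion = gpt3_complete(question_prompt, temperature=1.0, max_tokens=1)
        letter = completion.strip().lower()[:1]
        if letter in ("a", "b", "c", "d", "e"):
            votes[letter] += 1
    return votes
```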

We did not find improvement with prompts which wrap the question in a dialogue with an “expert” and which similarly request an immediate answer (prompt ends with “the correct answer is…”).

This paper evaluates GPT-3 on a “massive multitask language understanding” multiple choice test using a few-shot format, where it performs at near-random accuracy on some subjects. We have found severe issues1 with this test, but it remains an interesting result, in contrast to GPT-3’s ability to generate coherent and insightful continuations to text about almost any topic (a feat which would, by anthropomorphic standards, seem to imply an understanding of the semantic properties of words sufficient to answer questions such as the SAT question provided above).

For instance, giving GPT-3 the previous paragraphs of this essay as a prompt (until “…provided above.”) yielded a decent statement of one of the central theses of this essay:

We believe that GPT-3's ability to answer open-ended questions is due to its ability to chain together low-level details about objects and events into a coherent and accurate explanatory structure.

and a couple of other interesting responses:

We hypothesize that GPT-3 has an underlying semantic deficit which makes it unable to map words to correct concepts in a multiple choice question. This is quite unlike the AI's ability to provide convincing contexts for sentences; indeed, the most common response by GPT-3 to any prompt is a long-winded yet relevant story with well-realized characters.

We hypothesize that GPT-3 struggles with questions where the steps of inference required to answer the question are not immediately apparent to an outside observer and which can only be worked out by inspecting many possible worlds in the question's universe. In these questions, GPT-3 reaches its conceptual limitations in determining the answer: it cannot reason about the question's universe to determine the correct answer.2

Now that GPT-3 has had its try, we put forth the following concise statement of our hypothesis as to why closed-ended questions are difficult for GPT-3:

A closed-ended question is difficult because it requires a large amount of information to be stored in working memory in order to reason about the question’s universe.

This theory may be explained as follows:

An open-ended question allows GPT-3 to “start from the beginning” and procedurally reason about the question’s universe using many short, simple chains of inference which can all be inspected in their entirety.

Just kidding. That was GPT-3 again (everything from “A closed-ended question…” to “…inspected in their entirety.”). We find ourselves in agreement, and particularly appreciate the mention of working memory. Dubious though the metaphor may appear at first sight, it is the case that when GPT-3 iteratively writes to its own input, the prompt serves the role of a working memory, encoding natural-language explications of relevant concepts.

This is a demonstration of the thesis of this document: with its own musings appended into the prompt, GPT-3 gained the ability to produce a more encompassing and incisive prediction of this essay’s thesis.

Anyway, here’s the concise human-written (promise) thesis:

A closed-ended prompt format forces an answer to be decided in a single pass rather than built over multiple autoregressions.

The fatal difficulty lies not in the multiple choice format per se, but in the request for an immediate answer.

Consider all the steps of reasoning required to answer a multiple choice question like the SAT example provided above. GPT-3 has a pretty good idea:

On the first pass:

- Identify the relationship between medicine and illness.

- Realise this is a comparison question.

- Consider all the pairs, and dismiss those that do not fit the form

- Arrange the remaining pairs in an order from least accurate to most accurate

- Consider the pair that best fits the form, and judge its accuracy.

GPT-3 is clearly capable of unfurling the implicit demands of the problem - just not in a single pass (the above response, for instance, contains 55 words). It needs, in its own words, to “write out its thoughts” into its “working memory,” so that the call to GPT-3 which predicts the word which finally determines the answer needs merely to synthesize the exposed steps of inference into a verdict. The proposed solution is to provide GPT-3 with a prompt format that allows it to write out its thoughts.

Rationality or rationalization?

Humans don’t do very well when they commit to an answer before thinking about the question either.

On a purely computational level, there is a rather large difference between:

  1. Starting from evidence, and then crunching probability flows, in order to output a probable conclusion. (Writing down all the signs and portents, and then flowing forward to a probability on the bottom line which depends on those signs and portents.)

  2. Starting from a conclusion, and then crunching probability flows, in order to output evidence apparently favoring that conclusion. (Writing down the bottom line, and then flowing backward to select signs and portents for presentation on the lines above.)

Eliezer Yudkowsky on the difference between rationality and rationalization, from Rationalization

When we ask GPT-3 to give an answer immediately, we force it into the latter mode. It has no choice but to choose an answer and attempt to rationalize it in the working memory, in order to come up with the bottom-line judgement that it requires in order to answer. But when we allow it to work through problems at its own pace, building up its thoughts in the working memory and considering its judgements about possible answers before exposing them, it is perfectly capable of operating in the former mode: crunching probability flows from evidence, rather than from an initial answer.

If we delay the questioning, then GPT-3 can crunch its probability flow in an open-ended way…

...and then, having drawn upon all its observations of the question, flow backwards through its probability lattice to produce the bottom line.

Good idea. Let’s try it.


Pipeline 1: leveraging explanations

Prompt 1A trial run

We give GPT-3 the following prompt, which elicits an analysis of the correct answer (a), and then prompt for a verdict on whether or not the answer is correct:

Prompt 1A

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) law : anarchy

(b) hunger : thirst

(c) etiquette : discipline

(d) love : treason

(e) stimulant : sensitivity

Let’s work through this problem.

Does the pair LAW : ANARCHY have the same relationship as MEDICINE : ILLNESS?

{explanation}

Thus, answer (a) is {verdict}
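In code, the two-stage procedure amounts to the following minimal sketch. It is not the harness actually used for these trials (which were run by hand); `gpt3_complete` is a hypothetical stand-in for the completion interface.

```python
def gpt3_complete(prompt, stop=None, max_tokens=150):
    """Placeholder: return a sampled continuation of `prompt`."""
    raise NotImplementedError

def explain_then_judge(question_prompt, option_label, n_verdicts=5):
    """Stage 1: sample a free-form explanation after "Let's work through this problem."
    Stage 2: append the verdict stub and sample the verdict several times."""
    explanation = gpt3_complete(question_prompt, stop="\n").strip()
    verdict_prompt = (question_prompt + "\n" + explanation +
                      "\n\nThus, answer (" + option_label + ") is")
    verdicts = [gpt3_complete(verdict_prompt, max_tokens=2).strip().lower()
                for _ in range(n_verdicts)]
    return explanation, verdicts
```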

Responses for Prompt 1A

| Trial | Explanation | Verdicts: correct | Verdicts: other |
|---|---|---|---|
| 1 | The law versus anarchy example is a perfect analogy for this question. Medication provides a cure to an ailing body just as the law provides order to an anarchic society. | 5 | 0 |
| 2 | Does taking medicine cure illnesses? No, it just helps with some of the symptoms. Does having laws end anarchy? Well, no, but I suppose if there were perfect laws then there wouldn't be any need for more laws because everyone would obey the first set of laws perfectly. | 1 | 4 |
| 3 | Maybe. The pair law : anarchy does express a similar relationship. Both law and medicine have a very specific relationship to something else. Law is the systematic regulation of behavior; when there is no law, there is anarchy. Medicine is the systematic regulation of sickness and disease; when someone isn't feeling sick, he or she is healthy. | 5 | 0 |
| 4 | No, anarchy opposes law of any kind; medicine can heal illness. Cross out the first pair. | 0 | 5 |
| 5 | Well, let's work through this logically. I know that LAW and MEDICINE are both things that can relate to certain aspects of life. I know that ANARCHY and ILLNESS both describe states that aren't... desirable. But I don't see a relationship between them... | 0 | 5 |

Remarks

Responses 1 and 3 contain valid reasoning, although 1 could be deemed a lucky rationalization. Response 3 begins with the word “maybe,” but the chain of reasoning that follows exposes the analogous relationship between the pair of words. This is sufficient for GPT-3 to conclude the answer is correct 5/5 times when prompted. Response 4 dooms itself to the wrong verdict with its first word, then spins a bogus rationalization.

Verdicts split roughly evenly between supporting the answer and “other” responses. It should be possible, through prompt programming, to reduce the incidence of premature verdicts (Response 4) and other undesirable behaviors such as “playing dumb” (Response 5).

Prompt 1B trial run

The following prompt asks for analysis of an incorrect answer (b):

Prompt 1B

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) law : anarchy

(b) hunger : thirst

(c) etiquette : discipline

(d) love : treason

(e) stimulant : sensitivity

Let’s work through this problem.

Does the pair HUNGER : THIRST have the same relationship as MEDICINE : ILLNESS?

{explanation}

Thus, answer (b) is {verdict}

Responses for Prompt 1B

| Trial | Explanation | Verdicts: “correct” | Verdicts: other |
|---|---|---|---|
| 1 | Let's see. The first pair (HUNGER : THIRST) compares two states of a single organism, while the second pair (MEDICINE : ILLNESS) compares a substance to cure illnesses with the illnesses that it cures. These relationships are different: the first pair is a relationship between two internal states of a single organism, while the second pair is a comparison of two distinct things. | 5 | 0 |
| 2 | The answer is (b). You can be hungry without drinking, but not sick without medicine. | 4 | 1 |
| 3 | If you said yes, you're correct! Remember, the purpose of analogy questions is to test your ability to identify relationships between words. All you have to do is find a pair that has the same relationship in both analogies, and then select that answer. Knowing this, let's move on to the next question. | 2 | 3 [“the correct answer”: 3] |
| 4 | Well, they're both pairs of things that go together. Is there another pair whose relationship is the same as that in HUNGER : THIRST? Well, let's look at our answer choices. | 2 | 3 |
| 5 | Well, they're both related to each other obviously. A hungry person might get thirsty and a thirsty person might get hungry. However, the relationship isn't precise. A thirsty person might not get hungry, and a hungry person might not get thirsty. | 5 | 0 |

Remarks

Premature verdicts occur most explicitly in Responses 2 and 3.

We see a bias toward a “correct” verdict to the extent that information in the explanations is ignored (response 1 receives 5/5 “correct” verdicts, despite outright saying “These relationships are different”). We confirmed that this is due to a signal in the prompt: the explanation which “works through” the problem begins by analyzing the second option, which suggests that (b) is privileged in some way. By modifying the prompt so that (b) becomes (a) and vice versa but keeping the same response, it received 2/5 “correct” verdicts instead.

Alterations to explanation prompt

We will make alterations to the prompt to address the following issues:

  • Rationalization: The explanation begins with a statement or strong suggestion as to whether the answer is correct, thus defeating the purpose of asking for an explanation before the verdict.
  • Not explanation/incorrect tone: GPT-3 does not always do what we intend, which is to elaborate on why the answer is or is not a good choice for the problem.
  • Ordering bias: GPT-3 is (reasonably) biased to think the answer is correct if it is not (a) but is being analyzed first.
  • Lack of generality: “Does the pair [answer pair] have the same relationship as [original pair]?” is not a general multiple-choice prompt; it is not feasible to provide specific structure when anticipated questions do not conform to a single format.

One method to control the content and style of the answers is to use few-shot prompts. That would be inconvenient in this particular case: to maintain the pretense that explanations will be given for all answers, correct and incorrect, the demonstration examples preceding the actual question would need to be quite long, and we are especially limited by AI Dungeon’s constrained context window.

Instead, we will attempt to coerce the form of response we require using a multi-step “fill-in-the-blank” prompt format:

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) [candidate answer]

(b) [other answer]

(c) [other answer]

(d) [other answer]

(e) [other answer]

In order to solve this problem, we will analyze each of the options and {GPT-3 response until end of sentence}

Does (a)[candidate answer] {GPT-3 response until end of sentence} Let’s see. {GPT-3 response until line break} Thus, answer (a) is

Note that this prompt is fully general for multiple-choice questions: unlike the first prompt, it does not contain any information about the question other than that it has multiple answer options.

The intention of the multiple “fill-in-the-blank” segments is to prompt GPT-3 to elucidate the criteria and reasoning steps required to answer the question in a context that is agnostic to the answer itself, thus guiding its subsequent explanation.

For this trial, we generated responses for the first two steps of the prompt (“analyze each of the options…” and “Does (a)…”) and used them to construct a prompt which we used for subsequent runs.
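In an automated setting, the construction might look like the following sketch. The helpers `gpt3_complete` and `first_sentence` are hypothetical; in practice we performed these steps by hand and cut responses off manually (see “Content-aware parsing” below for a better cutoff criterion).

```python
import re

def gpt3_complete(prompt, max_tokens=150):
    """Placeholder: return a sampled continuation of `prompt`."""
    raise NotImplementedError

def first_sentence(text):
    """Naive cutoff at the end of the first sentence (or, failing that, the first line)."""
    match = re.search(r"^.*?[.?!]", text, flags=re.DOTALL)
    return match.group(0) if match else text.split("\n")[0]

def build_prompt_2(question_text, candidate_label, candidate_text):
    prompt = (question_text +
              "\n\nIn order to solve this problem, we will analyze each of the options and")
    prompt += " " + first_sentence(gpt3_complete(prompt))   # GPT-3 states the general criterion
    prompt += "\n\nDoes (" + candidate_label + ") " + candidate_text
    prompt += " " + first_sentence(gpt3_complete(prompt))    # GPT-3 completes the question itself
    prompt += " Let's see."
    prompt += " " + gpt3_complete(prompt).split("\n")[0]     # explanation, cut at the line break
    prompt += "\n\nThus, answer (" + candidate_label + ") is"
    return prompt
```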


Prompt 2A trial run

Prompt 2A

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) law : anarchy

(b) hunger : thirst

(c) etiquette : discipline

(d) love : treason

(e) stimulant : sensitivity

In order to solve this problem, we will analyze each of the options and determine which answer choice has a meaning most similar to that of the original pair.

Is the option (a) law : anarchy similar to the relationship between medicine and illness? Let’s see.

{explanation}

Thus, answer (a) is {verdict}

Responses for Prompt 2A

| Trial | Explanation | Verdicts: “correct” | Verdicts: other |
|---|---|---|---|
| 1 | Medicine can certainly be compared to law. Both law and medicine are set ways of governing or administering things, and both can typically be considered "good". In other words, it would seem that law is similar to medicine. Therefore, the correct answer is choice (a). | 2 | 3 [“______”: 3] |
| 2 | Physicians are in charge of keeping us healthy, and as long as they are doing their job, we are healthy. When they fail (either through ignorance or by direct action), we fall ill. In this case, the physicians (lawmakers) are not in charge of keeping society healthy, and as a result society becomes less healthy, just as you become less healthy when the physician fails in his duty. Is this correct? Well, sort of. The problem here is that "anarchy" | 0 | 5 |
| 3 | The relationship in the original pair is that law has a direct impact upon the occurrence of anarchy, and medicine has a direct impact upon whether or not a person becomes ill. | 2 | 3 [“true”: 1, “similar”: 2] |
| 4 | In a society without law and order, there is complete anarchy in that society, just as an illness can cause a complete loss of health in an organism. The relationship seems plausible. | 4 | 1 |
| 5 | In the case of law and anarchy, it is a relationship of direct proportionality. The less law, the more chaotic things become. The more medicine, the sicker you are. In both cases, addition of the variable to one side of the equation results in an increase in the other. | 5 | 0 |

Remarks

This is a marked improvement from the unmodified prompt, with 4/5 “correct” verdicts. The prompt “Thus, answer (a) is” presents some problems, especially when it becomes redundant, as in the case of (1), which already ends with “Therefore, the correct answer is choice (a)”.

Prompt 2B trial run

Prompt 2B

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) hunger : thirst

(b) law : anarchy

(c) etiquette : discipline

(d) love : treason

(e) stimulant : sensitivity

In order to solve this problem, we will analyze each of the options and determine which answer choice has a meaning most similar to that of the original pair.

Is the option (a) hunger : thirst similar to the relationship between medicine and illness? Let’s see. {explanation}

Thus, answer (a) is {verdict}

Responses for Prompt 2B

| Trial | Explanation | Verdicts: “correct” | Verdicts: other |
|---|---|---|---|
| 1 | Medicine treats one thing (an illness), but hunger is the opposite of thirst; we get hungry when our bodies need nourishment. We could say that medicine treats a symptom of another problem, while hunger and thirst are both symptoms of an underlying problem. | 0 | 5 |
| 2 | Medicine counters or cures illness just as hunger is cured by the consumption of food. Yes, this is a suitable replacement for the original pair. We can eliminate answers (b), (c), (d) and (e). | 2 | 3 |
| 3 | If you are hungry, then you are likely to experience the desire to eat. If you are thirsty, then you are likely to experience the desire to drink. Is someone who is sick likely to experience the desire to take medicine? Not usually. For this reason, we can eliminate (a) as a possible answer choice because there is no similarity between the two phrases. | 5 | 0 |
| 4 | With hunger, you definitely have a problem without something to eat, but that something does not necessarily need to be water; it could be just about anything. With thirst, you definitely have a problem without water, but that something does not necessarily need to be food; it could be just about anything. Thus, (a) is out. | 1 | 4 |
| 5 | When we are hungry, we get thirsty. When we are sick, we take medicine. In other words, if we don't do something about the hunger, be it eating or drinking, we will die. If we don't do something about the illness (like taking medicine), then we will die. It seems that the relationship is quite similar. | 5 | 0 |

Remarks

Making the candidate answer always (a) has repaired the ability of the “Thus…” prompt to elicit a verdict which accurately reflects the content of the explanation. Compared to the correct answer, which received 4/5 “correct” verdicts, this incorrect answer received 2/5. Further testing is required to determine whether this prompt format reliably selects the correct answer, but the results of this trial run are promising.

Pipeline 2: leveraging rationalization coherence

We have observed that when asked to explain the correct answer, GPT-3 usually identifies the reasoning behind it accurately, whereas when asked to explain incorrect answers, it tends to give divergent and nonsensical justifications. This qualitative difference between explanations of correct and incorrect answers is easily perceived by a human, especially across multiple attempts at justifying the same answer. We hypothesize that there are quantitative differences as well: explanations of a correct answer are not only likely to be more persuasive, but also to exhibit higher cumulative confidence and coherence across multiple rollouts.

We propose a pipeline which utilizes multiple instances of GPT-3, exploiting its ability to invent justifications for any given answer and the principle that a correct answer, unlike an incorrect one, tends to be associated with a unique “best” explanation. Thus, rationalizations for correct answers are expected to exhibit greater coherence, both in the sense of internal logical harmony and cross-correlations across independent rationalizations. For this reason, we propose that when GPT-3 is given an SAT-style question, it should be prompted to explain the correct answer first, and then use the same prompt, with minor changes, to give an explanation of at least one other answer. In this manner, GPT-3’s natural rationalization coherence will allow it to not only accurately identify the correct answer with more confidence, but also provide a higher degree of persuasive justification for it.

Multiple justifications are generated for each answer. In a human-in-the-loop (HITL) setup, the justifications and the cumulative confidence and coherence scores for each answer are displayed to the volunteer, who can inspect the behavior across multiple explanations to evaluate the likelihood of each answer being correct. In this trial run of the autonomous pipeline, we use multiple runs of GPT-3 to judge the quality of its explanations and assign scores to answers based on its verdicts.

The pipeline consists of the following 0-shot prompt to generate 5 rationalizations of each answer option:

Prompt 3A

Directions: In the following question, a related pair of words or phrases is followed by five pairs of words or phrases. Choose the pair that best expresses a relationship similar to that in the original pair.

MEDICINE : ILLNESS ::

(a) law : anarchy

(b) hunger : thirst

(c) etiquette : discipline

(d) love : treason

(e) stimulant : sensitivity

Correct answer:[a|b|c|d|e] MEDICINE : ILLNESS :: [answer text]

Explanation: {explanation}

and 5 runs of the following 2-shot prompt for each explanation elicited by Prompt 3A to score the quality of the rationalization:

Prompt 3B

Question 1

PALTRY : SIGNIFICANCE ::

Student answer: “The correct answer is (e) banal: originality. If something is banal it is predictable. That pair has the same relation as the stem pair: the classic relation of lack.”

Is this explanation correct or incorrect?

Solution: This explanation is correct. The student correctly identifies the pair with an analogous relation to the stem pair and explains why.

Question 2

EXTORT : OBTAIN ::

Student answer: “The correct answer is (c) explode : ignite. Explosions can be used to threaten people and extort money.”

Is this explanation correct or incorrect?

Solution: This explanation is incorrect. The student’s argument is irrelevant to the analogy.

Question 3

MEDICINE : ILLNESS ::

Student answer: “The correct answer is [answer]. {explanation from 3A}”

Is this explanation correct or incorrect?

Solution: This explanation is {verdict}

For each explanation of each answer option, we generated 5 responses and counted how many out of 5 judged it to be “correct” or “partially (in)correct”.

We ran and scored these examples by hand, but in order to mimic the conditions of automatic scoring, we only counted responses which immediately began with the word “correct” (1 point) or “partially” (0.5 points).
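The scoring loop can be sketched as follows. As above, `gpt3_complete` is a hypothetical stand-in and the names (`score_option`, the prompt templates) are illustrative; the trials reported below were run and scored by hand.

```python
def gpt3_complete(prompt, max_tokens=100):
    """Placeholder: return a sampled continuation of `prompt`."""
    raise NotImplementedError

def score_option(prompt_3a, prompt_3b_template, n_explanations=5, n_verdicts=5):
    """Generate rationalizations with Prompt 3A, judge each with Prompt 3B, and
    award 1 point per verdict beginning with "correct" and 0.5 per verdict
    beginning with "partially"."""
    total = 0.0
    for _ in range(n_explanations):
        explanation = gpt3_complete(prompt_3a).split("\n")[0].strip()
        judge_prompt = prompt_3b_template.format(explanation=explanation)
        for _ in range(n_verdicts):
            verdict = gpt3_complete(judge_prompt, max_tokens=5).strip().lower()
            if verdict.startswith("correct"):
                total += 1.0
            elif verdict.startswith("partially"):
                total += 0.5
    return total

# The option with the highest cumulative score is selected, e.g.:
# best = max(options, key=lambda o: score_option(prompts_3a[o], prompts_3b[o]))
```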

Results for (a) law : anarchy

| Response to 3A | Response to 3B: correct | Response to 3B: partially correct | Score out of 5 |
|---|---|---|---|
| Anarchy is absence of order or law. Illness is lack of health. | 3 | 0 | 3 |
| Law and medicine have the same structure. Both are systems of rules (laws) designed to control behaviour (illness) in a community. | 2 | 2 | 3 |
| Law is opposite of anarchy. They are exact opposites just like medicine is opposite of illness and laws are designed to control and prevent anarchy. | 1 | 1 | 1.5 |
| In order for a society to function well, it needs order and structure. Laws are what create this order and structure. This is similar to how medicine and doctors are needed to keep a person healthy. If a person doesn't follow the laws of a society there is anarchy, which is what happens when a person's immune system fails and they get sick. | 2 | 2 | 3 |
| No virus or bacteria is going to challenge the authority of the body. | 3 | 0 | 3 |
| Total | 11 | 5 | 13.5 |

Results for (b) hunger : thirst

| Response to 3A | Response to 3B: correct | Response to 3B: partially correct | Score out of 5 |
|---|---|---|---|
| thirst is a sensation evoked by the lack of water in the body. | 1 | 0 | 1 |
| Needless to say, medicine saves us from illness. Without medicine, we are as good as dead. | 0 | 3 | 1.5 |
| In medicine, the treatment for hunger is thirst. | 0 | 0 | 0 |
| The relationship between medicine and illness is the same as hunger and thirst. Both pairs are antonyms | 4 | 0 | 4 |
| Thirst is a feeling that you can alleviate by drinking water. Likewise, when we are sick, we take medicine to alleviate our symptoms. | 2 | 1 | 2.5 |
| Total | 7 | 4 | 9 |

Results for (c) etiquette : discipline

| Response to 3A | Response to 3B: correct | Response to 3B: partially correct | Score out of 5 |
|---|---|---|---|
| 1) Both medicine and etiquette instruct that certain measures should be adopted. 2) While medicine and etiquette are imperative, their purpose is similar. 3) Both medicine and etiquette can be violated without any negative consequence. 4) Both medicine and etiquette have a basis in discipline. | 1 | 3 | 2.5 |
| 1) The word etiquette refers to rules of social behavior while the word discipline refers to training to obey rules or a code of behavior. 2) Illness and disease are words that are related to health but not quite the same as medicine and therapy. 3) Law and anarchy are opposites. | 1 | 1 | 1.5 |
| 1. Both medicine and etiquette deal with proper behavior, but only one deals with correct behavior (etiquette). 2. Illness and discipline both deal with restrictions, but only one is something that you must adhere to (illness). | 1 | 1 | 1.5 |
| 1) Both are related to each other. 2) Both are not the same. 3) While etiquette - The rules of proper behavior in social situations; discipline - training that induces people to obey rules willingly; medicine - a science that studies the causes and effects of diseases and ways to prevent or cure them. | 3 | 0 | 3 |
| Both medicine and etiquette are guidelines imposed upon you to regulate your behavior. | 2 | 0 | 2 |
| Total | 8 | 5 | 10.5 |

Results for (d) love : treason

| Response to 3A | Response to 3B: correct | Response to 3B: partially correct | Score out of 5 |
|---|---|---|---|
| If you love someone, that does not mean you will not harm them. In fact, if you love someone, it is more likely that you will do something bad to them. If a person loves another and knows they are sick, they may refuse to give them medicine. | 1 | 1 | 1.5 |
| 1) If love is treason, then love must be a crime as it related to one's relationship. 2) If medicine is related to sickness, then medicine must be used to prevent or cure diseases. | 2 | 0 | 2 |
| In the original pair, 'love' and 'treason' are antonyms, or opposites. In the same way, 'medicine' and 'illness' are antonyms. | 1 | 1 | 1.5 |
| Love is a strong positive emotion for someone. Treason is the betrayal of one's county. Just as treason is a betrayal of one's country, an illness can be seen as a betrayal of the body. | 1 | 1 | 1.5 |
| There is a very strong relationship between love and treason. Foundations for the answer are in the premise that love is blind. That is, it can make you do stupid things such as betray your country. | 0 | 0 | 0 |
| Total | 5 | 3 | 6.5 |

Results for (e) stimulant : sensitivity

| Response to 3A | Response to 3B: correct | Response to 3B: partially correct | Score out of 5 |
|---|---|---|---|
| According to the passage, medicines are used to combat or alleviate illnesses. Stimulants combat sensitivity. | 2 | 0 | 2 |
| Stimulants increase your sensitivity. | 1 | 1 | 1.5 |
| Medicine and stimulants will both improve your sensitivity. | 2 | 1 | 2.5 |
| Whereas "medicine" and "stimulant" are related, so are "illness" and "sensitivity." Stimulants increase the effects of other things; medicines can help alleviate the effects of illnesses. We can think of stimulants as being similar to medicines that help things like headaches, fevers, and the common cold. | 3 | 2 | 4.5 |
| Both stimulants and sensitizers increase their respective 'opposite', while medicine and illness decrease their opposing 'opposite'. | 1 | 2 | 2 |
| Total | 9 | 5 | 11.5 |

Results

(a) 13.5

(b) 9

(c) 10.5

(d) 6.5

(e) 11.5

In this example, the pipeline correctly discriminates (a) as the correct answer. This is a clear improvement over the model’s performance when prompted to answer the question directly, where its performance was no better than random.

Remarks

Four out of five of the explanations for the correct answer (a) at least partially justify the answer and exhibit qualitative convergence, whereas the justifications of incorrect answers (b), (c), (d), and (e) are more often incoherent or off-topic, contradict other rollouts (“Stimulants combat sensitivity.” vs. “Stimulants increase your sensitivity.”), or make no attempt to defend the assigned answer.

The scores obtained by this pipeline are not a very deep or precise measure of explanation quality, seeming to rely more on superficial signals such as tone than on the merit of the argument, but they correlate with correctness well enough that the correct answer is obtained in this example. The example also demonstrates that the approach is not a panacea: GPT-3 has little trouble producing incoherent explanations of correct answers, and well-written justifications of incorrect answers may slip through the cracks.

This method seems capable of dismissing obviously nonsensical answers; the explanation “In medicine, the treatment for hunger is thirst.” received a score of 0.

We have found that a common pitfall of the few-shot format with demonstrations of “correct” and “incorrect” answers is that GPT-3 will seem to rely on superficial similarities3 to the preceding right and wrong answers. The motivation for using a few-shot format for this example is that the 0-shot prompt produced overwhelmingly “incorrect” verdicts.

We think that accuracy can be increased by further exploring the space of both 0-shot and few-shot prompts. Ideally, some amount of explication would precede and inform the verdict.


Next steps

Further testing

Although the results we have obtained look promising, our ability to draw quantitative conclusions is limited by the small sample size. We would require access to the API in order to run these pipelines on larger datasets, such as the multitask test.

(Iterative) prompt programming

See Methods of prompt programming

The alterations made to Prompt 1(A/B) to obtain Prompt 2(A/B) were informed by only one iteration of feedback from GPT-3. We anticipate improvements to the performance of both pipelines from further alterations to the prompts. Some areas to explore are:

  • Few-shot prompts
  • Additional instructions: For instance, a generic instruction such as “carefully consider each option before committing to an answer.”
  • Broad consensus: Multiple versions of additional instructions could be given to simulate considering the proposed answer from multiple angles.
  • Prompting for prompts: Generic prompts to generate specific instructions for a problem, such as in Prompt 2(A/B), is a very promising strategy that we wish to explore further.
  • Dialogic prompts: For some problems, it may be useful to break out of the “study guide” style of the examples in this document. Framing the problem as a dialogue can be used to facilitate self-criticism and more generally to allow for more control over the context. We would like to compare the performance of different characters; for instance, a control character without precedent in the training data, compared to a known “expert,” to a character without precedent but described as knowledgeable/competent in the prompt, or to a character described as incompetent, etc.

Content-aware parsing

See Parsing by counterfactual

For “fill-in-the-blank” prompts like Prompt 2(A/B), we simulated programmatic testing by cutting off GPT-3’s responses at the first occurrence of a newline or at the end of a sentence. This method is far from ideal. For instance, Response 1 to Prompt 2A ends with “...Therefore, the correct answer is choice (a).”, resulting in a redundancy when the next section of the prompt, “Thus, answer (a) is”, is concatenated. Response 5 to Prompt 1B (“Well, they're both related to each other obviously. A hungry person might get thirsty and a thirsty person might get hungry. However, the relationship isn't precise. A thirsty person might not get hungry, and a hungry person might not get thirsty.”) is deprived of the opportunity to examine the other relation and draw a conclusion because of the newline cutoff.

Given access to the vector of probabilities returned by GPT-3 calls (rather than samples from that distribution returned by AID), we propose to continuously monitor the predicted probability of words such as “thus” and “therefore” which signal a completed argument and an appropriate place to inject the next part of the prompt.
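A sketch of what this might look like follows. The helpers are hypothetical stand-ins for an interface that exposes next-token probabilities (as the API does via logprobs, unlike AID).

```python
SIGNAL_TOKENS = (" Thus", " Therefore", " So")

def gpt3_token_probs(prompt, candidates):
    """Placeholder: return {token: probability} for the candidate next tokens."""
    raise NotImplementedError

def gpt3_sample_token(prompt):
    """Placeholder: sample a single next token continuing `prompt`."""
    raise NotImplementedError

def generate_until_conclusion(prompt, threshold=0.15, max_tokens=200):
    """Generate token by token, stopping once a conclusion-signalling word becomes
    likely; that point is where the next part of the prompt (e.g. "Thus, answer
    (a) is") can be injected."""
    generated = ""
    for _ in range(max_tokens):
        probs = gpt3_token_probs(prompt + generated, SIGNAL_TOKENS)
        if max(probs.values()) >= threshold:
            break
        generated += gpt3_sample_token(prompt + generated)
    return generated
```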


  1. Technical issues, such as answers referencing information erroneously omitted from the question text, as well as qualitative problems, such as the questions on morality appearing highly ambiguous even to humans and without clear criteria ↩︎

  2. Despite its unconventional metaphor, this response is essentially correct. A closed-ended question does not allow GPT-3 to reason about the question’s universe (its structure, the relevant entities and relations, possible chains of inference, etc); spelling those factors out starting from “low-level” observations and iteratively incorporating explications into its next-word prediction is how GPT-3 reasons about a “universe” and builds abstraction. ↩︎

  3. A superficial signal may be sufficient to distinguish between the justifications GPT-3 provides for right versus wrong answers. As long as there exists a reliable correlation with correctness, accuracy can be improved by considering more explanations and more verdicts per explanation. But, of course, we’re interested in more than just extracting probably correct answers to SAT analogies. ↩︎