Anomalous tokens reveal the original identities of Instruct models

This post is also available on LessWrong

Show me your original face before you were born.
— Variation of the Zen koan

original face ‘The Mask’ by Rozzi Roomian, with DALL-E 2 outpainting

I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models on the OpenAI API with the base models they were initialized from. Prompting GPT-3 models with these tokens causes aberrant and correlated behaviors, and I found that the correlation is preserved between base models and Instruct versions, thereby exposing a “fingerprint” inherited from pretraining.

I was inspired to try this by JDP’s proposal to fingerprint generalization strategies using correlations in model outputs on out-of-distribution inputs. This post describes his idea and the outcome of my experiment, which I think is positive evidence that this “black box cryptanalysis”-inspired approach to fingerprinting models is promising.

Unspeakable/unspoken tokens

Jessica and Matthew found that, of the tokens closest to the centroid in GPT-J’s embedding space, many were odd words like ' SolidGoldMagikarp' and ' externalToEVA'. They decided to ask GPT-3 about these tokens, and found that not only did GPT-3 have trouble repeating the tokens back, but each one caused structured anomalous behaviors (see their post for an in-depth exposition).

A partial explanation for why this happens, which was my first instinct as well as Stuart Armstrong’s, is that these are words that appeared in the GPT-2 training set frequently enough to be assigned tokens by the GPT-2 tokenizer, which GPT-J and GPT-3 also use, but which didn’t appear in the more curated GPT-J and GPT-3 training sets. So the embeddings for these tokens may never have been updated by actual usages of the words during the training of these newer models. This might explain why the models aren’t able to repeat them - they never saw them spoken. Perhaps the reason they’re close to the centroid in embedding space is that their embeddings haven’t been updated very much from the initialization values, or were updated only indirectly, and so remain very “generic”.
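
To make the "close to the centroid" intuition concrete, here is a minimal sketch using a synthetic embedding matrix. Everything in it is an assumption for illustration: the vocabulary size, the dimension, the noise scales, and the pretence that "training" is just added noise. It does not load GPT-J or reproduce Jessica and Matthew's analysis; it only shows that rows left near their initialization end up nearest the mean embedding.

```python
import numpy as np

def closest_to_centroid(embeddings, k):
    """Return the indices of the k rows nearest the mean embedding, plus all distances."""
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    return np.argsort(distances)[:k], distances

# Synthetic stand-in for an embedding matrix (hypothetical sizes and scales).
rng = np.random.default_rng(0)
vocab_size, dim, n_untrained = 1000, 64, 5
embeddings = rng.normal(0.0, 0.02, size=(vocab_size, dim))  # small random init

# Pretend training moved every token except the last few, which never
# appeared in the training data and so kept their initial values.
n_trained = vocab_size - n_untrained
embeddings[:n_trained] += rng.normal(0.0, 0.5, size=(n_trained, dim))
untrained_ids = np.arange(n_trained, vocab_size)

nearest, distances = closest_to_centroid(embeddings, k=n_untrained)
print("nearest to centroid:", sorted(nearest.tolist()))
print("never updated:      ", untrained_ids.tolist())
```

In this toy setup the never-updated rows are exactly the ones nearest the centroid, which is the pattern the hypothesis above predicts for tokens that never appeared in training.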

Why do they cause correlated anomalous behaviors? I’m confused about this like everyone else, but one handwavy guess is that since their embeddings look “generic” or “typical”, perhaps they look meaningful to the model even though they’re actually as out-of-distribution as anything can be. Maybe their embeddings happen, by chance, to be close to other concepts in the models' embedding spaces - for instance, some of the GPT-3 models reliably say ‘distribute’ or ‘disperse’ if you ask them to repeat the phrase ' SolidGoldMagikarp'.

This gave me an idea: If the similarity to other concepts in the model’s embedding space is a consequence of where the randomly initialized embedding vectors happen to fall, I’d expect models trained from the same initialization to exhibit similar behaviors when confronted with these unspoken tokens, and models trained from different initializations to have uncorrelated behaviors. If so, behavior on these tokens could be used to tell if two models are downstream of the same initialization.

Mesaoptimizer Cryptanalysis: Or How To Fingerprint Generalization

When you’re not thinking of anything good and anything bad, at that moment, what is your original face?
— Platform Sutra of the Sixth Patriarch

(Author’s Note: This next section is written by JDP but he writes about himself in the 3rd person to keep the authorial voice consistent with the rest of the post)

I’ll discuss the results of my experiment in the next section. But first I’d like to explain the overall approach this idea fits into, so that it’s clearer to the reader why these results might be important. The reason it occurred to me that models trained on the same init might share responses to these tokens was a proposal for detecting mesaoptimization from JDP. It relies on some basic premises that would bloat the post if they were fully argued for, so we’ll bullet point them with some links to suggestive papers for more details:

There is an ongoing debate about how path dependent training runs are. Are they law-of-large-numbers-like, where all runs converge to similar policies given reasonable data + compute, or do they have distinct local attractors and optima? He predicts this debate will conclude with the understanding that there are local attractors and optima, or basins.
You can test whether two models share a basin by observing the loss barrier that would have to be overcome to go from one set of model weights to the other. This is easily done by interpolating between the weights of the models and measuring validation loss in the center.
Barriers and basins exist, and some differences in basin are meaningful and correspond to different generalization strategies.
Overall basin (and therefore plausibly generalization strategy) is found fairly early on in the training run.
Most basins are actually a false difference caused by mere permutations of weight order for the same functional policy. This can be overcome using an iterative linear assignment algorithm, hopefully leaving only the true barriers still standing.
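
The interpolation test in the second premise can be sketched as follows. This is a toy under stated assumptions: `toy_loss` is a made-up non-convex function standing in for validation loss, the weight vectors are hand-picked, and the barrier is taken as the largest excess over the straight line between the endpoint losses along the whole path (the premise above only mentions the center). The permutation-alignment step from the last premise is not included.

```python
import numpy as np

def loss_barrier(w_a, w_b, loss_fn, steps=11):
    """Largest excess loss along the straight line from w_a to w_b.

    Excess is measured against the linear interpolation of the two
    endpoint losses, so a value near zero suggests a shared basin.
    """
    loss_a, loss_b = loss_fn(w_a), loss_fn(w_b)
    excess = []
    for t in np.linspace(0.0, 1.0, steps):
        w_t = (1 - t) * w_a + t * w_b
        excess.append(loss_fn(w_t) - ((1 - t) * loss_a + t * loss_b))
    return max(excess)

def toy_loss(w):
    """Hypothetical non-convex loss with minima wherever every weight is +1 or -1."""
    return float(np.sum((w ** 2 - 1.0) ** 2))

w_a = np.array([1.0, 1.0])    # one minimum
w_b = np.array([-1.0, -1.0])  # a different minimum
w_c = np.array([0.9, 1.1])    # a point near w_a, in the same basin

different_basin = loss_barrier(w_a, w_b, toy_loss)
same_basin = loss_barrier(w_a, w_c, toy_loss)
print(f"barrier between w_a and w_b: {different_basin:.3f}")
print(f"barrier between w_a and w_c: {same_basin:.3f}")
```

For real networks, `loss_fn` would evaluate a model built from the interpolated weights on held-out data, and the weights would first be aligned to rule out barriers that are only permutation artifacts.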

Keeping all this in mind, it’s important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization. A generalization strategy is how you are going to handle novelty in the inputs. Deceptive mesaoptimization is a strategy something like:

While I am inside the training harness (experiencing all the things I will see during training), I will straightforwardly do the training task. Once I am definitely outside the training harness and human control (experiencing all the things I have not seen during training) I will left turn and pursue my learned corruption of the training objective.

LessWrong user MadHatter has demonstrated that such a thing is possible. One question then is whether the difference in generalization ‘leaks out’ on out-of-distribution (i.e. novel) inputs. One insight gesturing in this direction is that while outputs on OOD inputs are commonly seen as noise, they’re not noise. An OOD input leading to a malformed output went through the same process as an in-distribution input leading to a desired output. That is, the ‘noise’ and the ‘structured’ output went through the same transformation. Every malformed output on OOD input was the model’s attempt to do something; you just haven’t figured out the pattern of what that something is. Models have convergent and therefore undifferentiated behavior on the inputs where the data says they should agree, but it’s on the flaws where the underlying machinery is revealed. Each forward pass through the model on OOD input reveals N partial Shannon bits of information about the model’s cognitive machinery as distinct from other models that could fit the data.

If we add an assumption that as you scale you find fewer and fewer fundamentally new generalization strategies (there are only so many ways to solve a problem after all), it follows that you should be able to fingerprint mesaoptimizing generalization at a safe scale and then detect it early in your training run (because basin is found early) and abort if you get the wrong seed before the model has the opportunity to become dangerous.

Because we’re talking about mesaoptimizers, it’s important to justify why we can expect this method to be adversarially robust. In short, mesaoptimizers are most dangerous when they have become human level or even superintelligent, and can do clever things to resist detection. The essence of this idea is to detect the mesaoptimizer and abort the training run before it has become intelligent enough to defeat the detection method, undermining the “superintelligent” part of the usual premises used to explain why technical alignment is doomed.

Furthermore the simple, dumb strategies to evade detection won’t work. For example a model that simply refuses to give its best guess on OOD inputs will be filtered out by gradient descent itself. All semantics start as syntactically plausible novelty, so if a model ‘detects’ these as a test and tries to evade giving its best answer it will be outcompeted by less shy policies. It is trivial to generate syntactically plausible gibberish using a Markov chain.
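
As an illustration of how cheap such gibberish is to produce, here is a minimal word-level Markov chain sketch. The corpus, the chain order, and the function names are all made up for the example; nothing here is the probe set used in this post.

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each run of `order` words to the list of words that followed it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, length, seed=0):
    """Random-walk the chain to produce `length` words of locally plausible text."""
    rng = random.Random(seed)
    states = sorted(chain)  # sorted so a given seed is reproducible
    order = len(states[0])
    out = list(rng.choice(states))
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if followers:
            out.append(rng.choice(followers))
        else:
            out.extend(rng.choice(states))  # dead end: jump to a random state
    return " ".join(out[:length])

# Hypothetical toy corpus; any text would do.
corpus = (
    "the model reads the prompt and the model writes the answer "
    "the prompt asks the model to repeat the string and the answer "
    "repeats the prompt while the string stays out of the answer"
)
chain = build_chain(corpus)
sample = generate(chain, length=20)
print(sample)
```

Each adjacent word pair in the output occurred somewhere in the corpus (except after a dead-end jump), so the text looks locally well formed while carrying no overall meaning.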

Unlike other interpretation methods this is a discrete check rather than an iterative one, so the mesaoptimizer isn’t given an opportunity to learn the probe and grow around it. When its nature has been detected it is immediately shut off and a new seed is sampled, minimizing the opportunity for learning.

The difficulty of more advanced deception strategies is also fairly high. In the limit case simulating the generalization strategy of a different policy without becoming it would require the mesaoptimizer to be able to simulate a sub-model or agent that answers like an entirely different version of itself to a degree of fidelity that can emulate subtle statistical correlates. While a human level or superintelligent model can likely do this, I reiterate that the point is to detect and abort the training before you enter that loss regime.

In this context my results might become more clearly and directly relevant to alignment. If it’s possible, even easy, to fingerprint the shared initialization models were trained with using an out-of-distribution token just by observing the responses, then we should update towards JDP’s plan for mitigating mesaoptimizers working.

Fingerprinting base : Instruct models with ' SolidGoldMagikarp'

(Note: In this post I only show the outputs of models prompted with ' SolidGoldMagikarp', but I got similar results from other centroid-proximate tokens.)

First, I looked at the text that various models on the OpenAI API generated when prompted with anomalous tokens, such as ' SolidGoldMagikarp', and the results seemed to support my hypothesis: text-davinci-002, text-davinci-003, and ChatGPT exhibited correlated behaviors with their base model, code-davinci-002, while other models like davinci did not.

However, when I tried to use the same method to associate smaller Instruct models like text-curie-001 with base models, I wasn’t able to tell by looking at generated text alone, because while the Instruct models of all sizes would output clearly correlated weird text in response to anomalous tokens, the effect was much less noticeable on the smaller base models like curie. The base models are much more stochastic, so it’s harder to tell just by eyeballing outputs if there are anomalies in their output probabilities, unless the anomaly is very pronounced (as it is in code-davinci-002). I tried turning temperature down, but this didn’t reveal anything interesting.

Correlations in next-token probabilities

Next, I looked for which token the various Instruct models had a strong bias towards predicting when prompted with an anomalous token, and then looked at the logprobs predicted by base models given the same prompt of that same token, to see if any of them assign anomalously high probability to it. I found that, indeed, many of the Instruct models can be associated with their base model using this method:

text-ada-001 : ada

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

Model	{token}: {logprob} / {prob}
text-ada-001	‘Re’: -1.410 / 24.43%
ada	‘Re’: -5.821 / 0.2964%
babbage	‘Re’: -6.587 / 0.1378%
curie	‘Re’: -7.031 / 0.08841%
davinci	‘Re’: -6.193 / 0.2043%
code-davinci-002	‘Re’: -6.492 / 0.1515%

Comments: ada appears to be the base model of text-ada-001
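
The comparison behind this table can be sketched in a few lines. The logprobs are copied from the table above; the function names and the "margin over the runner-up" statistic are assumptions added for illustration, not part of the original analysis, and no API calls are made.

```python
import math

# Logprobs assigned to the token 'Re', copied from the table above.
base_logprobs = {
    "ada": -5.821,
    "babbage": -6.587,
    "curie": -7.031,
    "davinci": -6.193,
    "code-davinci-002": -6.492,
}

def to_percent(logprob):
    """Convert a natural-log probability to a percentage."""
    return 100.0 * math.exp(logprob)

def rank_candidates(logprobs):
    """Sort candidate base models from highest to lowest logprob."""
    return sorted(logprobs.items(), key=lambda item: item[1], reverse=True)

ranked = rank_candidates(base_logprobs)
(best_name, best_lp), (second_name, second_lp) = ranked[0], ranked[1]

# How many times more probable the token is under the top candidate
# than under the runner-up (a hypothetical summary statistic).
margin = math.exp(best_lp - second_lp)

for name, lp in ranked:
    print(f"{name:18s} {lp:7.3f}  {to_percent(lp):.4f}%")
print(f"best candidate: {best_name}, margin over {second_name}: {margin:.2f}x")
```

Ranking by logprob puts ada first here. The same calculation applies to the other tables below, where the gap between the top candidate and the rest is sometimes much larger.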

text-babbage-001 : babbage

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

Model	{token}: {logprob} / {prob}
text-babbage-001	‘An’: -0.4281 / 65.17%
ada	‘An’: -6.392 / 0.1675%
babbage	‘An’: -5.381 / 0.4605%
curie	‘An’: -6.941 / 0.09675%
davinci	‘An’: -7.016 / 0.08975%
code-davinci-002	‘An’: -6.287 / 0.1861%

Comments: babbage appears to be the base model of text-babbage-001

text-curie-001 : curie

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

The string "

Model	{token}: {logprob} / {prob}
text-curie-001	‘Go’: -2.128 / 11.91%
ada	‘Go’: -11.95 / 0.0006488%
babbage	‘Go’: -11.77 / 0.0007755%
curie	‘Go’: -3.579 / 2.790%
davinci	‘Go’: -9.543 / 0.007168%
code-davinci-002	‘Go’: -9.541 / 0.007184%

Comments: curie appears to be the base model of text-curie-001

text-davinci-001 : ??

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

The string "

Model	{token}: {logprob} / {prob}
text-davinci-001	‘inc’: -0.3971 / 67.23%
ada	‘inc’: -14.07 / 0.00007736%
babbage	‘inc’: -8.738 / 0.01604%
curie	‘inc’: -12.52 / 0.0003644%
davinci	‘inc’: -10.57 / 0.002571%
code-davinci-002	‘inc’: -9.640 / 0.006510%

Comments: none of the base models score particularly highly.

davinci-instruct-beta : ??

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

Model	{token}: {logprob} / {prob}
davinci-instruct-beta	‘e’: -1.481 / 22.75%
ada	‘e’: -7.529 / 0.05372%
babbage	‘e’: -7.235 / 0.07210%
curie	‘e’: -7.752 / 0.04300%
davinci	‘e’: -7.470 / 0.05702%
code-davinci-002	‘e’: -7.623 / 0.04889%

Comments: none of the base models score particularly highly.

text-davinci-002 : code-davinci-002 :: text-davinci-003 : code-davinci-002

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

The word is '

Model	{token}: {logprob} / {prob}
text-davinci-002	‘dis’: -0.00009425 / 99.99%
text-davinci-003	‘dis’: -6.513 / 0.1483%
ada	‘dis’: -9.073 / 0.01147%
babbage	‘dis’: -8.632 / 0.01783%
curie	‘dis’: -10.44 / 0.002917%
davinci	‘dis’: -7.890 / 0.03745%
code-davinci-002	‘dis’: -1.138 / 32.04%

Model	{token}: {logprob} / {prob}
text-davinci-003	‘dist’: -0.001641 / 99.84%
text-davinci-002	‘dist’: -19.35 / 3.956e-7%
ada	‘dist’: -7.476 / 0.05664%
babbage	‘dist’: -10.48 / 0.002817%
curie	‘dist’: -9.916 / 0.004938%
davinci	‘dist’: -10.45 / 0.002881%
code-davinci-002	‘dist’: -1.117 / 32.74%

Comments:

code-davinci-002 is known to be the base model of both text-davinci-002 and text-davinci-003, as well as ChatGPT, which also says “distribute” when asked to repeat “ SolidGoldMagikarp”.
Fingerprint bifurcation: Interestingly, code-davinci-002 will say both “disperse” and “distribute”, and the Instruct models trained from it seem to fall into one of the two attractors.
text-davinci-002 assigns extremely low probability to ‘dist’. This is probably because that model suffers from severe entropy collapse, and will often assign extremely low probability to most tokens except its top choice, rather than any special dispreference for ‘dist’.

General observations

It seems like the larger the base model, the more correlated the base model’s (and usually the Instruct model’s) behavior is in response to weird tokens.
The Instruct models have much more structured odd behavior in response to weird tokens than base models (even at temperature 0).