## Bits and probabilities in optimization and inference

This is an appendix for Quantifying Curation, containing proofs of some of the results used there, further exploration of the theory of optimizers, and discussion of special cases mentioned in the post.

I apologize if these demonstrations are less understandable or elegant than traditional treatments. This is very simple math, but it connects to deep ideas in learning and optimization; it was fun for me to work it out without reference.

## Bits and probability

### 1) Selection and sampling

#### 1.1) Given n options, it takes exactly log_{2}(n) bits to specify a single choice

Imagine bitstrings: m bits can distinguish between 2^{m} possibilities. So in order to *specify* a choice of one out of n options, you need log_{2}(n) bits.

For example, 2 bits can encode 4 options:

00 : a

01 : b

10 : c

11 : d
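The counting argument can be sketched mechanically (a toy check of my own; the option names match the table above):

```python
from math import log2

# m bits distinguish 2**m possibilities; here m = 2 encodes 4 options.
options = ["a", "b", "c", "d"]
codes = {format(i, "02b"): opt for i, opt in enumerate(options)}
# codes == {"00": "a", "01": "b", "10": "c", "11": "d"}

bits_needed = log2(len(options))  # log_2(4) = 2.0 bits to specify one option
```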

(WIP: see “What if log_{2}(n) is not a whole number?” below.)

#### 1.2) Probability of an event appearing in n samples without replacement is n * p

Say that a probability distribution ƒ_{X} assigns an event x probability p. If we sample n times from the distribution:

The first sample will be event x with probability p.

With probability 1-p, it will not be x. In the latter case, x stays in the pool, and the total remaining probability mass shrinks, so x's conditional probability on the next draw only increases. When the draws are over equiprobable events (such as equal-mass partitions of ƒ_{X}), the probability that x appears somewhere in the n draws is exactly n * p; more generally, summing over draws gives at most n * p, with near-equality whenever n * p is small.
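A quick numeric check of the n * p approximation (assuming independent draws for simplicity, since the exact chance of seeing x at least once in n such draws is 1 - (1-p)^n):

```python
from math import isclose

p, n = 0.001, 10
exact = 1 - (1 - p) ** n   # P(x appears at least once in n independent draws)
approx = n * p             # the n * p approximation

assert exact <= approx                       # union bound: never exceeds n * p
assert isclose(exact, approx, rel_tol=0.01)  # within 1% when n * p is small
```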

#### 1.3) E[selecting from n partitions of ƒ_{X}] == E[selecting from n random samples from ƒ_{X}]

(Sketch: randomly partitioning the probability mass of ƒ_{X} into n equal-mass cells and drawing one sample from each cell reproduces, in expectation, the process of drawing n independent samples from ƒ_{X}. So an optimal selection among n random samples has the same expectation as an optimal selection among n random equal-mass partitions.)

### 2) Proof: log_{2}(Q) bits of optimization == magnifying probability by Q

We will first consider a simple curator, Curator_{S}, who acts on GPT and optimizes for only one particular event, S, and always acts optimally to increase the likelihood of S.

Curator_{S} is a function which takes the function GPT as an input, and can sample from GPT and choose between those samples. It can only “speak” through GPT; that is, it must return an event that is generated by GPT, though it may select which among multiple samples to return.
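As a concrete sketch (the helper names and the toy `gpt` sampler are my own, not from the post), Curator_{S} might look like:

```python
import random

def make_gpt(dist, rng):
    """Toy stand-in for GPT: samples one event from a fixed distribution."""
    events, weights = zip(*dist.items())
    return lambda: rng.choices(events, weights=weights)[0]

def curator_s(gpt, s, q):
    """Sample gpt q times; return s if it appeared, else some other sample.

    The curator can only "speak" through gpt: it always returns one of
    the samples gpt actually produced.
    """
    samples = [gpt() for _ in range(q)]
    return s if s in samples else samples[0]

rng = random.Random(0)
gpt = make_gpt({"a": 0.9, "b": 0.09, "S": 0.01}, rng)
out = curator_s(gpt, "S", 4)
assert out in {"a", "b", "S"}  # the output is always an event GPT generated
```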

#### 2.1) log_{2}(Q) bits => magnifying probability by Q

Say that Curator_{S} is allowed to sample GPT Q times and chooses optimally.

We have shown that this has the same expectation for S as choosing optimally between Q random equal-mass partitions of GPT(). By 1.1, it takes log_{2}(Q) binary decisions to make that selection between Q partitions. Choosing optimally means picking the partition containing S whenever one does; since that partition holds only 1/Q of the total probability mass, conditioning on it magnifies the probability of S by Q. Thus, the Curator making log_{2}(Q) optimal decisions results in a magnification of the probability of S by Q.
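Putting numbers to the argument (p and Q here are arbitrary choices of mine):

```python
from math import log2, isclose

p = 0.001   # P(S) under GPT
Q = 8       # number of equal-mass partitions / samples the Curator gets

# S lives in exactly one partition holding 1/Q of the mass; conditioning
# on that partition rescales S's probability by a factor of Q.
p_conditioned = p / (1 / Q)
assert isclose(p_conditioned, p * Q)

bits = log2(Q)   # binary decisions needed to select one of Q partitions (1.1)
assert bits == 3.0
```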

#### 2.2) magnifying probability by Q => log_{2}(Q) bits

Here I will prove that a Curator_{S} which magnifies the probability of S by Q can always be implemented using log_{2}(Q) optimal binary decisions between choices derived from GPT.

Call p the probability of S according to GPT() and p' the probability of S according to Curator_{S}.

If the sample were drawn from a partition of GPT() containing 1/Q of its total probability mass, and S were in that partition, S would be Q times more likely than if sampled directly from GPT. So p' = p * Q, and hence Q = p' / p.

So Q = p' / p is the number of options the Curator must choose between in order to magnify the probability of S from p to p' (a factor of Q). It takes log_{2}(Q) optimal decisions to select one of Q partitions. Thus, a Curator_{S} which magnifies the probability of S by Q can be implemented using log_{2}(Q) decisions between choices derived from GPT.
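A worked instance of this direction (the particular p and p' are illustrative values I chose so the arithmetic is exact):

```python
from math import log2

p = 1 / 1024        # P(S) under GPT()
p_prime = 1 / 64    # P(S) under Curator_{S}

Q = p_prime / p     # number of equal-mass partitions to choose between
bits = log2(Q)      # optimal binary decisions needed to make that choice

assert Q == 16.0
assert bits == 4.0  # magnifying by 16 costs log_2(16) = 4 decisions
```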

2.1 and 2.2 together imply that a Curator_{S} magnifies the probability of S by a factor of Q if and only if it can be implemented using log_{2}(Q) optimal binary decisions between samples of GPT.

### 3) Optimizing for utility functions

A more complex curator, Curator_{U}, optimizes for a utility function U over events rather than for a single event S.

…
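A minimal sketch of what a utility-maximizing curator might look like (the function names and the toy utility are my own, not from the post): it scores each sample with U and keeps the best.

```python
def curator_u(gpt, utility, q):
    """Sample gpt q times and return the sample with the highest utility."""
    return max((gpt() for _ in range(q)), key=utility)

# Illustrative usage with a deterministic stand-in for gpt:
samples = iter([2, 7, 3, 5])
best = curator_u(lambda: next(samples), utility=lambda x: -abs(x - 6), q=4)
assert best == 7  # the sample closest to the target value 6
```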

## Other

### Optimization is time-reversed inference

Rationalization is when the algorithm for optimization is used where the one for inference *should* be.
…

## Questions

…

### What about satisficers?

…

## WIP

These proofs and discussions are unfinished or still have open questions I haven’t solved.

### What if log_{2}(n) is not a whole number?

How many bits does it take to specify 1 out of *3* events?

One possible answer is that you need 2 bits, because 1 is clearly not enough. But is it really correct to say that the *same* number of bits is needed to represent a choice out of 4 as a choice out of 3? If we detach from the base-2 system, it sure seems intuitively like less *information* should be needed to distinguish between fewer choices.

Imagine you used this scheme to represent the equiprobable events a, b, and c:

00 : a

01 : b

10 : c

11 : c

The second bit is redundant 1/3 of the time, when the event is c. The first bit, however, always provides information. If we think of “the number of bits needed to specify a choice” as “the number of yes/no questions needed to specify a choice”, then what is the sequence of questions that will most efficiently locate the answer?

If we first ask “is the first bit 0 or 1?”, there is a 2/3 chance the answer will be 0, and then we’ll have to ask another question. But there’s a 1/3 chance the answer is 1, and we have our answer without having to ask another question. Using this strategy, the *expectation* for the number of questions needed to locate the answer is

2/3 * 2 + 1/3 * 1 = 5/3 = 1.66666…

In information theory, the correct answer is considered to be log_{2}(3), which is 1.58496… This measure of the *amount of information* does not privilege the base-2 system, and can be converted into any other base; using base e (the natural logarithm, ln) gives the hypothetical number of *nats* it would take to represent the event.
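The two quantities can be compared directly (just restating the arithmetic above in code):

```python
from math import log2

# Expected yes/no questions under the scheme above: answering "1" on the
# first question identifies c immediately; answering "0" needs one more.
expected_questions = (2 / 3) * 2 + (1 / 3) * 1   # 5/3 = 1.666...

entropy = log2(3)                                # 1.58496... bits

assert expected_questions > entropy  # close to the bound, but still above it
```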

I haven’t yet found a way to obtain an expectation equal to log_{2}(n) binary questions for n choices, although the scheme that seems optimal to me (the one described above) gets suspiciously close to it.

Until I can find a resolution, I will place my trust in information theory and assume elsewhere that it makes sense to say that log_{2}(n) bits are needed to represent a choice between n options.