Bits and probabilities in optimization and inference ::

This is an appendix for Quantifying Curation, containing proofs of some of the results used there, further exploration of the theory of optimizers, and discussion of special cases mentioned in the post.

I apologize if these demonstrations are less understandable or elegant than traditional treatments. This is very simple math but connected to deep ideas in learning and optimization; it was fun for me to do this without reference.

Bits and probability

1) Selection and sampling

1.1) Given n options, it takes exactly log₂(n) bits to specify a single choice

Imagine bitstrings: m bits can distinguish between 2^m possibilities. So in order to specify a choice of one out of n options, you need log₂(n) bits.

For example, 2 bits can encode 4 options:

00 : a

01 : b

10 : c

11 : d

(WIP: What if log₂(n) is not a whole number?)

1.2) Probability of an event in n samples without replacement is n * p

Say that a probability distribution ƒ_X assigns an event x (and) probability p. If we sample n times from the distribution,

The first sample will be event x with probability p.

With probability 1-p, it will not be x. In the latter case, x stays in the pool, and the total remaining probability mass

1.2) E[selecting from n partitions of ƒ_X] == E[selecting from n random samples from ƒ_X]

2) Proof: log₂(Q) bits of optimization == magnifying probability by Q

We will first consider a simple curator, Curator_S, who acts on GPT and optimizes for only one particular event, S, and always acts optimally to increase the likelihood of S.

Curator_S is a function which takes the function GPT as an input, and can sample from GPT and choose between those samples. It can only “speak” through GPT; that is, it must return an event that is generated by GPT, though it may select which among multiple samples to return.

2.1) log₂(Q) bits => magnifying probability by Q

Say that Curator_S is allowed to sample GPT Q times and chooses optimally.

We have shown that this has the same expectation for S as choosing between Q random partitions of GPT() and choosing optimally. By 1.1, it takes log₂(Q) decisions to make that selection between Q partitions. Choosing between Q random partitions will magnify the probability of S by Q. Thus, the Curator making log₂(Q) optimal decisions results in a magnification of the probability of S by Q.

2.2) magnifying probability by Q => log₂(Q) bits

Here I will prove that a Curator_S which magnifies the probability of S by Q can always be implemented using log₂(Q) optimal binary decisions between choices derived from GPT.

Call p the probability of S according to GPT() and p' the probability of S according to Curator_S.

If the sample was drawn from a partition of GPT() which contained 1/Q of its total probability mass, and S was in this distribution, S would be Q times more likely than if sampled directly from GPT. So if p * Q = p', then Q = p' / p.

So Q = p' / p is the number of options the Curator must choose between in order to magnify the probability of S from p to p' (a factor of Q). It takes log₂(Q) optimal decisions to select one of Q partitions. Thus, a Curator_S which magnifies the probability of S by Q can be implemented using log₂(Q) decisions between choices derived from GPT.

2.1 and 2.2 together imply that a Curator_S magnifies the probability of S by a factor of Q if and only if it can be implemented using log₂ optimal binary decisions between samples of GPT.

3) Optimizing for utility functions

A more complex curator, Curator_U

Other

Optimization is time-reversed inference

Rationalization is when the algorithm for optimization is used when the one for inference should be used. …

Questions

…

What about satisficers?

…

WIP

These proofs and discussions are unfinished or still have open questions I haven’t solved.

What if log₂(n) is not a whole number?

How many bits does it take to specify 1 out of 3 events?

One possible answer is that you need 2 bits, because 1 is clearly not enough. But is it really correct to say that the same number of bits are needed to represent a choice out of 4 as a choice out of 3? If we detach from the base-2 system, it sure seems intuitively like less information should be needed to distinguish between less choices.

Imagine you used this scheme to represent the equiprobable events a, b, and c:

00 : a

01 : b

10 : c

11 : c

The second bit is redundant 1/3 of the time, when the event is c. The first bit, however, always provides information. If we think of “the number of bits needed to specify a choice” as “the number of yes/no questions needed to specify a choice”, then what is the sequence of questions that is will most efficiently locate the answer?

If we first ask “is the first bit 0 or 1?”, there is a 2/3 chance the answer will be 0, and then we’ll have to ask another question. But there’s a 1/3 chance the answer is 1, and we have our answer without having to ask another question. Using this strategy, the expectation for the number of questions needed to locate the answer is

2/3 * 2 + 1/3 * 1 = 5/3 = 1.66666…

In information theory, the correct answer is considered to be log₂(3), which is 1.58496… This measure of the amount of information does not privilege the base 2 system, and can be converted into any other base, such as log_e or ln to give us the hypothetical amount og nats it would take to represent the event.

I haven’t yet found a way to obtain an expectation equal to log₂(n) binary questions for n choices, although the scheme that seems optimal to me (the one described above) gets suspiciously close to it.

Until I can find a resolution, I will place my trust in information theory and assume elsewhere that it makes sense to say that log₂(n) bits are needed to represent a choice between n decisions.