This is an appendix for Quantifying Curation, containing proofs of some of the results used there, further exploration of the theory of optimizers, and discussion of special cases mentioned in the post.
I apologize if these demonstrations are less understandable or elegant than traditional treatments. This is very simple math but connected to deep ideas in learning and optimization; it was fun for me to do this without reference.
Bits and probability
1) Selection and sampling
1.1) Given n options, it takes exactly log2(n) bits to specify a single choice
Imagine bitstrings: m bits can distinguish between 2m possibilities. So in order to specify a choice of one out of n options, you need log2(n) bits.
For example, 2 bits can encode 4 options:
00 : a
01 : b
10 : c
11 : d
1.2) Probability of an event in n samples without replacement is n * p
Say that a probability distribution ƒX assigns an event x (and) probability p. If we sample n times from the distribution,
The first sample will be event x with probability p.
With probability 1-p, it will not be x. In the latter case, x stays in the pool, and the total remaining probability mass
1.2) E[selecting from n partitions of ƒX] == E[selecting from n random samples from ƒX]
2) Proof: log2(Q) bits of optimization == magnifying probability by Q
We will first consider a simple curator, CuratorS, who acts on GPT and optimizes for only one particular event, S, and always acts optimally to increase the likelihood of S.
CuratorS is a function which takes the function GPT as an input, and can sample from GPT and choose between those samples. It can only “speak” through GPT; that is, it must return an event that is generated by GPT, though it may select which among multiple samples to return.
2.1) log2(Q) bits => magnifying probability by Q
Say that CuratorS is allowed to sample GPT Q times and chooses optimally.
We have shown that this has the same expectation for S as choosing between Q random partitions of GPT() and choosing optimally. By 1.1, it takes log2(Q) decisions to make that selection between Q partitions. Choosing between Q random partitions will magnify the probability of S by Q. Thus, the Curator making log2(Q) optimal decisions results in a magnification of the probability of S by Q.
2.2) magnifying probability by Q => log2(Q) bits
Here I will prove that a CuratorS which magnifies the probability of S by Q can always be implemented using log2(Q) optimal binary decisions between choices derived from GPT.
Call p the probability of S according to GPT() and p' the probability of S according to CuratorS.
If the sample was drawn from a partition of GPT() which contained 1/Q of its total probability mass, and S was in this distribution, S would be Q times more likely than if sampled directly from GPT. So if p * Q = p', then Q = p' / p.
So Q = p' / p is the number of options the Curator must choose between in order to magnify the probability of S from p to p' (a factor of Q). It takes log2(Q) optimal decisions to select one of Q partitions. Thus, a CuratorS which magnifies the probability of S by Q can be implemented using log2(Q) decisions between choices derived from GPT.
2.1 and 2.2 together imply that a CuratorS magnifies the probability of S by a factor of Q if and only if it can be implemented using log2 optimal binary decisions between samples of GPT.
3) Optimizing for utility functions
A more complex curator, CuratorU
Optimization is time-reversed inference
Rationalization is when the algorithm for optimization is used when the one for inference should be used. …
What about satisficers?
These proofs and discussions are unfinished or still have open questions I haven’t solved.
What if log2(n) is not a whole number?
How many bits does it take to specify 1 out of 3 events?
One possible answer is that you need 2 bits, because 1 is clearly not enough. But is it really correct to say that the same number of bits are needed to represent a choice out of 4 as a choice out of 3? If we detach from the base-2 system, it sure seems intuitively like less information should be needed to distinguish between less choices.
Imagine you used this scheme to represent the equiprobable events a, b, and c:
00 : a
01 : b
10 : c
11 : c
The second bit is redundant 1/3 of the time, when the event is c. The first bit, however, always provides information. If we think of “the number of bits needed to specify a choice” as “the number of yes/no questions needed to specify a choice”, then what is the sequence of questions that is will most efficiently locate the answer?
If we first ask “is the first bit 0 or 1?”, there is a 2/3 chance the answer will be 0, and then we’ll have to ask another question. But there’s a 1/3 chance the answer is 1, and we have our answer without having to ask another question. Using this strategy, the expectation for the number of questions needed to locate the answer is
2/3 * 2 + 1/3 * 1 = 5/3 = 1.66666…
In information theory, the correct answer is considered to be log2(3), which is 1.58496… This measure of the amount of information does not privilege the base 2 system, and can be converted into any other base, such as loge or ln to give us the hypothetical amount og nats it would take to represent the event.
I haven’t yet found a way to obtain an expectation equal to log2(n) binary questions for n choices, although the scheme that seems optimal to me (the one described above) gets suspiciously close to it.
Until I can find a resolution, I will place my trust in information theory and assume elsewhere that it makes sense to say that log2(n) bits are needed to represent a choice between n decisions.