posted on 2022-08-20 — also cross-posted on lesswrong, see there for comments

(this post has been written for the second Refine blog post day. thanks to vanessa kosoy, adam shimi, sid black, artaxerxes, and paul bricman for their feedback.)

PreDCA: vanessa kosoy's alignment protocol

in this post, i try to give an overview of vanessa kosoy's new alignment protocol, Precursor Detection, Classification and Assistance or PreDCA, as she describes it in a recent youtube talk.

keep in mind that i'm not her and i could totally be misunderstanding her video or misfocusing on what the important parts are supposed to be.

the gist of it is: the goal of the AI should be to assist the user by picking policies which maximize the user's utility function. to that end, we characterize what makes an agent and its utility function, then detect agents which could potentially be the user by looking for precursors to the AI, and finally we select a subset of those which likely contains the user. all of this is enabled by infra-bayesian physicalism, which allows the AI to reason about what the world is like and what the results of computations are.

the rest of this post is largely a collection of mathematical formulas (or informal suggestions) defining those concepts and tying them together.

an important aspect of PreDCA is that the mathematical formalisms are theoretical ones which could be given to the AI as-is, not necessarily specifications as to what algorithms or data structures should exist inside the AI. ideally, the AI could just figure out what it needs to know about them, to what degree of certainty, and using what computations.

the various pieces of PreDCA are described below.

infra-bayesian physicalism, in which an agent has a hypothesis Θ ∈ □(Φ×Γ) (note that the □ is actually a square character, not a glyph your computer is failing to render) where:

vanessa emphasizes that infra-bayesian physicalist hypotheses are described "from a bird's eye view" as opposed to being agent-centric, which helps with embedded agency: the AI has guesses as to what the whole world is like, which just happens to contain itself somewhere. in a given hypothesis, the AI is simply described as a part of the world, same as any other part.

next, a measure of agency is defined: a "g-factor" g(G|U) for a given agent G and a given utility function (or loss function) U, defined as g(G|U) = -log(Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)]) where

so g(G|U) measures how good agent G is at satisfying a given utility function U.
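to make the definition above concrete, here is a toy monte carlo sketch of the g-factor under made-up assumptions: policies are just numbers, the prior ξ is uniform sampling, and all the names (utility, policy_prior, agent_policy) are illustrative stand-ins rather than anything from vanessa's formalism:

```python
import math
import random

def g_factor(utility, agent_policy, policy_prior, n_samples=10_000, seed=0):
    """monte carlo sketch of the g-factor: -log2 of the probability that a
    policy sampled from the prior ξ scores at least as well as the agent's
    actual policy G* under utility U. all names here are illustrative."""
    rng = random.Random(seed)
    agent_score = utility(agent_policy)
    hits = sum(utility(policy_prior(rng)) >= agent_score
               for _ in range(n_samples))
    # add-one smoothing keeps the estimate finite when no sampled policy matches
    return -math.log2((hits + 1) / (n_samples + 1))

# toy example: policies are numbers in [0, 1], utility rewards being near 0.9,
# and the "agent" plays 0.88, better than the vast majority of random policies
utility = lambda pi: -abs(pi - 0.9)
g = g_factor(utility, agent_policy=0.88, policy_prior=lambda rng: rng.random())
```

a competent agent gets a high g-factor (few random policies beat it), while a random or inert one gets a g-factor near zero.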

given g(G|U), we can infer the probability that an agent G has a given utility function U, as Pr[U] ∝ 2^(-K(U)) / Pr π∈ξ [U(⌈G⌉,π) ≥ U(⌈G⌉,G*)], where ∝ means "is proportional to" and K(U) is the kolmogorov complexity of utility function U.

so an agent G probably has utility function U if it's relatively good at satisfying that utility function and if that utility function is relatively simple — we penalize arbitrarily complex utility functions notably to avoid hypotheses such as "woah, this table is really good at being the exact table it is now" (a complete description of the world would be an extremely complex utility function).

we also get the ability to detect which programs are agents — or more precisely, how agenty a given program is: the agentyness of a program G with utility function U is g(G|U) - K(U), its g-factor minus the complexity of its utility function.
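a tiny sketch of that scoring rule, with entirely made-up numbers: the true K(U) is uncomputable, so the complexity_bits values below are just hypothetical stand-ins (e.g. a description length), as is the whole candidate list:

```python
def agency_score(g_factor, complexity_bits):
    """sketch of the agency measure from the post: g(G|U) - K(U).
    complexity_bits stands in for the uncomputable kolmogorov complexity
    K(U); in practice one would use a computable proxy such as the
    length of the shortest known program computing U."""
    return g_factor - complexity_bits

# hypothetical candidates: (utility function, g-factor in bits, K(U) in bits)
candidates = [
    ("maximize paperclips", 12.0, 20.0),
    ("be the exact table it is now", 50.0, 5000.0),  # huge K(U) kills this one
    ("reach the goal square", 9.0, 15.0),
]
best = max(candidates, key=lambda c: agency_score(c[1], c[2]))
```

note how the "table" hypothesis scores a high g-factor but is crushed by its complexity penalty, matching the table example above.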

"computationalism and counterfactuals": given a belief Θ ∈ □(Φ×Γ), the AI can test whether it thinks the world contains a given program by examining the following counterfactual: "if that program returned a different result than it actually does, would the world look different?"

for example, we can consider the AKS prime number testing algorithm. let's say AKS(2^82589933-1) returns TRUE. we can ask "if it returned FALSE instead, would the universe — according to our computational hypothesis about it — look different?" if it would look different, then that means that someone or something in the world is running the program AKS(2^82589933-1).
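the AKS example can be caricatured in a few lines. everything below is a toy stand-in: the "world model" is just a function from an assignment of program outputs to a description of the world, which is nothing like an actual infra-bayesian physicalist hypothesis:

```python
def runs_program(world_model, program_id, outputs):
    """counterfactual test sketch: a program 'runs in the world' if varying
    its output changes what the world looks like. world_model maps an
    assignment of program outputs to a (hashable) world description."""
    worlds = {world_model({program_id: out}) for out in outputs}
    return len(worlds) > 1

# toy world model: a mathematician writes down whatever AKS returns,
# so flipping the program's output changes the world
world = lambda assign: ("notebook says prime" if assign["AKS"]
                        else "notebook says composite")
# a world model in which nothing depends on the program's output
inert_world = lambda assign: "nothing here depends on AKS"
```

in the first world the counterfactual output matters, so the program is "running" somewhere; in the second it doesn't, so it isn't.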

to offer a higher-level example: if we were to know the true name of suffering, described as a program, then we can test whether the world contains suffering by asking a counterfactual: let's say that every time suffering happened, a goldfish appeared (somehow as an output of the suffering computation). if that were the case, would the world look different? if it would, then it contains suffering.

this ability to determine which programs are running in the world, coupled with the ability to measure how agenty a given program is, lets us find what agents exist in the world.

agentic causality: to determine whether an agent H's executed policy H* can causally affect another agent G, we can ask whether, if H had executed a different policy π≠H*, the agent G would receive different inputs. we can apparently get an information-theoretic measure of how impactful H* is on agent G by determining how much mutual information there is between H* and G's input.
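the mutual-information part of that idea can be sketched with a plain empirical estimator over (policy, input) samples. this is a standard textbook computation, not anything specific to PreDCA's formalism, and the sample lists are invented:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """sketch: empirical mutual information (in bits) between H's executed
    policy and G's observed input, from a list of (policy, input) samples.
    high mutual information suggests H's choice of policy impacts G."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

# H's policy fully determines G's input: 1 bit of mutual information
coupled = [("left", "saw-left"), ("right", "saw-right")] * 50
# G's input ignores H's policy entirely: 0 bits
ignored = [("left", "same"), ("right", "same")] * 50
```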

precursor detection: we say that an agent H is a precursor of agent G if, counterfactually, H could have prevented G from existing by executing a policy which is different from its actual policy H*.

we can now start to build a definition that lets the AI detect and then classify who its user is.

user detection: the AI is trying to determine who its precursor program could be. but, given a hypothesis for "the thing producing these policies is the precursor", there are infinitely many different programs which could output the observed policies. so we choose the one which is the most agenty, using the function described above: g(H|U) - K(U).
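here's a toy sketch of that selection step. the candidate "programs" are just dictionaries with invented g and K values, and the infinitely many consistent programs of the real problem are reduced to a finite list; the point is only the shape of the argmax over g(H|U) - K(U):

```python
def detect_user(programs, observed_actions):
    """sketch of user detection: among candidate programs whose outputs
    match the observed policy, pick the most agenty one by the score
    g(H|U) - K(U). all fields here are toy values."""
    consistent = [p for p in programs if p["outputs"] == observed_actions]
    return max(consistent, key=lambda p: p["g"] - p["K"]) if consistent else None

# hypothetical candidates: a giant lookup table also reproduces the observed
# actions, but its enormous description length makes it far less agenty
programs = [
    {"name": "planner",      "outputs": ["a", "b"], "g": 30.0, "K": 40.0},
    {"name": "lookup-table", "outputs": ["a", "b"], "g": 30.0, "K": 900.0},
    {"name": "random-walk",  "outputs": ["b", "b"], "g": 1.0,  "K": 10.0},
]
user = detect_user(programs, ["a", "b"])
```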

note that while we extrapolate the user's actions into the future, the user is defined as an instant-agent which precedes the AI's existence, such that the physical person's actual future actions do not change what utility function the AI should try to maximize. this stops the AI from influencing the user's utility function: we define the user strictly in the past, causally outside of the AI's light-cone. the AI is maximizing the utility function of the instant-user which caused its existence, not that of the continuously existing user-over-time.

user classification: for each potential precursor hypothesis, we have now selected a program that models them and their respective utility functions. we then eliminate some hypotheses as to what the user could be — notably to avoid acausal attacks by remote aliens or counterfactual demons — using the following criteria:

finally, we end up with a hopefully small set of hypotheses as to who the user could be; at that point, we simply compose their utility functions, perhaps weighted by the infra-distribution of each of those hypotheses. this composition is the utility function that the AI should want to maximize, by selecting policies which maximize the utility that the world would have if they were enacted, to the best of the AI's ability to evaluate.
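a minimal sketch of that composition step, using plain scalar credences as a crude stand-in for weighting by the infra-distribution; the two hypothesized utility functions are made up:

```python
def composed_utility(hypotheses):
    """sketch: combine the utility functions of the surviving user
    hypotheses into one objective, weighting each by the credence the AI
    assigns to that hypothesis (a crude stand-in for weighting by the
    infra-distribution)."""
    total = sum(w for w, _ in hypotheses)
    return lambda world: sum((w / total) * u(world) for w, u in hypotheses)

# two surviving hypotheses about the user's utility function (toy values)
u = composed_utility([
    (0.7, lambda w: w["apples"]),   # hypothesis 1: user likes apples
    (0.3, lambda w: -w["noise"]),   # hypothesis 2: user dislikes noise
])
```

the AI would then pick the policy whose predicted world maximizes this composed u.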

vanessa tells us how far along her protocol is, as a collection of pieces that have been completed to various degrees — in her talk's diagram, green parts have seen some progress, purple parts not as much. "informal PreDCA" is the perspective that she provides in her talk and which is hopefully conveyed by this post.

finally, some takeaways from this informal PreDCA perspective:

my own opinion is that PreDCA is a very promising perspective. it offers, if not full "direct alignment", at least a bunch of pieces that might be of significant use to general AI risk mitigation.

