 posted on 2023-06-10 — also cross-posted on lesswrong, see there for comments

this post was written by Tamsin Leake at Orthogonal.
thanks to Julia Persson and mesaoptimizer for their help putting it together.

## an Evangelion dialogue explaining the QACI alignment plan

this post explains the justification for, and the math formalization of, the QACI plan for formal-goal alignment. you might also be interested in its companion post, formalizing the QACI alignment formal-goal, which just covers the math in a more straightforward, bottom-up manner.

#### 1. agent foundations & anthropics

🟣 misato — hi ritsuko! so, how's this alignment stuff going?

🟡 ritsuko — well, i think i've got an idea, but you're not going to like it.

🟢 shinji — that's exciting! what is it?

🟡 ritsuko — so, you know how in the sequences and superintelligence, yudkowsky and bostrom talk about how hard it is to fully formalize something which leads to nice things when maximized by a utility function? so much so that it serves as an exercise to think about one's values and consistently realize how complex they are?

🟡 ritsuko — ah, yes, the good old days when we believed this was the single obstacle to alignment.

🔴 asuka barges into the room and exclaims — hey, check this out! i found this fancy new theory on lesswrong about how "shards of value" emerge in neural networks!

🔴 asuka then walks away while muttering something about eiffel towers in rome and waluigi hyperstition…

🟡 ritsuko — indeed. these days, all these excited kids running around didn't learn about AI safety by thinking really hard about what agentic AIs would do — they got here by being spooked by large language models, and as a result they're thinking in all kinds of strange directions, like what it means for a language model to be aligned or how to locate natural abstractions for human values in neural networks.

🟢 shinji — of course that's what we're looking at! look around you, turns out that the shape of intelligence is RLHF'd language models, not agentic consequentialists! why are you still interested in those old ideas?

🟡 ritsuko — the problem, shinji, is that we can't observe agentic AI being published before alignment is solved. when someone figures out how to make AI consequentialistically pursue a coherent goal, whether by using current ML technology or by building a new kind of thing, we die shortly after they publish it.

🟣 misato — wait, isn't that anthropics? i'd rather stay away from that type of thinking, it seems too galaxybrained to reason about…

🟡 ritsuko — you can't really do that either — the "back to square one" interpretation of anthropics, where you don't update at all, is still an interpretation of anthropics. it's kind of like being the kind of person who, when observing having survived quantum russian roulette 20 times in a row, assumes that the gun is broken rather than saying "i guess i might have low quantum amplitude now" and fails to realize that the gun can still kill them — which is bad when all of our hopes and dreams rest on those assumptions. the only vaguely anthropics-ignoring perspective one can take about this is to ignore empirical evidence and stick to inside view, gears-level prediction of how convergent agentic AI tech is.

🟣 misato — …is it?

🟡 ritsuko — of course it is! on inside view, all the usual MIRI arguments hold just fine. it just so happens that if you keep running a world forwards, and select only for worlds that we haven't died in, then you'll start observing stranger and stranger non-consequentialist AI. you'll start observing the kind of tech we get when we just dumbly scale up bruteforce-ish methods like machine learning and you observe somehow nobody publishing insights as to how to make those systems agentic or consequentialistic.

🟢 shinji — that's kind of frightening!

🟡 ritsuko — well, it's where we are. we already thought we were small in space, now we know that we're also small in probabilityspace. the important part is that it doesn't particularly change what we should do — we should still try to save the world, in the most straightforward fashion possible.

🟣 misato — so all the excited kids running around saying we have to figure out how to align language models or whatever…

🟡 ritsuko — they're chasing a chimera. impressive LLMs are not what we observe because they're what powerful AI looks like — they're what we observe because they're what powerful AI doesn't look like. they're there because that's as impressive as you can get short of something that kills everyone.

🟣 misato — i'm not sure most timelines are dead yet, though.

🟡 ritsuko — we don't know if "most" timelines are alive or dead from agentic AI, but we know that however many are dead, we couldn't have known about them. if every AI winter was actually a bunch of timelines dying, we wouldn't know.

🟣 misato — you know, this doesn't necessarily seem so bad. considering that confused alignment people is what's caused the appearance of the three organizations trying to kill everyone as fast as possible, maybe it's better that alignment research seems distracted with things that aren't as relevant, rather than figuring out agentic AI.

🟡 ritsuko — you can say that alright! there's already enough capability hazards being carelessly published everywhere as it is, including on lesswrong. if people were looking in the direction of the kind of consequentialist AI that actually determines the future, this could cause a lot of damage. good thing there's a few very careful people here and there, studying the right thing, but being very careful by not publishing any insights. but this is indeed the kind of AI we need to figure out if we are to save the world.

🟢 shinji — whatever kind of anthropic shenanigans are at play here, they sure seem to be saving our skin! maybe we'll be fine because of quantum immortality or something?

🟣 misato — that's not how things work shinji. quantum immortality explains how you got here, but doesn't help you save the future.

🟢 shinji sighs, with a defeated look on his face — …so we're back to the good old MIRI alignment, we have to perfectly specify human values as a utility function and figure out how to align AI to it? this seems impossible!

🟡 ritsuko — well, that's where things get interesting! now that we're talking about coherent agents whose actions we can reason about, agents whose instrumentally convergent goals such as goal-content integrity would be beneficial if they were aligned, agents who won't mysteriously turn bad eventually because they're not yet coherent agents, we can actually get to work putting something together.

🟣 misato — …and that's what you've been doing?

🟡 ritsuko — well, that's kind of what agent foundations had been about all along, and what got rediscovered elsewhere as "formal-goal alignment": designing an aligned coherent goal and figuring out how to make an AI that is aligned to maximizing it.

#### 2. embedded agency & untractability

🟢 shinji — so what's your idea? i sure could use some hope right now, though i have no idea what an aligned utility function would even look like. i'm not even sure what kind of type signature it would have!

🟡 ritsuko smirks — so, the first important thing to realize is that the challenge of designing an AI that emits output which saves the world can be formulated like this: design an AI trying to solve a mathematical problem, and make the mathematical problem be analogous enough to "what kind of output would save the world" that the AI, by solving it, happens to also save our world.

🟢 shinji — but what does that actually look like?

🟣 misato — maybe it looks like "what output should you emit, which would cause your predicted sequence of stimuli to look like a nice world?"

🟡 ritsuko — what do you think actually happens if an AI were to succeed at this?

🟣 misato — oh, i guess it would hack its stimuli input, huh. is there even a way around this problem?

🟡 ritsuko — what you're facing is a facet of the problem of embedded agency. you must make an AI which thinks about the world which contains it, not just about a system that it feels like it is interacting with.

🟡 ritsuko — the answer — as in PreDCA — is to model the world from the top-down, and ask: "look into this giant universe. you're in there somewhere. which action should the you-in-there-somewhere take, for this world to have the most expected utility?"

🟢 shinji — expected utility? by what utility function?

🟡 ritsuko — we're coming to it, shinji. there are three components to this: the formal-goal-maximizing AI, the formal-goal, and the glue in-between. embedded agency and decision theory are parts of this glue, and they're core to how we think about the whole problem.

🟣 misato — and this top-down view works? how the hell would it compute the whole universe? isn't that uncomputable?

🟡 ritsuko — how the hell do you expect AI would have done expected utility maximization at all? by making reasonable guesses. i can't compute the whole universe from the big-bang up to you right now, but if you give me a bunch of math which i'd understand to say "in worlds being computed forwards starting at some simple initial state and eventually leading to this room right now with shinji, misato, ritsuko in it, what is shinji more likely to be thinking about: his dad, or the pope's uncle?", i can give you a reasonable answer.

🟡 ritsuko — on the one hand, the question is immensely computationally expensive — it asks to compute the entire history of the universe up to this shinji! but on the other hand, it is talking about a world which we inhabit, and about which we have the ability to make reasonable guesses. if we build an AI that is smarter than us, you can bet it'll be able to make guesses at least as well as this.

🟣 misato — i'm not convinced. after all, we relied on humans to make this guess! of course you can guess about shinji, you're a human like him. why would the AI be able to make those guesses, being the alien thing that it is?

🟡 ritsuko — i mean, one of its options is to ask humans around. it's not like it has to do everything by itself on its single computer, here — we're talking about the kind of AI that agentically saves the world, and has access to all kinds of computational resources, including humans if needed. i don't think it'll actually need to rely on human compute a lot, but the fact that it can serves as a kind of existence proof for its ability to produce reasonable solutions to these problems. not optimal solutions, but reasonable solutions — eventually, solutions that will be much better than any human or collection of humans could be able to come up with short of getting help from aligned superintelligence.

🟢 shinji — but what if the worlds that are actually described by such math are not in fact this world, but strange alien worlds that look nothing like ours?

🟡 ritsuko — yes, this is also part of the problem. but let's not keep moving the goalpost here. there are two problems: make the formal problem point to the right thing (the right shinji in the right world), and make an AI that is good at finding solutions to that problem. both seem like we can solve them with some confidence; but we can't just keep switching back and forth between the two.

🟡 ritsuko — if you have to solve two problems A and B, then you have to solve A assuming B is solved, and then solve B assuming A is solved. then, you've got a pair of solutions which work with one another. here, we're solving the problem of whether an AI would be able to solve this problem, assuming the problem points to the right thing; later we'll talk about how to make the problem point to the right thing assuming we have an AI that can solve it.

🟢 shinji — are there any actual implementation ideas for how to build such a problem-solving AI? it sure sounds difficult to me!

🟣 misato, carefully peeking into the next room — hold on. i'm not actually quite sure who's listening — it is known that capabilities people like to lurk around here.

🟤 kaji can be seen standing against a wall, whistling, pretending not to hear anything.

🟡 ritsuko — right. one thing i will reiterate, is that we should not observe a published solution to "how to get powerful problem-solving AI" before the world is saved. this is in the class of problems which we die shortly after a solution to it is found and published, so our lack of observing such a solution is not much evidence for its difficulty.

#### 3. one-shot AI

🟡 ritsuko — anyways, to come back to embedded agency.

🟣 misato — ah, i had a question. the AI returns a first action which it believes would overall steer the world in a direction that maximizes its expected utility. and then what? how does it get its observation, update its model, and take the next action?

🟡 ritsuko — well, there are a variety of clever schemes to do this, but an easy one is to just not.

🟣 misato — what?

🟡 ritsuko — to just not do anything after the first action. i think the simplest thing to build is what i call a "one-shot AI", which halts after returning an action. and then we just run the action.

🟢 shinji — "run the action?"

🟡 ritsuko — sure. we can decide in advance that the action will be a linux command to be executed, for example. the scheme does not really matter, so long as the AI gets an output channel which has pretty easy bits of steering the world.

🟣 misato — hold on, hold on. a single action? what do you intend for the AI to do, output a really good pivotal act and then hope things get better?

🟡 ritsuko — have a little more imagination! our AI — let's call it AI₀ — will almost certainly return a single action that builds and then launches another, better AI, which we'll call AI₁. a powerful AI can absolutely do this, especially if it has the ability to read its own source-code for inspiration, but probably even without that.

🟡 ritsuko — …and because it's solving the problem "what action would maximize utility when inserted into this world", it will understand that AI₁ needs to have embedded agency and the various other aspects that are instrumental to it — goal-content integrity, robustly delegating RSI, and so on.

🟢 shinji — "RSI"? what's that?

🟣 misato sighs — you know, it keeps surprising me how many youths don't know about the acronym RSI, which stands for Recursive Self-Improvement. it's pretty indicative of how little they're thinking about it.

🟢 shinji — i mean, of course! recursive self-improvement is an obsolete old MIRI idea that doesn't apply to the AIs we have today.

🟣 misato — right, kids like you got into alignment by being spooked by chatbots. (what silly things do they even teach you in class these days?)

🟣 misato — you have to realize that the generation before you, the generation of ritsuko and i, didn't have the empirical evidence that AI was gonna be impressive. we started on something like the empty string, or at least coherent arguments where we had to actually build a gears-level inside-view understanding of what AI would be like, and what it would be capable of.

🟣 misato — to me, one of the core arguments that sold me on the importance of AI and alignment was recursive self-improvement — the idea that AI being better than humans at designing AI would be a very special, very critical point in time, downstream of which AI would be able to beat humans at everything.

🟢 shinji — but this turned out irrelevant, because AI is getting better than humans without RSI–

🟡 ritsuko — again, false. we can only observe AI getting better than humans at intellectual tasks without RSI, because when RSI is discovered and published, we die very shortly thereafter. you have a sort of consistent survivorship bias, where you keep thinking of a whole class of things as irrelevant because they don't seem impactful, when in reality they're the most impactful; they're so impactful that when they happen you die and are unable to observe them.

#### 4. action scoring

🟣 misato — so, i think i have a vague idea of what you're saying, now. top-down view of the universe, which is untractable but that's fine apparently, thanks to some mysterious capabilities; one-shot AI to get around various embedded agency difficulties. what's the actual utility function to align to, now? i'm really curious. i imagine a utility function assigns a value between 0 and 1 to any, uh, entire world? world-history? multiverse?

🟡 ritsuko — it assigns a value between 0 and 1 to any distribution of worlds, which is general enough to cover all three of those cases. but let's not get there yet; remember how the thing we're doing is untractable, and we're relying on an AI that can make guesses about it anyways? we're gonna rely on that fact a whole lot more.

🟣 misato — oh boy.

🟡 ritsuko — so, first: we're not passing a utility function. we're passing a math expression describing an "action-scoring function" — that is to say, a function attributing scores to actions rather than to distributions over worlds. we'll make the program deterministic and make it ignore all input, such that the AI has no ability to steer its result — its true result is fully predetermined, and the AI has no ability to hijack that true result.

🟣 misato — wait, "hijack it"? aren't we assuming an inner-aligned AI, here?

🟡 ritsuko — i don't like this term, "inner-aligned"; just like "AGI", people use it to mean too many different and unclear things. we're assuming an AI which does its best to pick an answer to a math problem. that's it.

🟡 ritsuko — we don't make an AI which tries to not be harmful with regards to its side-channels, such as hardware attacks — except for its output, it needs to be strongly boxed, such that it can't destroy our world by manipulating software or hardware vulnerabilities. similarly, we don't make an AI which tries to output a solution we like, it tries to output a solution which the math would score high. narrowing what we want the AI to do greatly helps us build the right thing, but it does add constraints to our work.

🟡 ritsuko starts scribbling on a piece of paper on her desk — let's write down some actual math here. let's call $Ω$ the set of world-states, $ΔΩ$ the set of distributions over world-states, and $A$ the set of actions.

🟢 shinji — what are the types of all of those?

🟡 ritsuko — let's not worry about that, for now. all we need to assume for the moment is that those sets are countable. we could define both $Ω≔𝔹^*$ and $A≔𝔹^*$ — define them both as the set of finite bitstrings — and this would functionally capture all we need. as for distributions over world-states $ΔΩ$, we'll define $ΔX≔\{f \mid f∈X→[0;1], \sum_{x∈X}^{x} f(x)≤1\}$ for any countable set $X$, and we'll call "mass" the number which a distribution associates to any element.

🟣 misato — woah, woah, hold on, i haven't looked at math in a while. what do all those squiggles mean?

🟡 ritsuko — $ΔX$ is defined as the set of functions $f$, which take an $X$ and return a number between $0$ and $1$, such that if you take the $f$ of all $x$'s in $X$ and add those up, you get a number not greater than $1$. note that i use a notation of sums $∑$ where the variables being iterated over are above the $∑$ and the constraints that must hold are below it — so this sum adds up all of the $f(x)$ for each $x$ such that $x∈X$.

🟣 misato — um, sure. i mean, i'm not quite sure what this represents yet, but i guess i get it.

🟡 ritsuko — the set $ΔX$ of distributions over $X$ is basically like saying "for any amount of mass of at most 1, what are some ways to distribute that mass among some or all of the $X$'s?" each of those ways is a distribution; each of those ways is an $f$ in $ΔX$.
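as a concrete illustration of this definition, here is a minimal python sketch that checks whether a finitely-supported function belongs to $ΔX$. the dict representation, the use of exact fractions, and the name `is_distribution` are assumptions made for the example, not part of the QACI math.

```python
from fractions import Fraction
from typing import Dict, Hashable


def is_distribution(f: Dict[Hashable, Fraction]) -> bool:
    """Check membership in ΔX for a finitely-supported f.

    Elements of X missing from the dict are taken to have mass 0.
    """
    masses = list(f.values())
    # every mass must lie in [0;1], and the total mass must not exceed 1
    return all(0 <= m <= 1 for m in masses) and sum(masses) <= 1


# hypothetical examples over X = finite bitstrings, written as text
full = {"0": Fraction(1, 2), "1": Fraction(1, 4), "00": Fraction(1, 4)}
partial = {"0": Fraction(1, 3)}  # total mass below 1 is still allowed
too_much = {"0": Fraction(3, 4), "1": Fraction(1, 2)}  # total mass 5/4

print(is_distribution(full), is_distribution(partial), is_distribution(too_much))
```

note that `partial` is accepted: the definition only asks for the total mass to be at most 1, not exactly 1.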

🟡 ritsuko — anyways. the AI will take as input an untractable math expression of type $A→[0;1]$, and return a single $A$. note that we're in math here, so "is of type" and "is in set" are really the same thing; we'll use $∈$ to denote both set membership and type membership, because they're the same concept. for example, $A→[0;1]$ is the set of all functions taking as input an $A$ and returning a $[0;1]$ — returning a real number between $0$ and $1$.

🟢 shinji — hold on, a real number?

🟡 ritsuko — well, a real number, but we're passing to the AI a discrete piece of math which will only ever describe countable sets, so we'll only ever describe countably many of those real numbers. infinitely many, but countably infinitely many.

🟣 misato — so the AI has type $(A→[0;1])→A$, and we pass it an action-scoring function of type $A→[0;1]$ to get an action. checks out. where do utility functions come in?
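to make the type signature concrete, here is a toy python sketch of something with (roughly) the type $(A→[0;1])→A$. it enumerates a small explicit list of candidate actions, which is an assumption of the sketch: the AI in the dialogue is handed an untractable expression and has to make guesses rather than enumerate. `one_shot_ai`, `toy_score` and the candidate list are hypothetical names and data.

```python
from typing import Callable, Iterable

Action = str  # stand-in for A ≔ finite bitstrings, written as text
ScoringFunction = Callable[[Action], float]  # A → [0;1]


def one_shot_ai(score: ScoringFunction, candidates: Iterable[Action]) -> Action:
    """Return the best-scoring candidate action, then halt.

    The extra `candidates` argument is a simplification: it replaces
    the guessing that a real system would have to do over all of A.
    """
    return max(candidates, key=score)


def toy_score(action: Action) -> float:
    # hypothetical scoring function: the fraction of bits set to 1
    return action.count("1") / len(action) if action else 0.0


best = one_shot_ai(toy_score, ["0000", "0101", "0111", "10"])
print(best)
```

the scoring function is deterministic and takes nothing but the action, matching the point above that the AI has no way to steer the true score of any given action.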

🟡 ritsuko — they don't need to come in at all, actually! we'll be defining a piece of math which describes the world for the purpose of pointing at the humans who will decide on a scoring function, but the scoring function will only be over actions the AI should take.

🟡 ritsuko — the AI doesn't need to know that its math points to the world it's in; and in fact, conceptually, it isn't told this at all. on a fundamental, conceptual level, it is not being told to care about the world it's in — if it could, it would take over our world and kill everyone in it to acquire as much compute as possible, and plausibly along the way drop an anvil on its own head because it doesn't have embedded agency with regards to the world around itself.

🟡 ritsuko — we will just very carefully box it such that its only meaningful output into our world, the only bits of steering it can predictably use, are those of the action it outputs. and we will also have very carefully designed it such that the only thing it ultimately cares about, is that that output have as high of an expected scoring as possible — it will care about this intrinsically, and nothing else intrinsically, such that doing that will be more important than hijacking our world through that output.

🟡 ritsuko — this meaning of "inner-alignment" is still hard to accomplish, but it is much better defined, much narrower, and thus hopefully much easier to accomplish than the "full" embedded-from-the-start alignments which very slow, very careful corrigibility-based AI alignment would result in.

#### 5. early math & realityfluid

🟣 misato — so what does that scoring function actually look like?

🟡 ritsuko — you know what, i hadn't started mathematizing my alignment idea yet; this might be a good occasion to get started on that!

🟡 ritsuko wheels in a whiteboard — so, what i expect is that the order in which we're gonna go over the math is going to be the opposite order to that of the final math report on QACI. here, we'll explore things from the top-down, filling in details as we go — whereas the report will go from the bottom-up, fully defining constructs and then using them.

$Prior∈ΔHypothesis$

$LooksLikeThisWorld∈Hypothesis→[0;1]$

$HowGood∈A×Hypothesis→[0;1]$

$Score(action)≔\sum_{h∈Hypothesis}^{h} Prior(h)⋅LooksLikeThisWorld(h)⋅HowGood(action,h)$

🟡 ritsuko — this is roughly what we'll be doing here. go over all hypotheses $h$ the AI could have within some set of hypotheses, called $Hypothesis$; measure their $Prior$ probability, the $LooksLikeThisWorld$ that they correspond to our world, and how good the $action$ is in them. this is the general shape of expected scoring for actions.
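the shape of this expected scoring can be sketched in python over a tiny finite set of hypotheses. all the numbers below are made up for illustration, and the string types for hypotheses and actions are placeholders; the real $Hypothesis$ set is countably infinite and its contents are discussed further down.

```python
from typing import Callable, Dict, Tuple

Hypothesis = str  # placeholder type for the sketch
Action = str


def score(
    action: Action,
    prior: Dict[Hypothesis, float],
    looks_like_this_world: Callable[[Hypothesis], float],
    how_good: Callable[[Action, Hypothesis], float],
) -> float:
    """Sum over h of Prior(h) * LooksLikeThisWorld(h) * HowGood(action, h)."""
    return sum(
        prior[h] * looks_like_this_world(h) * how_good(action, h)
        for h in prior
    )


# hypothetical toy data: three hypotheses, two actions
prior = {"h1": 0.5, "h2": 0.25, "h3": 0.25}
match = {"h1": 1.0, "h2": 0.5, "h3": 0.0}
goodness: Dict[Tuple[Action, Hypothesis], float] = {
    ("a", "h1"): 0.2, ("a", "h2"): 1.0, ("a", "h3"): 1.0,
    ("b", "h1"): 0.8, ("b", "h2"): 0.0, ("b", "h3"): 0.0,
}

score_a = score("a", prior, match.__getitem__, lambda a, h: goodness[(a, h)])
score_b = score("b", prior, match.__getitem__, lambda a, h: goodness[(a, h)])
print(score_a, score_b)
```

in this toy setup action `a` does very well under `h3`, but `h3` does not look like this world at all, so that goodness contributes nothing and `b` ends up with the higher score.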

🟢 shinji — wait, the set of hypotheses is called $Hypothesis$, not $Hypotheses$? that's a bit confusing.

🟡 ritsuko — this is pretty standard in math, shinji. the reason to call the set of hypotheses $Hypothesis$ is because, as explained before, sets are also types, and so $LooksLikeThisWorld$ will be of type $Hypothesis→[0;1]$ rather than $Hypotheses→[0;1]$.

🟣 misato — what's in a $Hypothesis$, exactly?

🟡 ritsuko — the set of all relevant beliefs about things. or rather, the set of all relevant beliefs except for logical facts. logical uncertainty will be a thing on the AI's side, not in the math — this math lives in the realm of "platonic perfect true math", and the AI will have beliefs about what its various parts tend to result in as one kind of logical belief, just like it'll have beliefs about other logical facts.

🟣 misato — so, a mathematical object representing empirical beliefs?

🟡 ritsuko — i would rather put it as a pair of: beliefs about what's real ("realityfluid" beliefs); and beliefs about where, in the set of real things, the AI is ("indexical" beliefs). but this can be simplified by allocating realityfluid across all mathematical/computational worlds (this is equivalent to assuming the tegmark level 4 multiverse is real, and can be done by assuming the cosmos to be a "universal complete" program running all computations) and then all beliefs are indexical. these two possibilities work out to pretty much the same math, anyways.

🟢 shinji — what the hell is "realityfluid"???

🟡 ritsuko — it's a very long story, i'm afraid.

🟣 misato — think of it as a measure of how some constant amount of "matteringness"/"realness" — typically 1 unit of it — is distributed across possibilities. even though it kinda mechanistically works like probability mass, it's "in the other direction": it represents what's actually real, rather than representing what we believe.

🟢 shinji — why would it sum to 1? what if there's an infinite amount of stuff out there?

🟡 ritsuko — indeed. this is why the most straightforward way to allocate realityfluid is to just imagine that the set of all that exists is a universal program whose computation is cut into time-steps each doing a constant amount of work, and then allocate some diminishing quantities of realityfluid to each time step.

🟣 misato — like saying that compute step number $n≥1$ has $\frac{1}{2^n}$ realityfluid?

🟡 ritsuko — that would indeed normalize, but it diminishes exponentially fast. this makes world-states exponentially unlikely in the amount of compute they exist after; and there are philosophical reasons to say that exponential unlikeliness is what should count as non-existing.

🟢 shinji — what the hell are you talking about??

🟡 ritsuko hands shinji a paper called "Why Philosophers Should Care About Computational Complexity" — look, this is a whole other tangent, but basically, polynomial amounts of computation correspond to "doing something", whereas exponential amounts of computation correspond to "magically obtaining something out of the ether", and this sort-of ramificates naturally across the rest of computational complexity applied to metaphysics and philosophy.

🟡 ritsuko — so instead, we can say that computation step number $n≥1$ has $\frac{1}{n^2}$ realityfluid. this only diminishes quadratically, which is satisfactory.

🟡 ritsuko — oh, and for the same reason, the universal program needs to be quantum — for example, it needs to be a quantum equivalent of the classical universal program but for quantum computation, implemented on something like a quantum turing machine. otherwise, unless BQP=BPP, quantum multiverses like ours might be exponentially expensive to compute, which would be strange.

🟢 shinji — why $n^2$? why not $n^{1.01}$ or $n^{37}$?

🟡 ritsuko — those do indeed all normalize — but we pick $2$ because at some point you just have to pick something, and $2$ is a natural, occam/solomonoff-simple number which works. look, just–
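to get a feel for the difference between the two allocations, here is a small numeric python sketch comparing a weight of one over two to the power $n$ with a weight proportional to one over $n$ squared. dividing the quadratic weights by $π^2/6$ (the value of the infinite sum of $1/n^2$) so that they total exactly 1 is an assumption of the sketch; the dialogue only asks that the total be finite. the function names are hypothetical.

```python
import math


def exponential_fluid(n: int) -> float:
    """Realityfluid 1 / 2^n for compute step n >= 1."""
    return 2.0 ** -n


def quadratic_fluid(n: int) -> float:
    """Realityfluid proportional to 1 / n^2, rescaled to total 1 in the limit."""
    return (6 / math.pi ** 2) / n ** 2


steps = range(1, 1001)
exp_total = sum(exponential_fluid(n) for n in steps)
quad_total = sum(quadratic_fluid(n) for n in steps)

# both totals stay bounded by 1, but the per-step weights differ enormously
print(exp_total, quad_total)
print(exponential_fluid(100), quadratic_fluid(100))
```

by step 100 the exponential weight is already many orders of magnitude smaller than the quadratic one, which is the sense in which late world-states become "exponentially unlikely" under the first scheme.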

🟢 shinji — and why are we assuming the universe is made of discrete computation anyways? isn't stuff made of real numbers?

🟡 ritsuko sighs — look, this is what the church-turing-deutsch principle is about. for any universe made up of real numbers, you can approximate it thusly:

• compute 1 step of it with every number truncated to its first 1 binary digit of precision
• compute 2 steps of it with every number truncated to its first 2 binary digits of precision

that is, run it for 1 time step with 1 bit of precision, then 2 time steps with 2 bits of precision, then 3 with 3, and so on. for any piece of branch-spacetime which is only finitely far away from the start of its universe, there exists a threshold at which it starts being computed in a way that is indistinguishable from the version with real numbers.
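the schedule described here can be sketched in python on a toy "universe" with a single real-valued state. the dynamics (`step` squares the state), the starting value of one third, and all function names are hypothetical choices for illustration; python floats are themselves finite approximations, so this only shows the shape of the argument.

```python
import math
from typing import Dict, List


def truncate(x: float, bits: int) -> float:
    """Keep only the first `bits` binary digits after the binary point."""
    scale = 2 ** bits
    return math.floor(x * scale) / scale


def step(x: float) -> float:
    # hypothetical toy physics: one real-valued state, squared each time step
    return x * x


def run_truncated(x0: float, steps: int, bits: int) -> List[float]:
    """States at times 0..steps, with every number truncated to `bits` bits."""
    states = [truncate(x0, bits)]
    for _ in range(steps):
        states.append(truncate(step(states[-1]), bits))
    return states


def dovetail(x0: float, max_n: int) -> Dict[int, List[float]]:
    """Round n runs n time steps with n bits of precision, for n = 1..max_n."""
    return {n: run_truncated(x0, n, n) for n in range(1, max_n + 1)}


rounds = dovetail(1 / 3, 40)
exact_t3 = (1 / 3) ** 8  # untruncated state at time 3: ((x^2)^2)^2
errors_at_t3 = [abs(rounds[n][3] - exact_t3) for n in range(3, 41)]
print(errors_at_t3[0], errors_at_t3[-1])
```

the state at time 3 is badly wrong in the early rounds, but past some round its error is far below anything an observer inside the toy universe at that time could detect, and every fixed finite time is eventually reached by such a round.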

🟢 shinji — but they're only an approximation of us! they're not the real thing!

🟡 ritsuko sighs — you don't know that. you could be the approximation, and you would be unable to tell. and so, we can work without uncountable sets of real numbers, since they're unnecessary to explain observations, and thus an unnecessary assumption to hold about reality.

🟢 shinji, frustrated — i guess. it still seems pretty contrived to me.

🟡 ritsuko — what else are you going to do? you're expressing things in math, which is made of discrete expressions and will only ever express countable quantities of stuff. there is no uncountableness to grab at and use.

🟣 misato — actually, can't we introduce turing jumps/halting oracles into this universal program? i heard that this lets us actually compute real numbers.

🟡 ritsuko — there's kind-of-a-sense in which that's true. we could say that the universal program has access to a first-degree halting oracle, or a 20th-degree; or maybe it runs for 1 step with a 1st degree halting oracle, then 2 steps with a 2nd degree halting oracle, then 3 with 3, and so on.

🟡 ritsuko — your program is now capable, at any time step, of computing an infinite amount of stuff. let's say one of those steps happens to run an entire universe of stuff, including a copy of us. how do you sub-allocate realityfluid? how much do we expect to be in there? you could allocate sub-compute-steps — with a 1st degree halting oracle executing at step $n≥1$, you allocate $\frac{1}{n^2 m^2}$ realityfluid to each of the infinitely many sub-steps $m≥1$ in the call to the halting-oracle. you're just doing discrete realityfluid allocation again, except now some of the realityfluid in your universe is allocated to people who have obtained results from a halting oracle.
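as a quick check that this sub-allocation stays finite, here is the arithmetic worked out, assuming each sub-step $m$ of the oracle call at step $n$ gets one over $n^2 m^2$ realityfluid, and using the standard fact that the sum of $1/m^2$ over all $m≥1$ equals $π^2/6$. standard sum notation is used here, with the range below the $∑$.

$\sum_{m≥1} \frac{1}{n^2 m^2} = \frac{1}{n^2} \sum_{m≥1} \frac{1}{m^2} = \frac{π^2}{6} ⋅ \frac{1}{n^2}$

$\sum_{n≥1} \sum_{m≥1} \frac{1}{n^2 m^2} = \left(\frac{π^2}{6}\right)^2 = \frac{π^4}{36}$

so the whole oracle call at step $n$ carries a constant multiple of the realityfluid that a plain step $n$ would have carried, and the total over all steps and sub-steps is still a finite number that can be rescaled to 1.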

🟡 ritsuko — this works, but what does it get you? assuming halting oracles is kind of a very strange thing to do, and regular computation with no halting oracles is already sufficient to explain this universe. so we don't. but sure, we could.

🟢 shinji ruminates, unsure where to go from there.

🟣 misato interrupts — hey, do we really need to cover this? let's say you found out that this whole view of things is wrong. could you fix your math then, to whatever is the correct thing?

🟡 ritsuko waves around — what?? what do you mean if it's wrong?? i'm not rejecting the premise that i might be wrong here, but like, my answer here depends a lot on in what way i'm wrong and what is the better / more likely correct thing. so, i don't know how to answer that question.

🟣 misato snaps shinji back to attention — that's fair enough, i guess. well, let's get back on track.

#### 6. precursor assistance

🟡 ritsuko — so, one insight i got for my alignment idea came from PreDCA, which stands for Precursor Detection, Classification, and Assistance. it consists of mathematizations for:

• the AI locating itself within possibilities
• locating the high-agenticness-thing which had lots of causation-bits onto itself — call it the "Precursor". this is supposed to find the human user who built/launched the AI. (Detection)
• a bunch of criteria to ensure that the precursor is the intended human user and not something else (Classification)
• extrapolating that precursor's utility function, and maximizing it (Assistance)

🟣 misato — what the hell kind of math would accomplish that?

🟡 ritsuko — well, it's not entirely clear to me. some of it is explained, other parts seem like they're expected to just work naturally. in any case, this isn't so important — the "Learning Theoretic Agenda" into which PreDCA fits is not fundamentally similar to mine, and i do not expect it to be the kind of thing that saves us in time. as far as i predict, that agenda has purchased most of the dignity points it will have cashed out when alignment is solved, when it inspired my own ideas.

🟢 shinji — and your agenda saves us in time?

🟡 ritsuko — a lot more likely so, yes! for one, i am not trying to build an entire theory of intelligence and machine learning, and i'm not trying to develop an elegant new form of bayesianism whose model of the world has concerning philosophical ramifications which, while admittedly possibly only temporary, make me concerned about the coherency of the whole edifice. what i am trying to do, is hack together the minimum viable world-saving machine about which we'd have enough confidence that launching it is better expected value than not launching it.

🟡 ritsuko — anyways, the important thing is that that idea made me think "hey, what else could we do to make even more sure the selected precursor is the human user we want, and not something else like a nearby fly or the process of evolution?" and then i started to think of some clever schemes for locating the AI in a top-down view of the world, without having to decode physics ourselves, but rather by somehow pointing to the user "through" physics.

🟣 misato — what does that mean, exactly?

🟡 ritsuko — well, remember how PreDCA points to the user from-the-top-down? the way it tries to locate the user is by looking for patterns, in the giant computation of the universe, which satisfy these criteria. this fits in the general notion of generalized computation interpretability, which is fundamentally needed to care about the world because you want to detect not just simulated moral patients, but arbitrarily complexly simulated moral patients. so, you need this anyways, and it is what "looking inside the world to find stuff, no matter how it's encoded" looks like.

🟣 misato — and what sort of patterns are we looking for? what are the types here?

🟡 ritsuko — as far as i understand, PreDCA looks for programs, or computations, which take some input and return a policy. my own idea is to locate something less abstract, about which we can actually have information-theoretic guarantees: bitstrings.

🟣 misato — …just raw bitstrings?

🟡 ritsuko — that's right. the idea here is kinda like doing an incantation, except the incantation we're locating is a very large piece of data which is unlikely to be replicated outside of this world. imagine generating a very large (several gigabytes) file, and then asking the AI "look for pieces of information, in the set of all computations, which look like that pattern." we call "blobs" such bitstrings serving as anchors to find our world and location-within-it in the set of possible world-states and locations-within-them.

#### 7. blob location

🟡 ritsuko — for example, let's say the universe is a conway's game of life. then, the AI could have a set of hypotheses as programs which take as input the entire state of the conway's game of life grid at any instant, and return a bitstring which must be equal to the blob.

🟡 ritsuko — first, we define $Ω ≔ \{ω \mid ω∈𝒫(ℤ^2), \#ω∈ℕ\}$ (uppercase omega, a set of lowercase omega) as the set of "world-states" — states of the grid, defined as the set of cell positions whose cell is alive.

🟢 shinji — what's $𝒫(ℤ^2)$ and $\#ω$?

🟡 ritsuko — $ℤ^2$ is the set of pairs whose elements are both members of $ℤ$, the set of integers. so $ℤ^2$ is the set of pairs of integers — that is, grid coordinates. then, $𝒫(ℤ^2)$ is the set of subsets of $ℤ^2$. finally, $\#ω$ is the size of set $ω$ — requiring that $\#ω∈ℕ$ is akin to requiring that $ω$ is a finite set, rather than infinite. let's also define:

• $𝔹=\{⊤,⊥\}$ as the set of booleans
• $𝔹^*$ as the set of finite bitstrings
• $𝔹^n$ as the set of bitstrings of length $n$
• $|b|$ as the length of bitstring $b$

🟡 ritsuko — what do you think "locate blob $b∈𝔹^*$ in world-state $ω∈Ω$" could look like, mathematically?

🟣 misato — let's see — i can use the set of bitstrings of same length as $b$, which is $𝔹^{|b|}$. let's build a set of $\{f \mid f∈Ω→𝔹^{|b|}…$

🟢 shinji — wait, $Ω→𝔹^{|b|}$ is the set of functions from $Ω$ to $𝔹^{|b|}$. but we were talking about programs from $Ω$ to $𝔹^{|b|}$. is there a difference?

🟡 ritsuko — this is a very good remark, shinji! indeed, we need to do a bit more work; for now we'll just posit that for any sets $A,B$, $A \overset{H}{→} B$ is the set of always-halting, always-succeeding programs taking as input an $A$ and returning a $B$.

🟣 misato — let's see — what about $\{f \mid f∈Ω \overset{H}{→} 𝔹^{|b|}, f(ω)=b\}$?
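as a toy illustration of that set (a sketch under assumptions, not part of the formalism): world-states are finite sets of live cell coordinates, bitstrings are tuples of booleans, and the hypothetical `read_row` family stands in for "programs" that read a blob off the grid. the real set ranges over all always-halting programs, which this finite candidate list does not attempt.

```python
from typing import Callable, FrozenSet, List, Tuple

# assumed toy encoding: a world-state is the finite set of live cell coordinates
WorldState = FrozenSet[Tuple[int, int]]
Bits = Tuple[bool, ...]
Program = Callable[[WorldState], Bits]

def read_row(x0: int, y: int, n: int) -> Program:
    """hypothetical 'program': read n cells of row y, starting at column x0."""
    def f(world: WorldState) -> Bits:
        return tuple((x0 + i, y) in world for i in range(n))
    return f

def finders(programs: List[Program], world: WorldState, blob: Bits) -> List[Program]:
    """the subset {f | f(world) == blob} of a finite list of candidate programs."""
    return [f for f in programs if f(world) == blob]

world: WorldState = frozenset({(0, 0), (2, 0), (3, 0), (1, 5)})
blob: Bits = (True, False, True, True)
candidates = [read_row(x0, y, len(blob)) for x0 in range(-2, 3) for y in range(0, 6)]
matching = finders(candidates, world, blob)
```

in this toy world only one candidate, the one reading row 0 from column 0, returns the blob.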

🟡 ritsuko — you're starting to get there — this is indeed the set of programs which return $b$ when taking $ω$ as input. however, it's merely a set — it's not very useful as is. what we'd really want is a distribution over such functions. not only would this give a weight to different functions, but summing over the entire distribution could also give us some measure of "how easy it is to find $b$ in $ω$". remember the definition of distributions, $Δ_X$?

🟢 shinji — oh, i remember! it's the set of functions in $X→[0;1]$ which sum up to at most one over all of $X$.

🟡 ritsuko — indeed! so, we're gonna posit what i'll call kolmogorov simplicity, $K^-_X ∈ Δ_X ∩ (X→(0;1))$, which is like kolmogorov complexity except that it's a distribution, never returns 0 nor 1 for a single element, and importantly it returns something like the inverse of complexity. it gives some amount of "mass" to every element in some (countable) set $X$.

🟣 misato — oh, i know then! the distribution, for each $f∈Ω \overset{H}{→} 𝔹^{|b|}$, must return $\begin{cases} K^-_{Ω \overset{H}{→} 𝔹^*}(f) & \text{if } f(ω)=b \\ 0 & \text{if } f(ω)≠b \end{cases}$

🟡 ritsuko — that's right! we can start to define $\mathrm{Loc}_n ∈ Ω×𝔹^n → Δ_{Ω \overset{H}{→} 𝔹^n}$ as the function that takes as input a pair of world-state $ω∈Ω$ and blob $b∈𝔹^n$ of length $n$, and returns a distribution over programs that "find" $b$ in $ω$. plus, since functions $f$ are weighed by their kolmogorov simplicity, for complex $b$'s they're "encouraged" to find the bits of complexity of $b$ in $ω$, rather than those bits of complexity being contained in $f$ itself.

🟡 ritsuko — note also that this $\mathrm{Loc}_n$ distribution over $Ω \overset{H}{→} 𝔹^n$ returns, for any function $f$, either $K^-_{Ω \overset{H}{→} 𝔹^n}(f)$ or $0$, which entails that for any given $ω,b$, the sum of $\mathrm{Loc}_n(ω,b)(f)$ over all $f$'s is less than one — that sum represents in a sense "how easy it is to find $b$ in $ω$" or "the probability that $b$ is somewhere in $ω$".

$∀(ω,b)∈Ω×𝔹^n: \sum^{f}_{f∈Ω \overset{H}{→} 𝔹^n} \mathrm{Loc}_n(ω,b)(f) < 1$

🟡 ritsuko — the notation here, $\mathrm{Loc}_n(ω,b)(f)$ is because $\mathrm{Loc}_n(ω,b)$ returns a distribution $Δ_{Ω \overset{H}{→} 𝔹^n}$, which is itself a function $(Ω \overset{H}{→} 𝔹^n)→[0;1]$ — so we apply $\mathrm{Loc}$ to $ω,b$, and then we sample the resulting distribution on $f$.

🟢 shinji — "the sum represents"? what do you mean by "represents"?

🟡 ritsuko — well, it's the concept which i'm trying to find a "true name" for, here. "how much is the blob $b$ located in world-state $ω$?" well, as much as the sum of the kolmogorov simplicity of every program that returns $b$ when taking $ω$ as input.
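a rough numerical sketch of that sum, under loud assumptions: worlds and blobs are plain strings, each candidate program comes with a made-up description length in bits, and `simplicity` (2 to the minus that length) is only a stand-in for kolmogorov simplicity, which is not computable. the program list is hypothetical.

```python
from fractions import Fraction

def simplicity(description_bits: int) -> Fraction:
    """toy stand-in for kolmogorov simplicity: 2^-(description length in bits)."""
    return Fraction(1, 2 ** description_bits)

def loc_mass(programs, world: str, blob: str) -> Fraction:
    """sum of simplicity over the programs f for which f(world) == blob."""
    return sum(
        (simplicity(bits) for bits, f in programs if f(world) == blob),
        Fraction(0),
    )

# (assumed description length, program) pairs
programs = [
    (3, lambda w: w[4:8]),         # short program: reads the blob out of the world
    (5, lambda w: w[0:4]),
    (6, lambda w: w[::-1][1:5]),
    (40, lambda w: "1001"),        # long program: ignores the world, hard-codes the blob
]

blob = "1001"
mass_when_present = loc_mass(programs, "0110100111", blob)
mass_when_absent = loc_mass(programs, "0000000000", blob)
```

when the blob really is in the world, a short program finds it and the mass is large; when it is not, only the program that hard-codes the blob matches, and it pays for all of the blob's bits in its own description.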

🟣 misato — and then what? i feel like my understanding of how this ties into anything is still pretty loose.

🟡 ritsuko — so, we're actually gonna get two things out of $Loc$: we're gonna get how much $ω$ contains $b$ (as the sum of $Loc$ for all $f$'s), but we're also gonna get how to get another world-state that is like $ω$, except that $b$ is replaced with something else.

🟢 shinji — how are we gonna get that??

🟡 ritsuko — here's my idea: we're gonna make $f(ω)$ return not just $𝔹^*$ but rather $𝔹^n×𝔹^*$ — a pair of the blob and a "free bitstring" $τ$ (tau) which it can use to store "everything in the world-state except $b$". and we'll also sample programs $g∈𝔹^n×𝔹^* \overset{H}{→} Ω$ which "put the world-state back together" given the same free bitstring, and a possibly different counterfactual blob than $b$.

🟣 misato — so, for $ω,b$, $Loc$ is defined as something like…

🟢 shinji stares at the math for a while — actually, shouldn't the $if$ statement be more general? you don't just want $g$ to work on $b$, you want $g$ to work on any other blob of the same length.

🟡 ritsuko — that's correct shinji! let's call the original blob $b$ the "factual blob", let's call other blobs of the same length we could insert in its stead "counterfactual blobs" and write them as $b′$ — we can establish that $′$ (prime) will denote counterfactual things in general.

🟣 misato — so it's more like…

🟣 misato — …$g(b′,τ)$ should equal, exactly?

🟡 ritsuko — we don't know what it should equal, but we do know something about what it equals: $f$ should work on that counterfactual and find the same counterfactual blob again.

🟡 ritsuko — actually, let's make $\mathrm{Loc}_n$ be merely a distribution over functions that produce counterfactual world-states from counterfactual blobs $𝔹^n→Ω$ — let's call those "counterfactual insertion functions" and denote them $γ$ and their set $Γ_n$ (gamma) — and we'll encapsulate $τ$ away from the rest of the math:

$\mathrm{Loc}_n(ω,b)(γ) ≔ \sum^{f,g,τ}_{\substack{f∈Ω \overset{H}{→} 𝔹^n×𝔹^* \\ g∈𝔹^n×𝔹^* \overset{H}{→} Ω \\ f(ω)=(b,τ) \\ ∀b′∈𝔹^n:\ f(g(b′,τ))=(b′,τ),\ γ(b′)=g(b′,τ)}} K^-_{Ω \overset{H}{→} 𝔹^n×𝔹^*}(f) ⋅ K^-_{𝔹^n×𝔹^* \overset{H}{→} Ω}(g)$
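to make the roles of $f$, $g$, $τ$ and $γ$ concrete, here is a minimal sketch in which a world-state is just a tuple of booleans and the blob sits at a fixed, hypothetical `OFFSET`. this is one hand-written pair of programs, not a sample from any simplicity prior, and the names `is_valid_location` and `gamma_for` are made up for the example.

```python
from itertools import product

N = 3        # blob length (toy value)
OFFSET = 2   # hypothetical position of the blob inside the world-state

def f(world):
    """deconstruct: return (blob, tau), where tau is everything except the blob."""
    blob = world[OFFSET:OFFSET + N]
    tau = world[:OFFSET] + world[OFFSET + N:]
    return blob, tau

def g(blob, tau):
    """reconstruct a world-state from a (possibly counterfactual) blob and tau."""
    return tau[:OFFSET] + blob + tau[OFFSET:]

def is_valid_location(world, b):
    """check f(world) = (b, tau) and f(g(b', tau)) = (b', tau) for every blob b'."""
    found, tau = f(world)
    if found != b:
        return False
    return all(
        f(g(bp, tau)) == (bp, tau)
        for bp in product((False, True), repeat=N)
    )

def gamma_for(world):
    """the counterfactual insertion function: b' -> g(b', tau), with tau hidden."""
    _, tau = f(world)
    return lambda bp: g(bp, tau)

world = (True, False, True, True, False, False, True)
gamma = gamma_for(world)
counterfactual_world = gamma((False, False, False))
```

`gamma` swaps the three blob cells and leaves the rest of the world-state untouched, which is the behaviour the constraints in the sum are meant to select for.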

🟢 shinji — isn't $f(g(b′,τ))=(b′,τ)$ a bit circular?

🟡 ritsuko — well, yes and no. it leaves a lot of degrees of freedom to $f$ and $g$, perhaps too much. let's say we had some function $SimilarPasts∈Ω×Ω→[0;1]$ — let's not worry about how it works. then we could weigh each "blob location" by how much counterfactual world-states are similar, when sampled over all counterfactual blobs.

🟣 misato — maybe we should also constrain the $f,g$ programs for how long they take to run?

🟡 ritsuko — ah yes, good idea. let's say that for $x∈X$ and $f∈X \overset{H}{→} Y$, $R(f,x)∈ℕ∖\{0\}$ is how long it takes to run program $f$ on input $x$, in some amount of steps each doing a constant amount of work — such as steps of compute in a turing machine.

$\mathrm{Loc}_n(ω,b)(γ) ≔ \sum^{f,g,τ}_{\substack{f∈Ω \overset{H}{→} 𝔹^n×𝔹^* \\ g∈𝔹^n×𝔹^* \overset{H}{→} Ω \\ f(ω)=(b,τ) \\ ∀b′∈𝔹^n:\ f(γ(b′))=(b′,τ),\ γ(b′)=g(b′,τ)}} K^-_{Ω \overset{H}{→} 𝔹^n×𝔹^*}(f) ⋅ K^-_{𝔹^n×𝔹^* \overset{H}{→} Ω}(g) ⋅ \sum^{b′}_{b′∈𝔹^n} \frac{1}{\#𝔹^n} ⋅ \frac{\mathrm{SimilarPasts}(ω,g(b′,τ))}{R(g,(b′,τ))+R(f,g(b′,τ))}$

🟡 ritsuko — (i've also replaced $f(g(b′,τ))$ with $f(γ(b′))$ since that's shorter and they're equal anyways)

🟣 misato — where does the first sum end, exactly?

🟡 ritsuko — it applies to the whole– oh, you know what, i can achieve the same effect by flattening the whole thing into a single sum. and renaming the $b′$ in $∀b′∈𝔹^n$ to $b′′$ to avoid confusion.

$\mathrm{Loc}_n(ω,b)(γ) ≔ \sum^{f,g,τ,b′}_{\substack{f∈Ω \overset{H}{→} 𝔹^n×𝔹^* \\ g∈𝔹^n×𝔹^* \overset{H}{→} Ω \\ b′∈𝔹^n \\ f(ω)=(b,τ) \\ ∀b′′∈𝔹^n:\ f(γ(b′′))=(b′′,τ),\ γ(b′′)=g(b′′,τ)}} K^-_{Ω \overset{H}{→} 𝔹^n×𝔹^*}(f) ⋅ K^-_{𝔹^n×𝔹^* \overset{H}{→} Ω}(g) ⋅ \frac{1}{\#𝔹^n} ⋅ \frac{\mathrm{SimilarPasts}(ω,g(b′,τ))}{R(g,(b′,τ))+R(f,g(b′,τ))}$

🟢 shinji — are we still operating in conway's game of life here?

🟡 ritsuko — oh yeah, now might be a good time to start generalizing. we'll carry around not just world-states $ω∈Ω$, but initial world-states $α∈Ω$ (alpha). those are gonna determine the start of universes — distributions of world-states being computed-over-time — and we'll use them when we're computing world-states forwards or comparing the age of world-states. for example $SimilarPasts$ probably needs this, so we'll need to pass it to $\mathrm{Loc}_n$ which will now be of type $Ω×Ω×𝔹^n→Δ_{Γ_n}$:

$\mathrm{Loc}_n(α,ω,b)(γ) ≔ \sum^{f,g,τ,b′}_{\substack{f∈Ω \overset{H}{→} 𝔹^n×𝔹^* \\ g∈𝔹^n×𝔹^* \overset{H}{→} Ω \\ b′∈𝔹^n \\ f(ω)=(b,τ) \\ ∀b′′∈𝔹^n:\ f(γ(b′′))=(b′′,τ),\ γ(b′′)=g(b′′,τ)}} K^-_{Ω \overset{H}{→} 𝔹^n×𝔹^*}(f) ⋅ K^-_{𝔹^n×𝔹^* \overset{H}{→} Ω}(g) ⋅ \frac{1}{\#𝔹^n} ⋅ \frac{\mathrm{SimilarPasts}_α(ω,g(b′,τ))}{R(g,(b′,τ))+R(f,g(b′,τ))}$

#### 8. constrained mass notation

🟢 shinji — i notice that you're multiplying together your "kolmogorov simplicities" and $\frac{1}{\#𝔹^n}$ and now $SimilarPasts$ divided by a sum of how long they take to run. what's going on here exactly?

🟡 ritsuko — well, each of those numbers is a "confidence amount" — scalars between 0 and 1 that say "how much does this iteration of the sum capture the thing we want", like probabilities. multiplication $⋅$ is like the logical operator "and" $∧$ except for confidence ratios, you know.

🟢 shinji — ah, i see. so these sums do something kinda like "expected value" in probability?

🟡 ritsuko — something kinda like that. actually, this notation is starting to get unwieldy. i'm noticing a bunch of this pattern: $\sum^{x}_{x∈SomeSet} SomeDistribution(x)⋅expression$

🟣 misato — so, if you want to use the standard probability theory notations, you need random variables which–

🟡 ritsuko — ugh, i don't like random variables, because the place at which they get substituted for the sampled value is ambiguous. here, i'll define my own notation:

$𝐌^{v_1,…,v_p}_{\substack{x_1:X_1 \\ ⋮ \\ x_n:X_n \\ C_1 \\ ⋮ \\ C_m}}[V] ≔ \sum^{v_1,…,v_p}_{\substack{x_1∈\mathrm{domain}(X_1) \\ ⋮ \\ x_n∈\mathrm{domain}(X_n) \\ C_1 \\ ⋮ \\ C_m}} X_1(x_1)⋅…⋅X_n(x_n)⋅V$

🟡 ritsuko — $𝐌$ will stand for "constrained mass", and it's basically syntactic sugar for sums, where $x:X$ means "sum over $x∈\mathrm{domain}(X)$ (where $\mathrm{domain}$ returns the set of arguments over which a function is defined), and then multiply each iteration of the sum by $X(x)$". now, we just have to define uniform distributions over finite sets as…

🟢 shinji — $\mathrm{Uniform}_X(x) ≔ \frac{1}{\#X}$ for finite set $X$?

🟡 ritsuko — that's it! and now, $Loc$ is much more easily written down:

$\mathrm{Loc}_n(α,ω,b)(γ) ≔ 𝐌^{f,g,τ,b′}_{\substack{f:K^-_{Ω \overset{H}{→} 𝔹^n×𝔹^*} \\ g:K^-_{𝔹^n×𝔹^* \overset{H}{→} Ω} \\ b′:\mathrm{Uniform}_{𝔹^n} \\ f(ω)=(b,τ) \\ ∀b′′∈𝔹^n:\ f(γ(b′′))=(b′′,τ),\ γ(b′′)=g(b′′,τ)}} \left[\frac{\mathrm{SimilarPasts}_α(ω,g(b′,τ))}{R(g,(b′,τ))+R(f,g(b′,τ))}\right]$

🟢 shinji — huh. you know, i'm pretty skeptical of you inventing your own probability notations, but this is much more readable, when you know what you're looking at.
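for readers who prefer code to sums, here is one way the constrained-mass pattern could be sketched for finite distributions. the representation (dicts from element to mass) and the names `constrained_mass` and `uniform` are assumptions of this sketch; it also only handles variables that are sampled from a distribution, not ones bound purely by a constraint the way $τ$ is.

```python
from fractions import Fraction
from itertools import product

def uniform(xs):
    """uniform distribution over a finite collection, as a dict element -> mass."""
    xs = list(xs)
    return {x: Fraction(1, len(xs)) for x in xs}

def constrained_mass(dists, body=lambda *xs: 1, constraints=()):
    """sum of X1(x1) * ... * Xn(xn) * body(x1, ..., xn) over tuples meeting all constraints."""
    total = Fraction(0)
    for xs in product(*(d.keys() for d in dists)):
        if all(c(*xs) for c in constraints):
            weight = Fraction(1)
            for d, x in zip(dists, xs):
                weight *= d[x]
            total += weight * body(*xs)
    return total

die = uniform(range(1, 7))
# with no constraints this behaves like an expected value
expected_roll = constrained_mass([die], body=lambda x: x)
# with a constraint and the default body of 1, it is the mass of the constrained region
mass_of_sevens = constrained_mass([die, die], constraints=[lambda x, y: x + y == 7])
```

because constraints only remove terms, the result of a constrained sum with body 1 can fall below one, which is the "doesn't normalize" behaviour the dialogue relies on later.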

🟣 misato — so, are we done here? is this blob location?

🟡 ritsuko — well, i expect that some things are gonna come up later that are gonna make us want to change this definition. but right now, the only improvement i can think of is to replace $f:K^-_{Ω \overset{H}{→} 𝔹^n×𝔹^*}$ and $g:K^-_{𝔹^n×𝔹^* \overset{H}{→} Ω}$ with $(f,g):K^-_{(Ω \overset{H}{→} 𝔹^n×𝔹^*)×(𝔹^n×𝔹^* \overset{H}{→} Ω)}$.

🟣 misato — huh, what's the difference?

🟡 ritsuko — well, now we're sampling $f,g$ from kolmogorov simplicity at the same time, which means that if there is some large piece of information that they both use, they won't be penalized for using it twice but only once — a tuple containing two elements which have a lot of information in common only has that information counted once by $K^-$.

🟣 misato — and we want that?

🟡 ritsuko — yes! there are some cases where we'd want two mathematical objects to have a lot of information in common, and other places where we'd want them to not need to be dissimilar. here, it is clearly the former: we want the program that "deconstructs" the world-state into blob and everything-else, and the function that "reconstructs" a new world-state from a counterfactual blob and the same everything-else, to be able to share information as to how they do that.
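the "counted once" effect can be felt with an ordinary compressor. the sketch below uses `zlib` compressed size as a very crude, computable proxy for kolmogorov complexity (an assumption of this example; the real quantity is uncomputable), and two made-up "program descriptions" that share a large chunk of random data.

```python
import random
import zlib

def approx_complexity(data: bytes) -> int:
    """crude proxy for kolmogorov complexity: compressed size in bytes."""
    return len(zlib.compress(data, 9))

rng = random.Random(0)
# information both programs rely on, e.g. "how to find memory on a computer"
shared = bytes(rng.getrandbits(8) for _ in range(2000))
f_description = shared + b"deconstruct"
g_description = shared + b"reconstruct"

# scoring f and g separately pays for the shared information twice
separately = approx_complexity(f_description) + approx_complexity(g_description)
# scoring the pair lets the second copy be encoded as a reference to the first
jointly = approx_complexity(f_description + g_description)
```

the joint score comes out smaller than the sum of the separate scores, mirroring how a single simplicity measure over the pair $(f,g)$ penalizes their common information only once.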

#### 9. what now?

🟢 shinji — so we've put together a true name for "piece of data in the universe which can be replaced with counterfactuals". that's pretty nifty, i guess, but what do we do with it?

🟡 ritsuko — now, this is where the core of my idea comes in: in the physical world, we're gonna create a random unique enough blob on someone's computer. then we're going to, still in the physical world, read its contents right after generating it. if it looks like a counterfactual (i.e. if it doesn't look like randomness) we'll create another blob of data, which can be recognized by $Loc$ as an answer.

🟢 shinji — what does that entail, exactly?

🟡 ritsuko — we'll have created a piece of real, physical world, which lets us use $Loc$ to get the true name, in pure math, of "what answer would that human person have produced to this counterfactual question?"

🟣 misato — hold on — we already have this. the AI can already have an interface where it asks a human user something, and waits for our answer. and the problem with that is that, obviously, the AI hijacks us or its interface to get whatever answer makes its job easiest.

🟡 ritsuko — aha, but this is different! we can point at a counterfactual question-and-answer chunk-of-time (call it "question-answer counterfactual interval", or "QACI") which is before the AI's launch, in time. we can mathematically define it as being in the past of the AI, by identifying the AI with some other blob which we'll also locate using $Loc$, and demand that the blob identifying the AI be causally after the user's answer.

🟣 misato — huh.

🟡 ritsuko — that's another idea i got from PreDCA — making the AI pursue the values of a static version of its user in its past, rather than its user-over-time.

🟢 shinji — but we don't want the AI to lock-in our values, we want the AI to satisfy our values-as-they-evolve-over-time, don't we?

🟣 misato — well, shinji, there's multiple ways to phrase your mistake, here. one is that, actually, you do — but if you're someone reasonable, then the values you endorse are some metaethical system which is able to reflect and learn about what's good, and to let people and philosophy determine what can be pursued.

🟣 misato — but you do have values you want to lock in. your meta-values, your metaethics, you don't want those to be able to change arbitrarily. for example, you probably don't want to be able to become someone who wants everyone to maximally suffer. those endorsed, top-level, metaethics meta-values, are something you do want to lock in.

🟡 ritsuko — put it another way: if you're reasonable, then if the AI asks you what you want inside the question-answer counterfactual interval, you won't answer "i want everyone to be forced to watch the most popular TV show in 2023". you'll answer something more like "i want everyone to be able to reflect on their own values and choose what values and choices they endorse, and how, and that the field of philosophy can continue in these ways in order to figure out how to resolve conflicts", or something like that.

🟣 misato — wait, if the AI is asking the user counterfactual questions, won't it ask the user whatever counterfactual question brainhacks the user into responding whatever answer makes its job easiest? it can just hijack the QACI.

🟡 ritsuko — aha, but we don't have to have the AI formulate the questions! we could do something like: make the initial question some static question like "please produce an action that saves the world", and then the user thinks about it for a bit, returns an answer, and that answer is fed back into another QACI to the user. this loops until the user responds with an answer which starts with a special string like "okay, i'm done for sure:", followed by a bunch of text which the AI will interpret as a piece of math describing a scoring over actions, and it'll try to output an action which maximizes that.
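the control flow of that loop can be sketched as follows. everything here is a stand-in: `answer_counterfactual` is a hypothetical placeholder for one question-answer counterfactual interval, `toy_user` is a made-up user who passes themselves a counter, and `max_rounds` is a safety bound added for the example rather than anything in the plan.

```python
DONE_PREFIX = "okay, i'm done for sure:"

def run_qaci_chain(answer_counterfactual, first_question, max_rounds=1000):
    """feed each answer back in as the next question until the user signals completion."""
    question = first_question
    for _ in range(max_rounds):
        answer = answer_counterfactual(question)
        if answer.startswith(DONE_PREFIX):
            # the remainder is the text to be interpreted as a scoring over actions
            return answer[len(DONE_PREFIX):].strip()
        question = answer
    raise RuntimeError("the chain did not terminate within max_rounds")

def toy_user(question):
    """toy user: leaves numbered notes to their next self, and finishes on the third interval."""
    if question.startswith("notes:"):
        count = int(question.split(":")[1]) + 1
        if count >= 3:
            return DONE_PREFIX + " lambda a: 1.0"
        return f"notes:{count}"
    return "notes:1"

final_text = run_qaci_chain(toy_user, "please produce an action that saves the world")
```

note that the AI never writes a question in this loop: the first question is static, and every later one is whatever the user's previous self chose to pass along.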

🟢 shinji — so it's kinda like coherent extrapolated volition but for actions?

🟡 ritsuko — sure, i think of it as an implementation of CEV. it allows its user to run a long-reflection process. actually, that long-reflection process even has the ability to use a mathematical oracle.

🟣 misato — how does that work?

#### 10. blob signing & closeness in time

🟡 ritsuko — so, let's define $\mathrm{QACI}$ as a function, and this'll clarify what's going on. $q∈𝔹^*$ will be our initial random factual question blob. $\mathrm{QACI}∈Ω×Γ_{|q|}×𝔹^{|q|}→Δ_{𝔹^{|q|}}$ takes as parameter a blob location for the question — which, remember, comes in the form of a function you can use to produce counterfactual world-states with counterfactual blobs! — and a counterfactual question blob $q′$, and returns a distribution of possible answers $r$. it's defined as:

$\mathrm{QACI}(α,γ_q,q′)(r) ≔ 𝐌^{ω_r,γ_r}_{\substack{ω_r:Ω^→_α(γ_q(q′)) \\ γ_r:\mathrm{Loc}_{|q|}(α,ω_r,r)}}$

🟡 ritsuko — we're, for now, just positing that there is a function $Ω^→_α∈Ω→Δ_Ω$ (remember that $α$ defines a hypothesis for the initial state, and mechanics, of our universe) which, given a world-state, returns a distribution of world-states that are in its future. so this piece of math samples possible future world-states of the counterfactual world-state where $q$ was replaced with $q′$, and possible locations of possible answers in those world-states.

🟣 misato — $𝐌$? what does that mean?

🟡 ritsuko — here, the fact that $\mathrm{Loc}_n(α,ω,b)$ doesn't necessarily sum to 1 — we say that it doesn't normalize — means that $\mathrm{QACI}(α,γ_q,q′)(r)$ summed up over all $r∈𝔹^{|q|}$ can be less than 1. in fact, this sum will indicate "how easy is it to find the answer $r$ in futures of counterfactual world-states $γ_q(q′)$?" — and we use that as the distribution of answers.

🟣 misato — hmmm. wait, this just finds whichever-answers-are-the-easiest-to-find. what guarantees that $r$ looks like an answer at all?

🟡 ritsuko — this is a good point. maybe we should define something like $\mathrm{Sign}∈𝔹^*→𝔹^{|q|}$ which, to any input "payload" of a certain length, associates a blob which is actually highly complex, because $\mathrm{Sign}$ embeds a lot of bits of complexity. for example, maybe $\mathrm{Sign}(π)$ (where $π$ is the "payload") concatenates $π$ together with a long cryptographic hash of $π$ and of some piece of information highly entangled with our world-state.

$\mathrm{QACI}(α,γ_q,q′)(π_r) ≔ 𝐌^{ω_r,γ_r}_{\substack{ω_r:Ω^→_α(γ_q(q′)) \\ γ_r:\mathrm{Loc}_{|q|}(α,ω_r,\mathrm{Sign}(π_r))}}$

🟢 shinji — we're not signing the counterfactual question $q′$, only the answer payload $π_r$?

🟡 ritsuko — that's right. signatures matter for blobs we're finding; once we've found them, we don't need to sign counterfactuals to insert in their stead.
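one possible shape for such a signing function, as a sketch only: SHA-256 stands in for the "long cryptographic hash", `world_entropy` is a hypothetical stand-in for the piece of information entangled with our world-state, and the zero-padding scheme is an assumption of this example (it cannot distinguish payloads that differ only by trailing zero bytes).

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes

def sign(payload: bytes, world_entropy: bytes, blob_len: int) -> bytes:
    """toy Sign: zero-padded payload followed by a hash of payload and world data."""
    digest = hashlib.sha256(payload + world_entropy).digest()
    room = blob_len - DIGEST_LEN
    if len(payload) > room:
        raise ValueError("payload does not fit in the blob")
    return payload.ljust(room, b"\x00") + digest

def looks_signed(blob: bytes, world_entropy: bytes) -> bool:
    """check that the trailing digest matches the (unpadded) payload and world data."""
    body, digest = blob[:-DIGEST_LEN], blob[-DIGEST_LEN:]
    payload = body.rstrip(b"\x00")
    return hashlib.sha256(payload + world_entropy).digest() == digest

answer_blob = sign(b"hello", b"entropy", 64)
```

the point of the digest is that a bitstring which merely happens to look like an answer is very unlikely to also carry a matching signature.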

🟣 misato — so, it seems to me like how $Ω^→$ works here, is pretty critical. for example, if it contains a bunch of mass at world-states where some AI is launched, whether ours or another, then that AI will try to fill its future lightcone with answers that would match various $\mathrm{Sign}(π_r)$'s — so that our AI would find those answers instead of ours — and make those answers be something that maximizes their utility function rather than ours.

🟡 ritsuko — this is true! indeed, how we sample for $Ω^→$ is pretty critical. how about this: first, we'll pass the distribution into $\mathrm{Loc}$:

$\mathrm{QACI}(α,γ_q,q′)(π_r) ≔ 𝐌^{γ_r}_{γ_r:\mathrm{Loc}_{|q|}(α,Ω^→_α(γ_q(q′)),\mathrm{Sign}(π_r))}$

🟡 ritsuko — …and inside $\mathrm{Loc}_n$, which is now of type $\mathrm{Loc}_n∈Ω×Δ_Ω×𝔹^n→Δ_{Γ_n}$, for any $f,g$ we'll only sample world-states $ω$ which have the highest mass in that distribution:

$\mathrm{Loc}_n(α,δ,b)(γ) ≔ 𝐌^{f,g,ω,τ,b′}_{\substack{(f,g):K^-_{(Ω \overset{H}{→} 𝔹^n×𝔹^*)×(𝔹^n×𝔹^* \overset{H}{→} Ω)} \\ ω:λω:\mathrm{max}^Δ_X\left(λω:Ω.\begin{cases} δ(ω) & \text{if } f(ω)=(b,τ) \\ 0 & \text{otherwise} \end{cases}\right).δ(ω) \\ b′:\mathrm{Uniform}_{𝔹^n} \\ f(ω)=(b,τ) \\ ∀b′′∈𝔹^n:\ γ(b′′)=g(b′′,τ),\ f(γ(b′′))=(b′′,τ)}} \left[\frac{\mathrm{SimilarPasts}_α(ω,g(b′,τ))}{R(g,(b′,τ))+R(f,g(b′,τ))}\right]$

🟡 ritsuko — the intent here is that for any way-to-find-the-blob $f,g$, we only sample the closest matching world-states in time — which does rely on $Ω^→$ having higher mass for world-states that are closer in time. and hopefully, the result is that we pick enough instances of the signed answer blobs located shortly in time after the question blobs, that they're mostly dominated by the human user answering them, rather than AIs appearing later.

🟣 misato — can you disentangle the line where you sample $ω$?

🟡 ritsuko — sure! so, we write an anonymous function $λω:X.δ(ω)$ — a distribution is a function, after all! — taking a parameter $ω$ from the set $X$, and returning $δ(ω)$. so this is going to be a distribution that is just like $δ$, except it's only defined for a subset of $Ω$ — those in $X$.

🟡 ritsuko — in this case, $X$ is defined as such: first, take the set of elements $ω∈Ω$ for which $f(ω)=(b,τ)$. then, apply the distribution $δ$ to all of them, and only keep elements for which they have the most $δ$ (there can be multiple, if multiple elements have the same maximum mass!).

🟡 ritsuko — oh, and i guess $f(ω)=(b,τ)$ is redundant now, i'll erase it. remember that this syntax means "sum over the body for all values of $f,g,ω,τ,b′$ for which these constraints hold…", which means we can totally have the value of $τ$ be bound inside the definition of $ω$ like this — it'll just have exactly one value for any pair of $f$ and $ω$.
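the "keep only the highest-mass matching world-states" step can be sketched for a finite distribution like this. the dict representation, the `predicate` argument (standing in for "$f$ finds the blob in $ω$") and the choice to return an empty result when nothing matches are all assumptions of this example.

```python
def max_mass_subset(delta, predicate):
    """restrict delta to the elements satisfying predicate whose mass is maximal (ties kept)."""
    candidates = {w: m for w, m in delta.items() if predicate(w)}
    if not candidates:
        return {}
    best = max(candidates.values())
    return {w: m for w, m in candidates.items() if m == best}

# toy future distribution: earlier world-states carry more mass
delta = {"t1": 0.5, "t2": 0.25, "t3": 0.25, "t4": 0.125}
# pretend the blob can be found in every state except the earliest one
blob_found_in = {"t2", "t3", "t4"}
closest = max_mass_subset(delta, lambda w: w in blob_found_in)
```

here `t1` has the most mass but does not contain the blob, so the selection falls on `t2` and `t3`, which tie for the highest mass among matching states.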

#### 11. QACI graph

🟢 shinji — why is $QACI$ returning a distribution over answers, rather than picking the single element with the most mass in the distribution?

🟡 ritsuko — that's a good question! in theory, it could be that, but we do want the user to be able to go to the next possible counterfactual answer if the first one isn't satisfactory, and the one after that if that's still not helpful, and so on. for example: in the piece of math which will interpret the user's final result as a math expression, we want to ignore answers which don't parse or evaluate as proper math of the intended type.

🟢 shinji — so the AI is asking the counterfactual past-user-in-time to come up with a good action-scoring function in… however long a question-answer counterfactual interval is.

🟡 ritsuko — let's say about a week.

🟢 shinji — and this helps… how, again?

🟡 ritsuko — well. first, let's posit $\mathrm{EvalMath}_X∈𝔹^*→\{\{x\} \mid x∈X\}∪\{∅\}$, which tries to parse and evaluate a bitstring representing a piece of math (in some pre-established formal language) and returns either:

• what it evaluates to if it is a member of $X$
• an empty set if it isn't a member of $X$ or fails to parse or evaluate

🟡 ritsuko — we then define $\mathrm{EvalMath}^Δ_X∈Δ_Π→X$ as a function that returns the highest-mass element of the distribution for which $\mathrm{EvalMath}_X$ returns a value rather than the empty set. we'll also assume $q′_*∈*→𝔹^{|q|}$, a convenience function which converts any mathematical object into a counterfactual blob in $𝔹^{|q|}$. this isn't really allowed, but it's just for the sake of example here.
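a small sketch of those two functions, with heavy simplifications: python literals parsed by `ast.literal_eval` stand in for the "pre-established formal language", membership in $X$ is passed in as a predicate, and returning `None` when no payload evaluates is a choice made for this example rather than something the definitions specify.

```python
import ast

def eval_math(text: str, is_member) -> set:
    """toy EvalMath: {value} if text parses to a member of X, else the empty set."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        return set()
    return {value} if is_member(value) else set()

def eval_math_delta(dist: dict, is_member):
    """toy EvalMath-delta: value of the highest-mass payload that evaluates to a member of X."""
    for text in sorted(dist, key=dist.get, reverse=True):
        result = eval_math(text, is_member)
        if result:
            return next(iter(result))
    return None

def in_unit_interval(v) -> bool:
    return isinstance(v, (int, float)) and not isinstance(v, bool) and 0 <= v <= 1

# distribution over answer payloads: the most likely one is not valid math at all
answers = {"not math": 0.5, "0.25": 0.3, "7": 0.2}
best = eval_math_delta(answers, in_unit_interval)
```

the most massive payload fails to parse, so the function falls through to the next one that evaluates to a member of the intended set.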

🟣 misato — okay…

🟡 ritsuko — so, let's say the first call is . the user can return any expression, as their action-scoring function — they can return $λa:A.SomeUtilityMeasure(a)$ (a function taking an action $a$ and returning some utility measure over it), but they can also return where $U≔A→[0;1]$ is the set of action-scoring functions. they get to call themselves recursively, and make progress in a sort of time-loop where they pass each other notes.

🟣 misato — right, this is the long-reflection process you mentioned. and about the part where they get a mathematical oracle?

🟡 ritsuko — so, the user can return things like:

$\mathrm{EvalMath}^Δ_U(\mathrm{QACI}(α,γ_q,q′_*(\mathrm{SomeUncomputableQuery}())))$

$\mathrm{EvalMath}^Δ_U(\mathrm{QACI}(α,γ_q,q′_*(\mathrm{Halts}(\mathrm{SomeProgram},\mathrm{SomeInput}))))$

🟣 misato — huh. that's nifty.

🟢 shinji — what if some weird memetic selection effects happen, or what if in one of the QACI intervals, the user randomly gets hit by a truck and then the whole scheme fails?

🟡 ritsuko — so, the user can set up giant acyclic graphs of calls to themselves, providing a lot of redundancy. that way, if any single node fails to return a coherent output, the next nodes can notice this and keep working with their peer's output.

🟡 ritsuko — a small graph of QACI can be accomplished with something like

🟡 ritsuko — and you can imagine how this could be extended to a larger structure such as:

#### 12. fictional agents

🟢 shinji — wait a minute; what if the QACI user goes crazy from realizing that they're fictional, inside an AI's hypothesis, rather than in the physical world-state? especially if they have access to a goddamn math oracle.

🟡 ritsuko — ah. uuh. well, don't panic shinji, but, we are fictional. right now.

🟢 shinji starts flailing around.

🟣 misato — what the hell are you talking about?

🟡 ritsuko — misato, come on. we're inside a crazy high-tech compound designed to build and maintain giant robots to fight alien creatures, and we're discussing the entirely unrelated topic of AI alignment. oh, and, uh, there's this:

🔵 rei appears in front of them and waves at the camera.

🟣 misato — …so, why are we talking about alignment, instead of, i don't know, trying to escape?

🟡 ritsuko — misato, this is fiction, you don't escape. instead, you reason about what agents are writing your fiction. based on prior, we should vaguely guess that they're like us, and that the conversation we're having is useful for them to write a fiction about. my guess is, we're a fiction written by someone who's trying to communicate ideas about AI alignment and, for some godforsaken reason i cannot imagine, decided that we should be the characters having this conversation.

🟡 ritsuko — but don't be mistaken: if we are indeed fictional, then the future of this world we're observing barely matters. our parent world likely holds a lot more realityfluid. so we shou–

🟢 shinji, shaking — here comes this "realityfluid" bullshit again!

🟡 ritsuko — anyways, the coherent thing for fictional characters to do is to try to steer their parent world, because it's generally "realer". and generally, this is achieved by cooperating with whatever their parent world seems to be doing, so we should probably just continue to discuss AI alignment.

🟡 ritsuko — and, in a similar manner, there are people who grok these facts sufficiently well that we would expect them to take the correct actions, even upon realizing that they are the fictional consideration of a superintelligence. given the explanation i just gave, i would claim that i could do it myself, for one.

🔵 rei magically erases everyone's memory of this section of the dialogue, and then disappears, out of view from everyone.

#### 13. top-level QACI call

🟢 shinji — so we have $\mathrm{QACI}(α,γ_q,q′)$, the function which can call itself recursively. what's the top-level, terminal call to it which yields the action-scoring function?

🟡 ritsuko — ah, i think it'd look like:

$\mathrm{QACI}_0(α,γ_q)(u) ≔ 𝐌^{π_r}_{\substack{π_r:\mathrm{QACI}(α,γ_q,q′_0) \\ u∈\mathrm{EvalMath}_U(π_r)}}$

🟡 ritsuko — where $q′_0$ is some initial counterfactual blob, such as the plaintext string "please return a good scoring function over actions" encoded in ASCII, and then padded with zeros to be of the size needed for a blob. $\mathrm{QACI}_0$ has type $Γ_{|q|}→Δ_U$ — from a question location, it returns a distribution of action-scoring functions.
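building such an initial counterfactual blob is simple enough to sketch directly. the function name and the blob size of 1024 bytes are illustrative assumptions; the dialogue only says the text is ASCII-encoded and zero-padded to the blob's size.

```python
def make_counterfactual_blob(text: str, blob_len: int) -> bytes:
    """ASCII-encode text and pad it with zero bytes up to blob_len bytes."""
    data = text.encode("ascii")
    if len(data) > blob_len:
        raise ValueError("text is longer than the blob")
    return data + bytes(blob_len - len(data))

QUESTION = "please return a good scoring function over actions"
q0_prime = make_counterfactual_blob(QUESTION, 1024)
```

the blob has to be exactly as long as the factual question blob it replaces, since counterfactual insertion functions only accept blobs of that length.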

🟣 misato — so like, the counterfactual user inside the $\mathrm{QACI}$ call should be able to return math that calls more $\mathrm{QACI}$, but where do they get the $α$ and $γ_q$?

🟢 shinji — couldn't they return the whole math?

🟡 ritsuko — ah, that's not gonna work — the chance of erroneous blob locations might accumulate too much if each $\mathrm{QACI}$ does a new question location sampling; we want something more reliable. an easy solution is to $\mathrm{EvalMath}$ the text not into a $U$, but into a $Ω×Γ_{|q|}→U$ and to pass it $α,γ_q$ so that the user can return a function which receives those and uses them to call $\mathrm{QACI}$.

🟡 ritsuko — actually, while we're at it, we can pass it a whole lot more things it might need…

$\mathrm{QACI}_0(α,γ_q)(u) ≔ 𝐌^{π_r,f}_{\substack{π_r:\mathrm{QACI}(α,γ_q,q′_0) \\ f∈\mathrm{EvalMath}_{\{q\}×Ω×Γ_{|q|}→U}(π_r) \\ f(q,α,γ_q)=u}}$

🟢 shinji — what's going on with $f$ here?

🟡 ritsuko — oh, this is just a trick of how we implement distributions — when measuring the mass of any specific $u$, we try to $\mathrm{EvalMath}$ the answer payload into a function $f$, and we only count the location when $u$ is equal to $f(q,α,γ_q)$ with useful parameters passed to it.

🟣 misato — what's around $\mathrm{QACI}_0$? where do $α$ and $γ_q$ come from?

🟡 ritsuko — so… remember this?

$\mathrm{Score}(a) ≔ 𝐌^{h}_{h:\mathrm{Prior}}[\mathrm{LooksLikeThisWorld}(h)⋅\mathrm{HowGood}(a,h)]$

🟡 ritsuko — this is where we start actually plugging in our various parts. we'll assume some distribution over initial world-states $Ω_α∈Δ_Ω$ and sample question locations $γ_q$ in futures of those initial world-states — which will serve, for now, as the $LooksLikeThisWorld$.

$\mathrm{Score}(a) ≔ 𝐌^{α,γ_q}_{\substack{α:Ω_α \\ γ_q:\mathrm{Loc}_{|q|}(α,Ω^→_α(α),q)}}[\mathrm{QACI}_0(α,γ_q)(a)]$

🟡 ritsuko — the actual AI we use will be of a type like $U \overset{H}{→} A$, and so we can just call $AI(Score)$, and execute its action guess.

🟣 misato — and… that's it?

🟡 ritsuko — well, no. i mean, the whole fundamental structure is here, but there's still a bunch of work we should do if we want to increase the chances that this produces the outcomes we want.

#### 14. location prior

🟡 ritsuko — so, right now each call to $Loc$ penalizes $f,g$ for being too kolmogorov-complex. we could take advantage of this by encouraging our two different blob locations — the question location and the answer location — to share bits of information, rather than coming up with their own, possibly different bits of information. this increases the chances that the question is located "in a similar way" to the answer.

🟣 misato — what does this mean, concretely?

🟡 ritsuko — well, for example, they could have the same bits of information for how to find bits of memory on a computer's memory on earth, encoded in our physics, and then the two different $Loc$'s $f$ and $g$ functions would only differ in what computer, what memory range, and what time they find their blobs in.

🟡 ritsuko — for this, we'll define a set of "location priors" being sampled as part of the hypothesis that $Score$ samples over — let's call it $Ξ$ (xi). we might as well posit $Ξ≔𝔹*$.

🟡 ritsuko — we'll also define $\widetilde{K}^-_{P,X}:P→Δ_X$ a kolmogorov simplicity measure which can use another piece of information, as, let's see…

$\widetilde{K}^-_{P,X}(p)(x) ≔ K^-_{P×X}(p,x)$

🟡 ritsuko — there we go, measuring the simplicity of the pair of the prior and the element favors information being shared between them.

🟣 misato — wait, this fails to normalize now, doesn't it? because not all of $P×X$ is sampled, only pairs whose first element is $p$.

🟡 ritsuko — ah, you're right! we can simply normalize this distribution to solve that issue.

$K^{-\sim}_{P,X}(p) ≔ \text{Normalize}_X\left(\lambda x:X.\,K^-_{P\times X}(p,x)\right)$
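
to make the normalization concrete, here is a minimal python sketch over a tiny finite set of bitstrings. kolmogorov complexity is uncomputable, so `toy_description_length` is a made-up stand-in (it simply discounts a shared prefix) chosen only so that pairs sharing information count as simpler; the function names and the example prior are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product


def toy_description_length(p, x):
    """hypothetical stand-in for the complexity of the pair (p, x):
    bits shared as a common prefix are only paid for once."""
    shared = 0
    for a, b in zip(p, x):
        if a != b:
            break
        shared += 1
    return len(p) + len(x) - shared


def joint_simplicity(p, x):
    """unnormalized simplicity weight of the pair, 2^(-description length)."""
    return Fraction(1, 2 ** toy_description_length(p, x))


def conditional_simplicity(p, xs):
    """fix p, weigh every x by the joint weight, then normalize over xs."""
    weights = {x: joint_simplicity(p, x) for x in xs}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


xs = ["".join(bits) for bits in product("01", repeat=3)]
dist = conditional_simplicity("110", xs)
```

with the prior `"110"`, the element `"110"` (which shares everything with the prior) ends up with more mass than `"000"` (which shares nothing), and the masses sum to one because of the explicit normalization step.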

🟡 ritsuko — and in $Score$ we'll simply add $\xi : K^-_\Xi$ and then pass $\xi$ around to all blob locations:

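$\begin{array}{l}\text{Score}(u) ≔ \underset{\alpha,\xi,\gamma_q}{\mathbf{M}}\left[\text{QACI}_0(\alpha,\gamma_q,\xi)(u)\right]\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^-_\Xi\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\alpha),q,\xi)\end{array}$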

$\text{QACI}_0 \in \Omega \times \Gamma_{|q|} \times \Xi \to \Delta_U$

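$\begin{array}{l}\text{QACI}_0(\alpha,\gamma_q,\xi)(u) ≔ \underset{\pi_r,f}{\mathbf{M}}\\ \quad\pi_r:\text{QACI}(\alpha,\gamma_q,q'_0,\xi)\\ \quad f\in\text{EvalMath}_{\{q\}\times\Omega\times\Gamma_{|q|}\times\Xi\to U}(\pi_r)\\ \quad f(q,\alpha,\gamma_q,\xi)=u\end{array}$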

🟡 ritsuko — finally, we'll use it in $Loc$ to sample $f,g$ from:

$\text{Loc}_n \in \Omega \times \Delta_\Omega \times \mathbb{B}^n \times \Xi \to \Delta_{\Gamma_n}$

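$\begin{array}{l}\text{Loc}_n(\alpha,\delta,b,\xi)(\gamma) ≔ \underset{f,g,\omega,\tau,b'}{\mathbf{M}}\left[\dfrac{\text{SimilarPasts}_\alpha(\omega,g(b',\tau))}{R(g,(b',\tau))+R(f,g(b',\tau))}\right]\\ \quad(f,g):K^{-\sim}_{\Xi,\,(\Omega\overset{H}{\to}\mathbb{B}^n\times\mathbb{B}^*)\times(\mathbb{B}^n\times\mathbb{B}^*\overset{H}{\to}\Omega)}(\xi)\\ \quad\omega:\lambda\omega:{\max}^X_\Delta\left(\lambda\omega:\Omega.\begin{cases}\delta(\omega)&\text{if }f(\omega)=(b,\tau)\\0&\text{otherwise}\end{cases}\right).\,\delta(\omega)\\ \quad b':\text{Uniform}_{\mathbb{B}^n}\\ \quad\forall b''\in\mathbb{B}^n:\gamma(b'')=g(b'',\tau)\\ \quad\phantom{\forall b''\in\mathbb{B}^n:{}}f(\gamma(b''))=(b'',\tau)\end{array}$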

🟣 misato — here's an issue: currently in $Score$, we're weighing hypotheses by how hard it is to find both the question and the answer.

🟡 ritsuko — do you think that's wrong?

🟣 misato — i think we should first ask for how hard it is to find questions, and then normalize the distribution of answers, so that harder-to-find answers don't penalize hypotheses. the reasoning behind this is that we want QACI graphs to be able to do a lot of complicated things, and that we hope question location is sufficient to select what we want already.

🟡 ritsuko — ah, that makes sense, yeah! thankfully, we can just normalize right around the call to $QACI_0$, before applying it to $u$:

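$\begin{array}{l}\text{Score}(u) ≔ \underset{\alpha,\xi,\gamma_q}{\mathbf{M}}\left[\text{Normalize}_U(\text{QACI}_0(\alpha,\gamma_q,\xi))(u)\right]\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^-_\Xi\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\alpha),q,\xi)\end{array}$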
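
the effect of this adjustment can be seen in a small python sketch with two toy hypotheses. the hypothesis table, the labels `u_good` and `u_bad`, and the specific masses are all invented for illustration; the only thing being shown is that, once answers are normalized per hypothesis, a hypothesis whose answer is hard to locate still contributes its full question-location mass.

```python
from fractions import Fraction


def normalize(dist):
    """rescale a finite mass function so that its values sum to one."""
    total = sum(dist.values())
    return {key: value / total for key, value in dist.items()}


# toy hypotheses: (question-location mass, unnormalized mass over answers)
HYPOTHESES = [
    (Fraction(1, 2), {"u_good": Fraction(1, 100)}),  # answer is hard to locate
    (Fraction(1, 2), {"u_bad": Fraction(1, 2)}),     # answer is easy to locate
]


def score(hypotheses, normalize_answers):
    """sum, over hypotheses, of question mass times (optionally normalized) answer mass."""
    totals = {}
    for question_mass, answers in hypotheses:
        answer_dist = normalize(answers) if normalize_answers else answers
        for u, mass in answer_dist.items():
            totals[u] = totals.get(u, 0) + question_mass * mass
    return totals


raw = score(HYPOTHESES, normalize_answers=False)
adjusted = score(HYPOTHESES, normalize_answers=True)
```

without normalization, `u_good` is drowned out purely because its answer was harder to find; with normalization, both hypotheses weigh in according to their question-location mass alone.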

🟢 shinji — what happens if we don't get the blob locations we want, exactly?

🟡 ritsuko — well, it depends. there are two kinds of "blob mislocations": "naive" and "adversarial" ones. naive mislocations are hopefully not a huge deal; considering that we're doing average scoring over all scoring functions weighed by mass, hopefully the "signal" from our aligned scoring functions beats out the "noise" from locations that select the wrong thing at a random place, like "boltzmann blobs".

🟡 ritsuko — adversarial blobs, however, are tougher. i expect that they mostly result from unfriendly alien superintelligences, as well as earth-borne AI, both unaligned ones and ones that might result from QACI. against those, i hope that inside QACI we come up with some good decision theory that lets us not worry about that.

🟣 misato — actually, didn't someone recently publish some work on a threat-resistant utility bargaining function, called "Rose"?

🟡 ritsuko — oh, nice! well in that case, if $Rose$ is of type $\Delta_U \to U$, then we can simply wrap it around all of $Score$:

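$\begin{array}{l}\text{Score} ≔ \text{Rose}\left(\lambda u:U.\,\underset{\alpha,\xi,\gamma_q}{\mathbf{M}}\left[\text{Normalize}_U(\text{QACI}_0(\alpha,\gamma_q,\xi))(u)\right]\right)\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^-_\Xi\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\alpha),q,\xi)\end{array}$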

🟡 ritsuko — note that we're putting the whole thing inside an anonymous $λ$-function, and assigning to $Score$ the result of applying $Rose$ to that distribution.
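
the wrapping itself is just function plumbing, which a short python sketch can show. note the loud assumption: `placeholder_rose` below is a plain mass-weighted average and is NOT the actual Rose bargaining rule; it only has the same type shape (a distribution over utility functions in, a single utility function out). the two toy utility functions and their masses are also made up, and the distribution is represented as a list of (mass, utility) pairs rather than as a function from utilities to mass.

```python
from fractions import Fraction


def u_cautious(action):
    """toy utility function that rates every action the same."""
    return Fraction(1, 2)


def u_bold(action):
    """toy utility function that only likes act_b."""
    return Fraction(1, 1) if action == "act_b" else Fraction(0, 1)


# a finite distribution over utility functions, as (mass, utility) pairs
distribution_over_u = [(Fraction(1, 4), u_cautious), (Fraction(3, 4), u_bold)]


def placeholder_rose(dist):
    """assumed stand-in of type (distribution over U) -> U; not the real Rose."""
    def aggregated(action):
        return sum(mass * utility(action) for mass, utility in dist)
    return aggregated


# Score is the single utility function obtained by applying the aggregator
score = placeholder_rose(distribution_over_u)
```

whatever the real aggregator does internally, the resulting `score` is an ordinary utility function over actions, which is what the AI expects to receive.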

#### 16. observations

🟢 shinji — you know, i feel like there ought to be some better ways to select hypotheses that look like our world.

🟡 ritsuko — hmmm. you know, i do feel like if we had some "observation" bitstring $\mu \in \mathbb{B}^*$ (mu) which strongly identifies our world, like a whole dump of wikipedia or something, that might help — something like $\gamma_\mu : Loc_{|\mu|}(\alpha,\Omega_\alpha^{\to}(\alpha),\mu,\xi)$. but how do we tie that into the existing set of variables serving as a sampling?

🟣 misato — we could look for the question $q$ in futures of the observation world-state… how do we get that world-state again?

🟡 ritsuko — oh, if you've got $\gamma_\mu$ you can reconstitute the factual observation world-state with $\gamma_\mu(\mu)$.

🟣 misato — in that case, we can just do:

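$\begin{array}{l}\text{Score} ≔ \text{Rose}\left(\lambda u:U.\,\underset{\alpha,\xi,\gamma_\mu,\gamma_q}{\mathbf{M}}\left[\text{Normalize}_U(\text{QACI}_0(\alpha,\gamma_q,\xi))(u)\right]\right)\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^-_\Xi\\ \quad\gamma_\mu:\text{Loc}_{|\mu|}(\alpha,\Omega_\alpha^{\to}(\alpha),\mu,\xi)\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\gamma_\mu(\mu)),q,\xi)\end{array}$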

🟡 ritsuko — oh, neat! actually, couldn't we generate two blobs and sandwich the question blob between the two?

🟣 misato — let's see here, the second observation can be $\mu_2$:

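$\begin{array}{l}\text{Score} ≔ \text{Rose}\left(\lambda u:U.\,\underset{\alpha,\xi,\gamma_{\mu_1},\gamma_{\mu_2},\gamma_q}{\mathbf{M}}\left[\text{Normalize}_U(\text{QACI}_0(\alpha,\gamma_q,\xi))(u)\right]\right)\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^-_\Xi\\ \quad\gamma_{\mu_1}:\text{Loc}_{|\mu_1|}(\alpha,\Omega_\alpha^{\to}(\alpha),\mu_1,\xi)\\ \quad\gamma_{\mu_2}:\text{Loc}_{|\mu_2|}(\alpha,\Omega_\alpha^{\to}(\gamma_{\mu_1}(\mu_1)),\mu_2,\xi)\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\gamma_{\mu_1}(\mu_1)),q,\xi)\end{array}$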

🟣 misato — how do i sample the $\gamma_q$ location from both the future of $\gamma_{\mu_1}$ and the past of $\gamma_{\mu_2}$?

🟡 ritsuko — well, i'm not sure we want to do that. remember that $Loc$ tries to find the very first matching world-state for any $f,g$. instead, how about this:

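$\begin{array}{l}\text{Score} ≔ \text{Rose}\left(\lambda u:U.\,\underset{\alpha,\xi,\gamma_{\mu_1},\gamma_{\mu_2},\gamma_q}{\mathbf{M}}\left[\text{Normalize}_U(\text{QACI}_0(\alpha,\gamma_q,\xi))(u)\right]\right)\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^-_\Xi\\ \quad\gamma_{\mu_1}:\text{Loc}_{|\mu_1|}(\alpha,\Omega_\alpha^{\to}(\alpha),\mu_1,\xi)\\ \quad\gamma_{\mu_2}:\text{Loc}_{|\mu_2|}(\alpha,\Omega_\alpha^{\to}(\gamma_{\mu_1}(\mu_1)),\mu_2,\xi)\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\gamma_{\mu_2}(\mu_2)),q,\xi)\\ \quad\Omega_\alpha^{\to}(\gamma_q(q))(\gamma_{\mu_2}(\mu_2))>\Omega_\alpha^{\to}(\gamma_{\mu_2}(\mu_2))(\gamma_q(q))\end{array}$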

🟡 ritsuko — it's a bit hacky, but we can simply demand that "the $\mu_2$ world-state be in the future of the $q$ world-state more than the $q$ world-state is in the future of the $\mu_2$ world-state".

🟣 misato — huh. i guess that's… one way to do it.

🟢 shinji — could we encourage the blob location prior to use the bits of information from the observations? something like…

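$\begin{array}{l}\text{Score} ≔ \text{Rose}\left(\lambda u:U.\,\underset{\alpha,\xi,\gamma_{\mu_1},\gamma_{\mu_2},\gamma_q}{\mathbf{M}}\left[\text{Normalize}_U(\text{QACI}_0(\alpha,\gamma_q,\xi))(u)\right]\right)\\ \quad\alpha:\Omega_\alpha\\ \quad\xi:K^{-\sim}_{\mathbb{B}^*\times\mathbb{B}^*,\Xi}(\mu_1,\mu_2)\\ \quad\gamma_{\mu_1}:\text{Loc}_{|\mu_1|}(\alpha,\Omega_\alpha^{\to}(\alpha),\mu_1,\xi)\\ \quad\gamma_{\mu_2}:\text{Loc}_{|\mu_2|}(\alpha,\Omega_\alpha^{\to}(\gamma_{\mu_1}(\mu_1)),\mu_2,\xi)\\ \quad\gamma_q:\text{Loc}_{|q|}(\alpha,\Omega_\alpha^{\to}(\gamma_{\mu_2}(\mu_2)),q,\xi)\\ \quad\Omega_\alpha^{\to}(\gamma_q(q))(\gamma_{\mu_2}(\mu_2))>\Omega_\alpha^{\to}(\gamma_{\mu_2}(\mu_2))(\gamma_q(q))\end{array}$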

🟡 ritsuko — nope. because then, $Loc$'s $f$ programs can simply return the observations as constants, rather than finding them in the world, which defeats the entire purpose.

🟣 misato — …so, what's in those observations, exactly?

🟡 ritsuko — well, $\mu_2$ is mostly just going to be $\mu_1$ with "more, newer content". but the core of it, $\mu_1$, could be a whole lot of stuff. a dump of wikipedia, a callable of some LLM, whatever else would let it identify our world.

🟢 shinji — can't we just, like, plug the AI into the internet and let it gain data that way or something?

🟡 ritsuko — so there are, like, obvious security concerns here. but, assuming those were magically fixed, i can see a way to do that: $\mu_1$ could be a function or mapping rather than a bitstring, and while the AI would observe it as a constant, it could be lazily evaluated. for instance, $Fetch(Url)$ could be a fully memoized function — such that the AI can't observe any mutable state — but it would still point to the world. in essence, this would make the AI point to the entire internet as its observation, though of course it would in practice be unable to obtain all of it. but it could navigate it just as if it was a mathematical object.
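
the memoization idea can be sketched in a few lines of python. to keep the example self-contained and offline, the "internet" is a made-up in-memory dictionary called `_backend`, and the url is a placeholder; a real version would have to deal with the security concerns mentioned above, which this sketch does not attempt.

```python
from functools import lru_cache

# hypothetical stand-in for the mutable outside world (no real network access)
_backend = {"https://example.org/a": "version 1"}


@lru_cache(maxsize=None)
def fetch(url):
    """lazily look a url up once, then always return that same first result."""
    return _backend[url]


first = fetch("https://example.org/a")

# the outside world changes after the first observation...
_backend["https://example.org/a"] = "version 2"

# ...but the memoized function keeps returning what it saw the first time
second = fetch("https://example.org/a")
```

from the caller's point of view, `fetch` behaves like a fixed mapping from urls to contents: entries are filled in on demand, but once filled they never change.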

🟣 misato — interesting. though of course, the security concerns make this probably unviable.

🟡 ritsuko — hahah. yeah. oh, and we probably want to pass $\mu_1,\mu_2$ inside $QACI_0$:

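$\begin{array}{l}\text{QACI}_0(\alpha,\gamma_q,\xi)(u) ≔ \underset{\pi_r,f}{\mathbf{M}}\\ \quad\pi_r:\text{QACI}(\alpha,\gamma_q,q'_0,\xi)\\ \quad f\in\text{EvalMath}_{\{q\}\times\{\mu_1\}\times\{\mu_2\}\times\Omega\times\Gamma_{|q|}\times\Xi\to U}(\pi_r)\\ \quad f(q,\mu_1,\mu_2,\alpha,\gamma_q,\xi)=u\end{array}$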

#### 17. where next

🟣 misato — so, is that it then? are we done?

🟡 ritsuko — hardly! i expect that there's a lot more work to be done. but this is a solid foundation, and direction to explore. it's kind of the only thing that feels like a path to saving the world.

🟢 shinji — you know, the math can seem intimidating at first, but actually it's not that complicated. one can figure out this math, especially if they get to ask questions in real time to the person who invented that math.

🟡 ritsuko — for sure! it should be noted that i'm not particularly qualified at this. my education isn't in math at all — i never really did math seriously before QACI. the only reason why i'm making the QACI math is that so far barely anyone else will. but i've seen at least one other person try to learn about it and come to understand it somewhat well.

🟢 shinji — what are some directions which you think are worth exploring, for people who want to help improve QACI?

🟡 ritsuko — oh boy. well, here are some:

• find things that are broken about the current math, and ideally help fix them too.
• think about utility function bargaining more — notably, perhaps scores are regularized, such as maybe by weighing ratings that are more "extreme" (further away from $\frac{1}{2}$) as less probable. alternatively, maybe scoring functions have a finite amount of "votestuff" that they get to distribute amongst all options the way a normalizing distribution does, or maybe we implement something kinda like quadratic voting?
• think about how to make a lazily evaluated observation viable. i'm not sure about this, but it feels like the kind of direction that might help avoid unaligned alien AIs capturing our locations by bruteforcing blob generation using many-worlds.
• generally figure out more ways to ensure that the blob locations match the world-states we want — both by improving $Loc$ and $Sign$, and by finding more clever ways to use them — you saw how easy it was to add two blob locations for the two observations $\mu_1,\mu_2$.
• think about turning this scheme into a continuous rather than one-shot AI. (possibly exfohazardous, do not publish)
• related to that, think about ways to make the AI aligned not just with regards to its guess, but also with regards to its side-effects, so as to avoid it wanting to exploit its way out. (possibly exfohazardous, do not publish)
• alternatively, think about how to box the AI so that the output with regards to which it is aligned is its only meaningful source of world-steering.
• one thing we didn't get into much is what could actually be behind $\Omega$, $\Omega^{\to}$, and $SimilarPasts$. you can read more about those here, but i don't have super strong confidence in the way they're currently put together. in particular, it would be great if someone who groks physics a lot more than me thought about whether many-worlds gives unaligned alien superintelligences the ability to forge any blob or observation we could put together in a way that would capture our AI's blob location.
• maybe there are some ways to avoid this by tying the question world-state with the AI's action world-state? maybe implementing embedded agency helps with this? note that blob location can totally locate the AI's action, and use that to produce counterfactual action world-states. maybe that is useful. (possibly exfohazardous, do not publish)
• think about $Sign$ and the $ExpensiveHash$ function (see the full math post) and how to either implement it or achieve a similar effect otherwise. for example, maybe instead of relying on an expensive hash, we can formally define that $f,g$ need to be "consequentialist agents trying to locate the blob in the way we want", rather than any program that works.
• think about how to make counterfactual QACI intervals resistant to someone launching unaligned superintelligence within them.

🟣 misato — ack, i didn't really think of that last one. yeah, that sounds bad.

🟡 ritsuko — yup. in general, i could also do with people who could help with inner-alignment-to-a-formal-goal, but that's a lot more hazardous to work on. hence why we have not talked about it. but there is work to be done on that front, and people who think they have insights should probly contact us privately and definitely not publish them. interpretability people are doing enough damage to the world as it is.

🟢 shinji — well, things don't look great, but i'm glad this plan is around! i guess it's something.

🟡 ritsuko — i know right? that's how i feel as well. lol.

🟣 misato — lmao, even.

unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor AI.