(this post, cross-posted on lesswrong, has been written for the third Refine blog post day)
ordering capability thresholds
given an AI which is improving towards ever more capabilities, such as by way of recursive self-improvement, in what order will it pass the following points?
throughout this post i'll be using PreDCA as an example of a formal goal to be maximized, because it appears to me as a potentially promising direction; but you can imagine adapting this post to other formal goals such as insulated goal-programs, or other alignment strategies altogether. we can even use this time-ordering framework to compare the various thresholds of multiple alignment strategies, though i won't do that here.
- Start: we start the AI
- Math: it can figure out relatively complicated math, such as whether P equals PSPACE, or whether this world looks like it has finite compute if we can make it do physics.
- PreDCA: it can figure out what is entailed in maximizing PreDCA; notably, that this goal is best served by not destroying the earth too much
- sub-PreDCA: it can figure out some individual parts of PreDCA, such as the identity of the user or what is entailed in maximizing a human's utility function, in a way that we can use to modify those parts if they need adjusting
- Escape: it becomes able to escape the environment over which we have control — and typically starts replicating across the internet
- Influence: it gets the ability to significantly influence the timeline, for example enough to save us from facebook destroying everything six months later
- DSA: it achieves decisive strategic advantage
- Doom: it becomes capable of destroying the earth too much (without necessarily using that capability)
- Cone: it takes over a significant portion of the universe, or at least of the lightcone
with a few notes:
- "decisive strategic advantage" is a term i'm taking from bostrom's superintelligence book, describing the point at which an AI has sufficiently ensured its continuation that we can't turn it off or change its goals anymore; it is effectively the point of no return.
- by "destroying the earth too much" i mean destroying so much of earth that it can't reasonably be resimulated. if resimulating earth is too unethical, computationally costly, or anthropically costly, then "destroying too much of earth" might straightforwardly mean destroying all of humankind or something like that. note that for PreDCA, preserving earth in some way is important not just because it's pretty bad that we all die, but also because the AI might need to preserve its user and possibly their environment in order to figure out their utility function.
- in the case of knowing mathematical statements (Math, PreDCA, and sub-PreDCA), i imagine the AI being pretty sure about them, not necessarily having proven them. in addition, for simplicity, i'm assuming that we can use the AI to figure out some mathematical fact if and only if the AI can figure it out for itself — in practice, this need not be the case.
one thing that can be noticed is that humans might serve as evidence. for example, we can examine history to figure out whether we passed Math or would've been able to pass PreDCA (given a reasonable description of it) before getting to Doom — my guess is yes at least for that latter one.
now, we can reasonably guess the following pieces of ordering, where as usual in ordering graphs X → Y means X < Y and transitive edges are not shown.
in addition, for any two quantities X < Y, it can be the case that they're pretty close in time (X ≈ Y), or it can be that there's a bunch of time between them (X ≪ Y). whether the threshold between those two possibilities is more like a day or a year is gonna depend on context.
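as a minimal sketch of these relations (not something taken from PreDCA or from any existing library), here is one way to classify a pair of thresholds in python. it assumes each threshold gets a hypothetical time in days, and that `gap` is the context-dependent cutoff between "close" and "far apart". the names `compare`, `times`, `gap` and the example numbers are all made up for illustration.

```python
from typing import Dict


def compare(times: Dict[str, float], x: str, y: str, gap: float) -> str:
    """Classify how thresholds x and y relate in time.

    Returns:
      "close"       -- x and y are within `gap` of each other (X ≈ Y)
      "much_before" -- x precedes y by more than `gap`        (X ≪ Y)
      "much_after"  -- x follows y by more than `gap`         (Y ≪ X)
    """
    delta = times[y] - times[x]
    if abs(delta) <= gap:
        return "close"
    return "much_before" if delta > 0 else "much_after"


# Hypothetical timeline, in days since Start. These numbers are
# illustrative assumptions, not predictions.
example_times = {"Start": 0.0, "Math": 30.0, "Escape": 400.0, "Doom": 401.0}

# With a one-week cutoff, Escape and Doom count as close,
# while Math comes long before Escape.
print(compare(example_times, "Escape", "Doom", gap=7.0))
print(compare(example_times, "Math", "Escape", gap=7.0))
```

changing `gap` from a week to a year can flip a pair from "much_before" to "close", which is the sense in which the cutoff depends on context.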
depending on how the rest of the ordering graph turns out and how close pairs of subsequent events are in time, we can be in a variety of situations:
- if PreDCA ≪ Influence we may get to see how PreDCA will work out, and adjust it a lot if needed. if Influence < PreDCA ≪ DSA, then the timeline might have started diverting a bunch by then, but we can still adjust the AI. if instead DSA < PreDCA then we have to hope that the complete PreDCA indeed produces good worlds.
- in a similar way, if sub-PreDCA ≪ Influence or at least Influence < sub-PreDCA ≪ DSA, then we get to test some individual parts of PreDCA on their own — otherwise, it better be correct.
- if Doom < PreDCA, or worse if Doom < sub-PreDCA, then even if the goal we programmed the AI with does actually aim at good worlds, our survival is not guaranteed; and we might only get a much weaker form of eventual alignment where the AI later says "oops i destroyed everything" and then tries to vaguely realize a utility function it has only limited information about.
- if Math ≪ Escape or at least Math ≪ DSA, then we might get to ask questions that help us figure out the alignment landscape better, such as whether earth is resimulable in reasonable time by a non-quantum program, or whether there is infinite compute.
- i expect that Escape ≈ Doom; that is, i expect that once it escapes its initial environment, the cat's out of the bag and we quickly lose control of the timeline, and then get killed if the AI is not aligned already. but the world might put up a fight (Influence ≪ DSA), or we might get some time to enjoy the show (DSA ≪ Doom).
- if Influence ≪ Escape then we get to have it steer the timeline in hopefully good directions while it's still in our control, though it's not necessarily going to be easy to determine whether the influence it's having is good or bad. if Escape < Influence ≪ DSA, then we might get a "warning shot" situation, where we get to see the world significantly changed and nevertheless still have some chance of stopping the AI; the desirability and consequences of doing that depend on the AI's alignment curve. DSA ≈ Influence is what "AI takes control overnight" looks like; DSA ≪ Influence is the AI taking control of the world without us realizing, only to start utilizing that power to visibly change the world afterwards, as in "biding its time" scenarios.
- i'm hopeful that we can make it so that Start ≪ Escape by building a reasonably boxed environment, but if it fooms very fast and figures out deception/blackmail then software-boxing it isn't going to help much.
- Start ≈ Influence represents very fast takeoff scenarios where we barely get to look at what's going on before the AI has started significantly altering the world.
- whether sub-PreDCA ≈ PreDCA or sub-PreDCA ≪ PreDCA will determine if PreDCA is to be tested in its entirety, or whether there's a chance we can test its individual parts before putting the whole thing together. but as long as PreDCA < Influence or at least PreDCA < DSA, then it's fine if sub-PreDCA ≈ PreDCA, because we can still test the whole thing.
- if either DSA < Math ≪ Doom or DSA < sub-PreDCA ≪ Doom, then our fate is locked in when DSA is passed and we can't do anything about it anymore, but i guess at least we might get to know some information about where we're headed.
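to make the case analysis above a bit more concrete, here is a rough python sketch of just the first item in the list (the three PreDCA cases). it assumes, hypothetically, that each threshold is given a time in days and that `gap` is the cutoff separating ≈ from ≪. the function names and returned strings are my own illustration, and orderings the list doesn't explicitly discuss are reported as not covered rather than guessed at.

```python
from typing import Dict


def before(times: Dict[str, float], x: str, y: str) -> bool:
    """x < y: x is passed strictly before y."""
    return times[x] < times[y]


def much_before(times: Dict[str, float], x: str, y: str, gap: float) -> bool:
    """x ≪ y: x is passed before y by more than `gap`."""
    return times[y] - times[x] > gap


def predca_situation(times: Dict[str, float], gap: float) -> str:
    """Map a hypothetical timeline onto the three PreDCA cases."""
    # PreDCA ≪ Influence
    if much_before(times, "PreDCA", "Influence", gap):
        return "we can watch how PreDCA works out and adjust it"
    # Influence < PreDCA ≪ DSA
    if before(times, "Influence", "PreDCA") and much_before(
        times, "PreDCA", "DSA", gap
    ):
        return "timeline already diverting, but we can still adjust the AI"
    # DSA < PreDCA
    if before(times, "DSA", "PreDCA"):
        return "we have to hope the complete PreDCA produces good worlds"
    # Anything else (e.g. PreDCA ≈ Influence) is not spelled out.
    return "not covered by the cases listed"


# Hypothetical timeline in days; the numbers are assumptions.
example = {"Influence": 10.0, "PreDCA": 50.0, "DSA": 200.0}
print(predca_situation(example, gap=7.0))
```

the other items in the list could be encoded the same way, one function per item, each reading off only the pairs of thresholds that item mentions.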
finally, some claims that i strongly disbelieve in can still be expressed within this capabilities ordering framework, such as E ≪ D or that, given a theoretical maximum level of AI capability Max, Max < Doom or even Max < DSA.