In this post, I propose a plan for addressing superintelligence-based risks.
Before I say anything else, I will mention a crucial point that a bunch of people have ignored despite it being addressed at the bottom of this post: the idea I describe here is very unlikely to work. I'm proposing it over other plans because I feel other plans are extremely unlikely to work (see also this post). Yes, we probably can't do this in time. That doesn't make it not our best shot. Rationalists select the best plan, not "the first plan, and then a new plan only if it seems good enough".
(spoilers for the premise of Orthogonal by Greg Egan) In Orthogonal, a civilization facing annihilation comes up with a last-minute plan: to create a ship, accelerate it until its time arrow is orthogonal to the time arrow of its home world (which is possible thanks to the alternate physics of their world), and thus give its crew as much time as it needs to figure out how to save their homeworld before reversing course and coming back. This plan is inspired by that, and I'm naming this post after their ship, the Peerless.
The short version is: we design a simulation for a bunch of people (probably rationalists) to live in and figure out alignment with as much time as they need, and create a superintelligence whose sole goal is to run that simulation and implement a new goal it will eventually decide on. I've written about this idea previously, but that post is not required reading; this is a more fleshed-out view.
I will be describing the plan in three steps.
We need virtual persons inside this world. They will be the ones who figure out alignment. A few possibilities come to my mind; there may be more.
The main risk that's been brought to my attention regarding this part is the following: what if the virtual persons end up unaligned from their previous selves? The brain-scan scenario seems the most likely to carry that risk, but even then I'm not too worried about it; intuitively, it seems unlikely enough to me that all the uploaded persons would come out misaligned in a similar direction, and one that would lead them to decide on a botched alignment for the superintelligence.
An obvious question here is: who gets to be on board the simulation? The values of the people who get uploaded might significantly affect what the superintelligence is aligned to (not all humans necessarily have the same values, maybe even after thinking about it really hard for a long time). I don't have any answers other than the obvious "me please!" and "my tribe please!".
Note that I'm not proposing augmenting the uploaded minds — at least not for the first simulation iteration (see below). That does seem like an exceedingly risky prospect, alignment-wise, and one we don't need to commit to right away.
Those persons will live in a virtual environment, within which they'll hopefully figure out alignment. However, the environment needs to be a deterministic computation, such that the "outer" superintelligence (the one running the virtual environment) has no ability to affect its outcome; its goal will only be to "implement whatever this computation decides". If the superintelligence wants to implement the actual result of the actual computation, and that computation is fully deterministic, (and if we don't simulate anything complex enough for that superintelligence to "leak back in"), then it has no room to meddle with what we do in it! It's stuck running us until we decide on something.
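To make the "no room to meddle" point concrete, here is a minimal, purely illustrative sketch (all names are hypothetical, not a real design): the outer agent's goal is fixed as "implement the output of this specific deterministic computation", so its only available action is to evaluate that computation faithfully.

```python
# Hypothetical sketch: the outer superintelligence's goal is defined as
# "implement whatever this fixed deterministic computation outputs".

def simulation_step(state):
    # A pure, deterministic transition function: the same input state
    # always produces the same output state. No outside input reaches it.
    counter, decision = state
    if counter >= 3:  # stand-in for "the crew has decided"
        return (counter, "final goal: <whatever the crew settled on>")
    return (counter + 1, decision)

def run_until_decision(initial_state):
    # The outer agent just evaluates the computation to completion;
    # because every step is deterministic, it has no lever to pull.
    state = initial_state
    while state[1] is None:
        state = simulation_step(state)
    return state[1]

goal = run_until_decision((0, None))
print(goal)  # the only thing the outer agent is built to implement
```

The key property is that `run_until_decision` is a closed function of its initial state: any two faithful evaluations agree, so "implement the result" picks out exactly one goal regardless of what the evaluator wants.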
Some things we need to figure out include:
The people inside this simulation will have somewhere between plenty and infinite time and compute to figure out alignment. If they do have infinite compute, and if the cosmos isn't full of consequentialists competing for earliness in the universal distribution (or other things that might make wasting compute bad), then we can even run exponential-or-longer computations in what is, from our perspective, instant time; we just need to be sure we don't run anything malign and unbounded — although the risks of running malign stuff might be mitigated by the computations being fully and provably sandboxable, and by our ability to shut them down whenever we want, as long as they don't get to output enough to convince us not to. After all, there may be bits of information that are the result of very large malign-dominated computations but can nevertheless still be of use to us.
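The "can't output enough to convince us" condition can be pictured as a hard cap on the output channel. A toy sketch, under the assumption that we can meter both steps and output bytes (nothing here is a real security boundary; the names are illustrative):

```python
# Illustrative sketch: run an untrusted computation under a hard step
# budget, and cap how many bytes it can ever show us. A tiny channel can
# carry a useful answer ("yes"/"no", a short hash) but not a long
# persuasive argument.

def run_sandboxed(untrusted_step, state, max_steps, max_output_bytes):
    """untrusted_step: deterministic state -> (state, done, output)."""
    for _ in range(max_steps):
        state, done, output = untrusted_step(state)
        if done:
            return output[:max_output_bytes]  # truncate, no exceptions
    return None  # budget exhausted: we shut it down, no output at all

# Toy untrusted computation that tries to emit a long persuasive essay.
def chatty(n):
    if n < 5:
        return n + 1, False, b""
    return n, True, b"yes" + b" and let me explain at length..." * 100

print(run_sandboxed(chatty, 0, max_steps=10, max_output_bytes=3))
```

The point of the sketch is only the shape of the interface: the untrusted code gets no channel other than the truncated return value, and exhausting the budget yields nothing at all.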
I mentioned before that maybe only slow computers are available; running a "very large" computation might require a majority vote or something like it. Or we can just boot without any computers at all and spend the first few millennia designing slow computers that are actually safe, and then work from there — when we have all the time we want, and maybe-infinite potential compute, a lot of options open up.
One downside is that we will be "flying blind". The outer superintelligence will gleefully turn the cosmos into computronium to ensure it can run us, and will be genociding everything back meat-side, in our reachable universe — or beyond, if for example physics is hackable, as Wolfram suggests might be possible. The superintelligence might even do that first, and then boot our simulation. Hopefully, if we want to, we can resimulate-and-recover aliens we've genocided after we've solved alignment, just like hopefully we can resimulate-and-recover the rest of earth; but from inside the simulation we won't be able to get much information, at least in the first iteration. We can, however, end our iteration by agreeing on a new iteration that has some carefully-designed access to outside information, if we think we can safely do that; but nothing guarantees that there will still be anything salvageable outside.
Another way to model "successive simulation iterations, each deterministic, but each having the ability to make the next one not deterministic with a large enough vote" is as a single simulation that isn't quite deterministic, but made of large deterministic chunks separated by small controlled I/O accesses; think of it as a Haskell computation that lazily evaluates everything right up until it waits for an input, and then, as soon as it has that input, continues computing.
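The picture above can be sketched with a coroutine: the simulation runs as a pure computation until it explicitly suspends on a request, receives one approved input, then resumes deterministically. (A minimal illustration of the control flow, not a design; the Haskell analogy maps onto Python generators here, and all names are made up.)

```python
# Sketch of "deterministic chunks separated by controlled I/O":
# each chunk is pure; the only nondeterminism enters at explicit
# yield points, where the outside (e.g. a vote) supplies one input.

def simulation():
    # Chunk 1: fully deterministic work.
    knowledge = ["alignment theory v1"]
    # Controlled I/O point: suspend and ask for one piece of outside data.
    observation = yield "request: one telescope reading, please"
    # Chunk 2: deterministic again, now incorporating the approved input.
    knowledge.append(observation)
    yield f"resumed with {len(knowledge)} items of knowledge"

sim = simulation()
request = next(sim)  # runs chunk 1, stops at the I/O point
print(request)
reply = sim.send("telescope: sky still there")  # the vote-gated input
print(reply)
```

Between yield points the trajectory is a pure function of the state plus the inputs granted so far, which is exactly the property that keeps the outer superintelligence out of the loop except at the sanctioned access points.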
Still, the current outlook is that we genocide everything including ourselves. Even if nothing else is recoverable, "a tiny human population survives and eventually repopulates" still seems like a better plan than the current expected outcome of "everything dies forever."
Now, this is the "easy part": just make a superintelligence that destroys everything to implement its one simple goal; except instead of paperclips, the simple goal is "implement whatever goal is the result of this very big Turing machine".
We can either build and start that superintelligence as soon as we can, or keep it ready while we stay on our regular world. I'd probably advocate for the former just to be safe, but it can depend on your beliefs about quantum immortality, S-risks, and such. In any case, having something that might work ready to fire is certainly better than the current default of "we just die".
Of course, it is crucial that we make the superintelligence after we have designed and implemented the virtual environment, complete with its virtual persons (or its deterministic procedure to obtain them); we don't want it to be able to influence what goal we give it, so we likely need to have the goal ready and "plugged in" from the start.
Some risks are:
This is a plan with a bunch of things that need work, but it doesn't seem absurdly hard to me; if anything, step 1 seems like the hardest, and I don't even know that we've tried throwing billions of dollars at it.
I share Yudkowsky's current gloomy outlook on AI. The current route of "hey, maybe we should study things vaguely related to harnessing what neural nets do, and hope to be able to grab a miracle should it come up" seems like a pretty bad plan. I think, in comparison, the plan I outline here has better chances.
Keep in mind that my plan is competing not against the likelihood of superintelligence emergence, but against the likelihood that alignment works. If pursuing mostly alignment gives us a 1e-10 chance of survival, and pursuing mostly my plan gives us a 1e-8 chance of survival, then it doesn't matter that yes, superintelligence is still overwhelmingly likely to kill us; we should still favor my plan. See also: this post comparing plans.
I have cross-posted this on LessWrong; feel free to discuss the idea with me there.