posted on 2022-11-19

wonky but good enough alignment schemes

i'll say that an AI alignment scheme is "aligned enough" not just if it consists of building a fully aligned singleton (FAS) which i'd trust to do what's good at any level of capabilities; i'd also say that an alignment scheme is "aligned enough" if it leads, even indirectly, to the construction of such a FAS.

one potential way we could get this is by using some kind of assistant AI (perhaps a system that uses GPT or something) to figure out how to build a FAS. the assistant might not be eventually aligned; if it had enough capability, it might realize that actually it wants to kill everyone. but we'd be relying on it being, for as long as we use it, weak enough to not do such a thing.

this kind of "wonky alignment scheme", which is "aligned enough" but goes through temporarily aligned AIs that we need to know are weak in order for the scheme to work, might end up being useful, given that such a temporarily aligned AI might be much easier to build than an eventually aligned, let alone continuously aligned, AI.

(maybe, actually, what we should be doing with our limited time and resources is not building FAS, but melting GPUs or something else like that. if we had a temporarily-aligned assistant AI which we get to charge with tasks, the task we should want to give it is a very general aligned goal such as satisfying its past-user, such that the AI imagining the past-user-under-reflection would be able to consider all those plans and pick the one that actually needs work, rather than working on what we think is most important.)

while i try to do work that directly contributes to the creation of a FAS, indirect "wonky" approaches such as using large language models to accelerate alignment research have their place too; it's not like one is particularly less hopeless than the other. my main source of skepticism about them is the requirement above that the intermediary system be weak: that is not an assumption i like having to make about AIs that would have to be in some sense more capable than us, which they'd have to be if they are to make a difference.

