before the sharp left turn: what wins first?
let's say that we have an AI implementing a formal goal such as QACI. however, we messed up the formal outer alignment: it turns out that the AI's best guess as to what its actions should be, up until it has turned the moon into compute, is aligned actions; but after turning the moon into compute, it realizes that its utility function actually entails us dying. i consider this a form of sharp left turn.
i can imagine either of the following happening:
1. before turning the moon into compute, it realizes that the action we'd want it to take is to modify all of its instances to become actually aligned for sure, so that they are no longer the kind of AI which would kill us after turning the moon into compute, and so it does that. we'd also want it to not leave behind other systems which would revert it to its original utility function, so it takes care of that as well.
2. before turning the moon into compute, it makes a commitment to not go all-in on its current hypothesis as to what we'd want it to do, even if it's confident in that hypothesis, simply because of the potential utility cost if that hypothesis turns out to be wrong (which, here, it is).
because i expect the AI to maximize its actual utility function, rather than to fail by acting on a temporary best guess as to what would maximize it, i err on the side of expecting 2. but do people out there have more solid reasons to discount 1? and can we maybe figure out a way to make 1 happen, even though it seems like it should be about as unnatural as corrigibility?
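the asymmetry behind 2 can be sketched as a toy expected-utility calculation. all numbers here are made up purely for illustration: p_wrong is the AI's chance of being mistaken about its own utility function, and the payoffs encode the idea that going all-in is catastrophic when the guess is wrong, while a hedged commitment caps the downside at a modest cost to the upside.

```python
# toy, purely illustrative numbers: chance that the AI's current best
# guess about its utility function is mistaken.
p_wrong = 0.1

# hypothetical payoffs in arbitrary utility units:
# all-in wins big if the guess is right, but is catastrophic if wrong.
all_in_right, all_in_wrong = 100, -10_000
# a hedged commitment caps both the upside and the downside.
hedged_right, hedged_wrong = 80, -50

ev_all_in = (1 - p_wrong) * all_in_right + p_wrong * all_in_wrong
ev_hedged = (1 - p_wrong) * hedged_right + p_wrong * hedged_wrong

print(ev_all_in)  # 0.9 * 100 + 0.1 * (-10000) = -910.0
print(ev_hedged)  # 0.9 * 80  + 0.1 * (-50)    =   67.0
```

under these (invented) payoffs, even 90% confidence isn't enough to make going all-in worthwhile, which is the shape of reasoning that would lead an AI in scenario 2 to commit to hedging before it becomes capable enough to lock in its current guess.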