tammy's blog about
AI alignment, utopia, anthropics, and more;
let's say that we have an AI implementing a formal goal such as QACI. however, we messed up the formal outer alignment: turns out, the AI's best guess as whats its action should be until it has turned the moon into compute is aligned actions, but after turning the moon into compute, it realizes that its utility function actually entails us dying. i consider this a form of sharp left turn.
i can imagine either of the following happening:
because of my expectation for the AI to maximize its actual utility function — rather than fail by implementing temporary best guess as to what would maximize its utility function — i err on the side of 2. but, do people out there have more solid reasons to discount 1? and can we maybe figure out a way to make 1 happen, even though it seems like it should be as unnatural as corrigibility?
unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitely mentioned, all content on this site was created by me; not by others nor AI.