AI alignment curves
i can think of five different ways an AI's degree of alignment can change over time:
- unaligned from the start: an AI can just want to take over the world and maximize something we never cared about, and kill everyone in the process.
- sharp left turn: an AI starts out helping us, but then eventually turns out to be unaligned and kills everyone. note that this doesn't have to be shortly after starting the AI; it could for example be many millenia later, once it encounters another superintelligence and gets acausally hacked or something.
- increasingly aligned: some AI starts out not particularly aligned to our goals, but we correct it over time to care about what we want it to care about — being able to do this is typically the goal of "corrigible AI".
- continuously aligned: some AI starts fully aligned to some values we like, and robustly continues being aligned.
- eventually aligned: some AI starts out theoretically aligned to something we like, but goes through extended periods where it causes significant damage because it hasn't yet realized what needs preserving in order to maximize its values.
that last possibility is the main novelty i'm pointing to here. eventually aligned AI may be something such as PreDCA but with a poor ability to deduce the consequences of its mathematical goal, such that it first kills everyone or turns the entire earth into computronium as an instrumentally convergent goal, and then only afterwards realizes that that strongly goes against its utility function. but unless it can recover earth, it's too late: losing humans not only strongly goes against its goal, but also causes it to have lost a lot of information about its user (one of the humans), which might significantly hamper its ability to satisfy their utility function.
under potentially eventually aligned AI, the order in which AI realizes the consequences of its values is very important, because the world is fragile and it may cause a lot of damage before it's able to realize what implementing its values entails.
unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitely mentioned, all content on this site was created by me; not by others nor AI.