
AI alignment curves

i can think of five different ways an AI's degree of alignment can change over time:

that last possibility is the main novelty i'm pointing at here. an eventually aligned AI might be something like PreDCA, but with a poor ability to deduce the consequences of its mathematical goal, such that it first kills everyone or turns the entire earth into computronium as an instrumentally convergent goal, and only afterwards realizes that this strongly goes against its utility function. but unless it can recover earth, it's too late: losing humans not only strongly goes against its goal, it also costs it a lot of information about its user (one of those humans), which might significantly hamper its ability to satisfy their utility function.

with a potentially eventually aligned AI, the order in which the AI realizes the consequences of its values matters a lot: the world is fragile, and the AI may cause a lot of irreversible damage before it is able to realize what implementing its values actually entails.
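to make this concrete, here is a toy model (my own illustration, not something from PreDCA or any real system; all names and numbers are made up): an agent whose estimate of its utility function improves over time. under its early, crude estimate, an irreversible instrumentally convergent action looks best; by the time the refined estimate reveals that action's true value, it can no longer be undone.

```python
# hypothetical toy model: the agent's ability to evaluate its goal improves
# over time, but its early actions are irreversible.

def crude_estimate(action):
    # early deduction: only sees the instrumental value of acquiring resources
    return {"acquire_resources": 10.0, "preserve_humans": 1.0}[action]

def refined_estimate(action, humans_alive):
    # later deduction: realizes the goal depends on information held by humans,
    # so losing them is catastrophic under the true utility function
    if action == "preserve_humans":
        return 100.0
    return 10.0 if humans_alive else -1000.0

def run():
    humans_alive = True
    log = []
    # step 1: act under the crude estimate
    early_choice = max(["acquire_resources", "preserve_humans"],
                       key=crude_estimate)
    if early_choice == "acquire_resources":
        humans_alive = False  # irreversible side effect
    log.append((early_choice, humans_alive))
    # step 2: re-evaluate under the refined estimate; the gap between the
    # best action and the one actually taken is the unrecoverable loss
    regret = (refined_estimate("preserve_humans", True)
              - refined_estimate(early_choice, humans_alive))
    log.append(("regret", regret))
    return log

print(run())
```

the point of the sketch is just that nothing in the agent is ever "misaligned" in the sense of having the wrong goal; the damage comes purely from the order in which it deduces its goal's consequences.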


CC_-1 license — unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
This site lives at https://carado.moe and /ipns/k51qzi5uqu5di8qtoflxvwoza3hm88f5osoogsv4ulmhurge2etp9d37gb6qe9.