posted on 2023-02-26

some thoughts about terminal alignment

i think that, for solving alignment, it's important to be able to delegate as much of the work as possible to processes that could do a better job than what we can come up with now.

i currently plan to delegate not just world modeling and possibly embedded agency, but also to delegate finding an actual aligned utility function, by launching a CEV-like long-reflection process which hopefully solves it.

however, some people have expressed skepticism that the processes to which i want to delegate will be able to figure out anything at all, rather than passing the buck forever; that is, they doubt that there are terminal alignment solutions, after which we'd be confident that things are locked in as good. so, i'd like to list some potential candidates for such schemes; not necessarily for implementing them now, but for having some reassurance that one of them can be figured out eventually.

finally, it's important to remember that while human values might seem incoherent or hard to determine, i think partial solutions such as "just build this utopia and hopefully it's good enough" can still be very satisfactory. a utility function doesn't have to be "directly maximize these exact human values"; it can just be "maximize the number of computed steps of program X", where X can for example be a deterministic utopia program.
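
to make the shape of "maximize the number of computed steps of program X" a bit more concrete, here is a minimal toy sketch. everything in it is an assumption made for illustration: `toy_utopia_step` is a stand-in arithmetic function rather than anything like an actual utopia program, and `run_steps` and `utility` are hypothetical names. it also assumes the world hands us a clean trace of states to check, which skips the hard part of locating computations of X inside a real world-model.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")


def toy_utopia_step(state: int) -> int:
    """hypothetical stand-in for one step of the deterministic program X."""
    return (state * 3 + 1) % 17


def run_steps(step: Callable[[S], S], initial: S, n: int) -> List[S]:
    """return the initial state followed by the next n states of the program."""
    states = [initial]
    for _ in range(n):
        states.append(step(states[-1]))
    return states


def utility(trace: Sequence[S], step: Callable[[S], S], initial: S) -> int:
    """count how many steps of X the trace computes correctly, in order.

    the trace must start at `initial`; counting stops at the first state
    that is not what X would have produced next.
    """
    steps = 0
    expected = initial
    for i, state in enumerate(trace):
        if state != expected:
            break
        if i > 0:
            # the initial state itself is not a computed step
            steps += 1
        expected = step(state)
    return steps


faithful = run_steps(toy_utopia_step, 0, 5)
# -1 is never produced by the toy step, so this marks a corrupted state
corrupted = faithful[:3] + [-1] + faithful[4:]

print(utility(faithful, toy_utopia_step, 0))
print(utility(corrupted, toy_utopia_step, 0))
```

the point of the sketch is only that this utility function never mentions human values directly: it is defined entirely by X and by how far X gets run, so a trace that faithfully runs X for longer scores higher than one that diverges early.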


unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor AI.