(posted on 2022-11-18)

"humans aren't aligned" and "human values are incoherent"

"humans aren't aligned" or "our values are not coherent" are concerns that i occasionally hear about the odds of AI alignment research being able to accomplish what it intends to do.

it is worth remembering that "aligned" is a two-place relation: we ask whether a system is aligned with another system. a given human is aligned with themself, by definition, to the extent that they have values at all. it is true that humans are not fully aligned to one another, but there is significant overlap, and there is general agreement that AI doom is worse than most expectable value handshakes/bargaining or even indirect universalism. this is why we don't observe alignment researchers trying to beat one another to be the one whose values tile the universe: any kind of effort in that direction would likely hamper the total chances of alignment being successful to begin with, and alignment failing is what we're all trying to avoid.

on the topic of value coherence, it may be true that some of my preferences can't easily, or even at all, be formulated as a formal utility function that a fully aligned AI ought to maximize. but i have meta-values, and i'm reasonably confident that my meta-values entail my incoherent preferences being satisfied too, whatever i'd find that to mean under enough reflection (as that is itself a value extrapolation process i have meta-values about). at the very least, some kind of libertarian framework where i can do more or less whatever i want while being free from moloch surely must be sufficient: if we build a world where you can do whatever you want, then that should include whatever you're doing now to satisfy those incoherent values.

don't get me wrong, alignment is not looking great. but i believe it is a solvable problem, and i don't believe these concerns are particularly big hurdles.

