(posted on 2022-06-10)

where are your alignment bits?

information theory lets us make claims such as "you probably can't compress this 1GB file into 1KB, given a reasonable programming language".
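the claim above is just a counting argument, which we can sketch in a few lines of python (the sizes here are the ones from the sentence; the inequality is the whole point):

```python
# counting argument behind "you can't compress a 1GB file into 1KB":
# there are at most 2**(8193) - 1 distinct programs of 1KB or less,
# but 2**(8 * 2**30) distinct 1GB files. almost no 1GB file can be
# the output of any <=1KB program, whatever the language.
bits_in_1kb = 1024 * 8       # 8192 bits in a 1KB program
bits_in_1gb = (2**30) * 8    # bits in a 1GB file

# upper bound on log2(fraction of 1GB files reachable from <=1KB programs)
log2_fraction = (bits_in_1kb + 1) - bits_in_1gb

# the fraction is smaller than 2**(-8_000_000_000): effectively zero
assert log2_fraction < -(10**9)
```

the same counting holds for any fixed language or decompressor: adding a clever interpreter only shifts the bound by the interpreter's own size.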

when someone claims to have an "easy" solution to aligning AI to human values, which are complex (contain many bits of information), i like to ask: where are the bits of information specifying what human values are?

are the bits of information in the reward function? are they in how you selected your training data? are they in the prompt you intend to give an AI? if you are handing it an entire corpus which you think contains human values: even if you're right, the bits of information are in how you delimit which parts of that corpus encode human values, a plausibly exponential task. classification is hard; "gathering all the raw data" is easy, so that's not where the bits of hard work are.
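to see why the selection itself carries the bits, here is a small sketch (the corpus size `N` and relevant-document count `k` are made-up illustrative numbers, not from the post):

```python
import math

# a corpus of N documents has 2**N subsets, so pointing at
# "the parts that encode human values" is itself up to N bits of information.
N = 1_000_000  # hypothetical corpus size
bits_to_pick_any_subset = N  # log2(2**N)

# even just marking which k documents are relevant costs
# about log2(C(N, k)) bits -- thousands of bits for modest k.
k = 1_000  # hypothetical number of relevant documents
bits_to_mark_k = math.log2(math.comb(N, k))

assert bits_to_mark_k > 9_000  # roughly 11,000+ bits just for the labels
```

none of those bits come for free with the raw corpus; they have to come from somewhere, which is the point of the question.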

this general information-theoretic line of inquiry, i think, does a good job of pointing at why aligning to complex values is likely, actually hard; not just plausibly, maybe hard. we don't "might maybe need" to do the hard work; we likely do need to do the hard work.


CC_-1 License: unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.