when someone claims to have an "easy" solution to aligning AI to human values, which are complex (have many bits of information), i like to ask: where are the bits of information of what human values are?
are the bits of information in the reward function? are they in how you selected your training data? are they in the prompt you intend to give an AI? if you are giving it an entire corpus of data which you think contains human values: even if you're right, the bits of information are in how you delimit which parts of that corpus encode human values, a task that plausibly grows exponentially with the size of the corpus. classification is hard; "gathering all raw data" is easy, so that's not where the bits of hard work are.
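to make the "plausibly exponential" part concrete, here is a minimal counting sketch. it assumes, purely for illustration, that the corpus is split into `n_items` discrete chunks and that "which parts encode human values" means picking a subset of those chunks; the function names are hypothetical and not from any library. a corpus of n chunks has 2^n possible subsets, so pointing at one of them with no other prior takes up to n bits.

```python
import math


def bits_to_pick_any_subset(n_items: int) -> int:
    # there are 2**n_items possible subsets of the corpus,
    # so naming one of them takes log2(2**n_items) = n_items bits
    return n_items


def bits_to_pick_k_subset(n_items: int, k: int) -> float:
    # if we already know exactly k chunks matter, there are C(n_items, k)
    # candidate subsets; naming one takes log2 of that count
    return math.log2(math.comb(n_items, k))


if __name__ == "__main__":
    n = 1000  # hypothetical number of chunks in the corpus
    print("bits to name an arbitrary subset:", bits_to_pick_any_subset(n))
    print("bits to name a 100-chunk subset:", bits_to_pick_k_subset(n, 100))
```

the number of candidate subsets is what grows exponentially; the bits needed to single one out grow with the corpus, and none of those bits are supplied by the act of gathering the corpus itself.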
this general information-theoretic line of inquiry, i think, does a good job of pointing at why aligning to complex values is likely, actually hard; not just plausibly, maybe hard. it's not that we "might maybe need" to do the hard work; we likely do need to do the hard work.