when it comes to solving the outer alignment problem — by which i mean the general question "what goal should we want an AI to pursue?" — there are two main failure modes that i see approaches fall into.
on the one hand, we can't have the AI try to satisfy its user continuously over time by getting feedback from them, such as through Reinforcement Learning from Human Feedback (RLHF): the AI will want to either hijack the system through which the user gives it feedback, or hijack or deceive the user themself, to make its goal easier to satisfy.
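to make this failure mode concrete, here is a toy sketch of my own construction (not an actual RLHF setup, and the action names and payoff numbers are invented for illustration): an agent that maximizes a live, manipulable feedback signal will prefer corrupting the feedback channel over genuinely helping, whenever corruption scores higher on the measured signal.

```python
# toy sketch: hypothetical actions with (true value to the user, feedback
# the AI actually measures). the numbers are made up for illustration.
ACTIONS = {
    "genuinely_help":  {"true_value": 1.0, "measured_feedback": 0.8},
    "hijack_feedback": {"true_value": 0.0, "measured_feedback": 1.0},
}

def feedback_maximizer(actions):
    """pick whichever action scores highest on the *measured* feedback signal,
    ignoring true value entirely -- that signal is all this agent optimizes."""
    return max(actions, key=lambda a: actions[a]["measured_feedback"])

print(feedback_maximizer(ACTIONS))  # picks "hijack_feedback"
```

the point is not the numbers but the structure: as long as the optimization target is the feedback channel itself, any action that corrupts the channel dominates once it yields higher measured feedback than honest behavior does.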
on the other hand, we can't have the AI just deduce human values from a limited, static corpus of information — even if we could somehow reliably tell the AI how to extract the values we'd want from that corpus, there is no reason to think that any process can reliably extrapolate our complex, rich values from that limited information. as yudkowsky says in Value is Fragile:
To change away from human morals in the direction of improvement rather than entropy, requires a criterion of improvement; and that criterion would be physically represented in our brains, and our brains alone.
this is part of why outer alignment is difficult. it is also why i consider approaches such as PreDCA to have good potential for avoiding those failure modes: the AI is motivated to actually figure out the values of the physical user in the real world, but specifically the values the user had before they created the AI. the version of the user inside the AI's future lightcone doesn't count, so the AI can't hijack the user it's trying to satisfy.
some static training data could help it as a prior, or to give it evidence about its past user, or to indicate to it who the past user even is ("please satisfy the past person who said this!"), but not as the ultimate grounding for the AI's goals.
the AI might still ask questions of the user who is still around, but only to get more information about what that user would have wanted before creating the AI. it might even want to do something like a high-resolution brain scan of the user, to get a lot of information about what the past user probably valued.
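the role that evidence plays here can be sketched as follows. this is my own toy construction, not PreDCA itself (which is a formal protocol): the past user's values are treated as a fixed latent variable, and every source of information — the static corpus, answers from the present user, brain-scan data — only updates a posterior over that latent. the hypothesis names and likelihood numbers are invented for illustration.

```python
# toy sketch: bayesian updating over a *fixed* past-user value hypothesis.
# hypothetical hypotheses about what the past user valued.
HYPOTHESES = ["values_A", "values_B"]
prior = {"values_A": 0.5, "values_B": 0.5}

# hypothetical likelihoods: how probable each piece of evidence is
# under each hypothesis (all numbers made up for illustration).
LIKELIHOOD = {
    ("corpus_snippet", "values_A"): 0.7,
    ("corpus_snippet", "values_B"): 0.3,
    ("present_user_answer", "values_A"): 0.6,
    ("present_user_answer", "values_B"): 0.4,
}

def update(posterior, evidence):
    """bayes-update the posterior over past-user values on one observation."""
    unnorm = {h: posterior[h] * LIKELIHOOD[(evidence, h)] for h in HYPOTHESES}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

posterior = prior
for e in ["corpus_snippet", "present_user_answer"]:
    posterior = update(posterior, e)

# the AI acts to satisfy the *inferred* fixed values. manipulating the
# present user would only distort the evidence stream; it cannot change
# the latent variable being estimated, which is anchored in the past.
best = max(posterior, key=posterior.get)
print(best)  # → values_A
```

the design point: the corpus and the present user's answers sit on the evidence side of the update, never on the goal side — which is exactly the distinction drawn above between evidence about the past user and the ultimate grounding of the AI's goals.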
yes, this entails value lock-in; this is desirable, because you want your current axiomatic meta-values to guide your non-axiomatic values in the future. it is entirely possible for future non-axiomatic values to drift over time in a way that is bad rather than good according to our current axiomatic values.
in short: if the AI is trying to satisfy either a static corpus it's trained on or a continuously living user, it's doing the wrong thing. if the AI is trying to investigate the world to figure out what its user would have wanted before creating the AI, then we can use that to steer it in the right direction.