what i call "formal alignment" is an approach to solving AI alignment that consists of:
those two points correspond to formal alignment's notions of outer and inner alignment, respectively: determining what formal thing to align the AI to, and figuring out how to build something that is indeed aligned to it without running into inner misalignment issues.
for reasons why i think this is the least hopeless path to saving the world, see my outlook on AI risk mitigation. the core motivation for formal alignment, for me, is that a working solution is at least eventually aligned: there is an objective answer to the question "will maximizing this with arbitrary capabilities produce desirable outcomes?" where the answer does not depend, at the limit, on what does the maximization. and the fact that such a formal thing is aligned in the limit makes it robust to sharp left turns. what remains then is just "bridging the gap": getting from eventual to continuous alignment, perhaps by ensuring the right ordering of attained capabilities.
potential formal alignment ideas include:
if there are formal alignment ideas i'm missing, please tell me about them and i'll add them here.
because these various proposals consist of putting together a formal mathematical expression, they rely on finding various true names. for example: PreDCA tries to put together the true names for causality, agency, and the AI's predecessor; IGP requires the true name for computing a program forwards; QACI requires a true name for identifying pieces of data in causal worlds, and replacing them with counterfactual alternatives; UAT requires the true names for parent universe/simulation, control over resources, and comparing amounts of resources with those in the AI's future lightcone.