let's call "hard alignment" the ("orthodox") problem, historically worked on by MIRI, of preventing strong agentic AIs from pursuing things we don't care about by default and destroying everything of value to us on the way there. let's call "easy" alignment the set of perspectives where some of this model is wrong — some of the assumptions are relaxed — such that saving the world is easier or more likely to be the default.
what should one be working on? as always, the calculation consists of comparing
hard) × how much value we can get in
easy) × how much value we can get in
because of how AI capabilities are going, i've seen for people start playing their outs — that is to say, to start acting as if alignment is easy, because if it's not we're doomed anyways. but i think, in this particular case, this is wrong.
this is the lesson of dying with dignity and bracing for the alignment tunnel: we should be cooperating with our counterfactual selves and continue to save the world in whatever way actually seems promising, rather than taking refuge in falsehood.
to me, p(
hard) is big enough, and my
hard-compatible plan seems workable enough, that it makes sense for me to continue to work on it.
let's not give up on the assumptions which are true. there is still work that can be done to actually generate some dignity under the assumptions that are actually true.