one approach to AI alignment is this: develop technology to analyze what is going on within an AI, in order to determine what it's thinking. this is called interpretability. the most generalized version of this would involve something like demon detection in arbitrary computations.
in addition, in order to implement our values, an aligned AI should deeply care about them, which is to say: it should care about implementing those values even in arbitrarily encoded computations. it's not enough that the humans in the simulation don't suffer; they should also be unable to run computers which contain other humans who suffer, along with the many other weird ways suffering moral patients can occur. for example, it should ban all homomorphic encryption it can't decrypt, because otherwise it might be missing some suffering moral patients.
i claim that there is a commonality between those two things: the detection of deeply, arbitrarily encoded computations. in one case for demons, in the other for moral patients. i wonder if there are models of computation, perhaps laden with proofs of benign-ness, which make detecting those systematically doable.
this doesn't necessarily have to do with value-laden things; a theory of generalized interpretability could be used to determine objectively whether a given computation deeply contains, say, rule 30.
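to make the rule 30 example concrete, here's a minimal sketch of the easy, "shallow" end of the problem: rule 30 itself is trivial to specify and run, and a trace of it under the identity encoding is trivial to recognize. the function names and trace format here are illustrative inventions, not an established formalism; the hard part, detecting rule 30 when it's arbitrarily re-encoded inside some other computation, is exactly what this toy check cannot do.

```python
def rule30_step(cells: tuple[int, ...]) -> tuple[int, ...]:
    """one synchronous update of rule 30 on a row of cells (fixed 0 boundary)."""
    n = len(cells)
    out = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        center = cells[i]
        right = cells[i + 1] if i < n - 1 else 0
        # rule 30's update table is equivalent to: left XOR (center OR right)
        out.append(left ^ (center | right))
    return tuple(out)

def literally_contains_rule30(trace: list[tuple[int, ...]]) -> bool:
    """check whether a list of rows is a run of rule 30 under the identity
    encoding -- the trivial base case of 'deep containment'."""
    return all(rule30_step(a) == b for a, b in zip(trace, trace[1:]))

# build a short rule 30 trace from a single live cell
row = tuple(1 if i == 7 else 0 for i in range(15))
trace = [row]
for _ in range(5):
    trace.append(rule30_step(trace[-1]))

print(literally_contains_rule30(trace))  # True for an honest trace
```

note that any fixed re-encoding of the trace (permuting cells, xoring with a mask, homomorphically encrypting it) already defeats this check, which is the gap between "contains" and "deeply contains".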
furthermore, it seems like for this theory to determine whether "computation X deeply contains computation Y", we would need to specify Y in a profound way, and that might be the kind of format we'd need values to be in for general alignment. as an example, an aligned AI could be tasked with running whichever computations contain persons but do not contain suffering; and for such a specification to apply at arbitrary depths of encoding, it would need to be fully general. note that i'm not sure it would need to be fully general, but my intuition points that way.