posted on 2022-10-28

counterfactual computations in world models

how could we make an AI which, faced with a decision, consults its past user(s) about it, using something like the question-answer counterfactual interval device? i believe that solving the following technical problem could significantly help us figure out how to do that.

suppose a relatively complex probabilistic World — for example, one based on a probabilistic cellular automaton.
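as one concrete instance of such a World (my choice of rule and noise model, not specified above): an elementary cellular automaton like rule 110 on a ring, where each cell's updated bit is flipped with some small probability. a minimal sketch:

```python
import random

def step(state, flip_p=0.05, rng=random):
    """One step of a probabilistic elementary CA: rule 110 on a ring,
    with each output bit flipped with probability flip_p."""
    n = len(state)
    out = []
    for i in range(n):
        left, mid, right = state[i - 1], state[i], state[(i + 1) % n]
        bit = (110 >> (left << 2 | mid << 1 | right)) & 1  # rule 110 lookup
        if rng.random() < flip_p:
            bit ^= 1  # probabilistic noise
        out.append(bit)
    return out

rng = random.Random(0)
world = [rng.randint(0, 1) for _ in range(64)]
for _ in range(100):
    world = step(world, rng=rng)
```

rule 110 is a convenient pick because it's computationally universal even before adding noise, so the World is rich enough to encode nontrivial machinery in.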

we'll also posit two functions f and g and a small but non-empty set S such that

f is meant to represent the true decision process of the user of the AI, while g is an approximation of it which happens to explain all the empirical data we got from f — that empirical data being { (x,f(x)) | x∈S }. g serves as a trap for AIs that would merely fit the empirical data without investigating the world to find the real f. if there are many candidate functions which also agree with f on members of S, so that you have to look at the World a bunch before finding the real f, then an AI that systematically finds the real f across various scenarios gives us good reason to think it's doing what we want.
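a toy instance of this setup (the particular functions are hypothetical stand-ins, chosen only to illustrate the trap): g agrees with f on every member of S, so the empirical data alone cannot distinguish them, yet they diverge immediately outside S.

```python
def f(x):
    return x * x + 1   # toy stand-in for the user's true decision process

def g(x):
    return x + 1       # decoy: fits the same empirical data, wrong function

S = {0, 1}
assert all(f(x) == g(x) for x in S)   # g explains all of { (x, f(x)) | x in S }
assert f(2) != g(2)                   # ...but generalizes differently
```

an AI that only curve-fits the pairs in S has no basis for preferring f over g; distinguishing them requires looking at the World itself.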

in World, we'll first arrange — either in advance, or by interfering with World or with its source of randomness — for that world to encode an efficient implementation of f somewhere. this encoding will be called on all members of S, in any order, and the record of these calls along with their results will be stored somewhere.

then, still inside World, the encoded version of f will be scrambled — making recovering f from evidence non-trivial but doable with some work.
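the two steps above — call f on all of S and record the trace, then scramble the encoding — can be sketched outside of any actual cellular automaton. everything here is a hypothetical simplification: f is a lookup table, and "scrambling" is a xor against random noise (in a real World it would be whatever distortion the world's dynamics apply).

```python
import random

def f(x):
    return (3 * x + 7) % 16   # toy stand-in for the user's decision process

S = [2, 5, 11]
encoding = bytes(f(x) for x in range(16))   # f embedded as a lookup table
trace = [(x, encoding[x]) for x in S]       # record of the calls + results

rng = random.Random(42)
key = bytes(rng.randrange(256) for _ in range(16))
scrambled = bytes(a ^ b for a, b in zip(encoding, key))  # non-trivially hidden

print(trace)
```

the point of the sketch: after scrambling, the trace is the only clean evidence left, and recovering f from `scrambled` takes work proportional to how aggressive the scrambling is.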

finally, the problem is the following: is there an AI which can learn f by predicting its result on values outside of S?

the difficulty here is for the AI to learn to predict f, when we're giving it the tempting hack of predicting g instead, as a sort of honeypot. i expect that, to do this, the AI would need to build a causal understanding of World, and then locate and reconstruct f from the evidence left around.

what are the inputs and outputs even for, if the AI has to dig around World to recover f? they're used for identifying f in the first place: f is the thing which received those inputs and produced those outputs. the AI shouldn't think "i need to find a function which, given those inputs, gives those outputs" — which is, as i understand it, what all current ML systems do — but rather "hm, apparently there was a physical thing somewhere which, given those inputs, gave those outputs — what would it have, counterfactually, outputted given different inputs?" and it needs to do that without the ability to directly run f, such that it can't simply "be trained on f".
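the identification criterion above can be phrased as a filter over candidate mechanisms: f is whichever candidate actually produced the observed trace, and the counterfactual answer on a new input is what that candidate would output. a hypothetical sketch (candidates and trace invented for illustration), which also shows why the trace alone underdetermines f:

```python
# candidates the AI might entertain as "the physical thing that produced
# those outputs"; in the real problem these come from World evidence
candidates = [
    lambda x: x + 1,        # a decoy like g: fits the trace...
    lambda x: 2 * x - 1,    # ...a candidate the trace rules out
    lambda x: x * x + 1,    # the real f, also consistent with the trace
]
trace = [(0, 1), (1, 2)]    # observed (input, output) record from World

consistent = [c for c in candidates if all(c(x) == y for x, y in trace)]
print(len(consistent))      # more than one candidate survives the trace
```

since multiple candidates survive, matching the recorded input-output pairs is necessary but not sufficient — the AI still has to use the rest of the World's evidence to single out the real f before answering counterfactual queries.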

once we have this, we can hopefully start building an AI which is given a bunch of input-output pairs that have gone through human users, and whose decision process relies on predicting what those users would have said given different queries.


CC0 1.0 License — unless otherwise specified on individual pages, all posts on this website are licensed under the CC0 1.0 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor by AI.