
cognitive biases regarding the evaluation of AI risk when doing AI capabilities work

i have recently encountered a few rationality failures in the context of talking about AI risk. i will document them here for reference; they have probably already been documented elsewhere, but their application to AI risk is particularly relevant here.

1. forgetting to multiply

let's say i'm talking with someone about the likelihood that working on some form of AI capability kills everything everywhere forever. they say: "i think the risk is near 0%". i say: "i think the risk is maybe more like 10%".

would i bet that it will kill everyone? no, 10% is less than 50%. but "what i would bet on" isn't the only relevant thing; a proper utilitarian multiplies likelihood by quality of outcome. and X-risk is really bad. i see some people mistakenly use only the probability, forgetting to multiply; if they think everyone dying is not likely, that's enough for them. but "not likely" is not enough: one should care that it's extremely unlikely.
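
to make the multiplication concrete, here is a minimal python sketch. the `BADNESS` value is a made-up number in arbitrary units, and `would_bet` and `expected_loss` are hypothetical helper names i'm introducing for illustration; only the 10% figure comes from the example above.

```python
# hypothetical illustration: every name and the badness value are assumptions.

def would_bet(probability: float) -> bool:
    """the "what would i bet on" test: only asks whether it's more likely than not."""
    return probability > 0.5


def expected_loss(probability: float, badness: float) -> float:
    """the utilitarian test: likelihood multiplied by how bad the outcome is."""
    return probability * badness


# assumed badness of an existential catastrophe, in arbitrary units
BADNESS = 1_000_000.0

for p in (0.001, 0.10):
    # both probabilities fail the betting test, but their expected losses differ
    print(p, would_bet(p), expected_loss(p, BADNESS))
```

neither probability passes the betting test, yet under these assumed numbers the expected losses differ by a factor of a hundred. that difference is exactly what disappears when one looks only at "would i bet on it".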

2. categorizing vs averaging risk

let's take the example above again. let's say you believe said likelihood is close to 0% and i believe it's close to 10%; and let's say we each believe the other person generally tends to be as correct as oneself.

how should we come out of this? some people seem to want to pick an average between "carefully avoiding killing everyone" and "continuing as before" — which lets them more easily continue as before.

this is not how things should work. if i learn that someone who i generally consider about as likely as me to be correct about things seriously thinks there's a 10% chance that my tap water has lead in it, my reaction is not "well, whatever, it's only 10% and only one of the two of us believes this". my reaction is "what the hell?? i should look into this and stick to bottled water in the meantime". the average between risk and no risk is not "i guess maybe risk maybe no risk"; it's "lower (but still some) risk". the average between ≈0% and 10% is not "huh, well, one of those numbers is 0% so i can pick 0% and only have half a chance of being wrong"; the average is 5%. 5% is still a large risk.

this is kind of equivalent to forgetting to multiply, but to me it's a different problem: here, one is not just forgetting to multiply, one is forgetting that probabilities are numbers altogether, and is treating them as a set of discrete objects that they have to pick one of — and thus can justify picking the one that makes their AI capability work okay, because it's one out of the two objects.
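
the difference between the two ways of combining estimates can be sketched in a few lines of python. this assumes a plain equal-weight average, which is one simple way to pool the views of two peers who trust each other equally, not the only one; `pick_convenient` and `pooled_estimate` are hypothetical names for illustration.

```python
# hypothetical illustration: function names are assumptions, numbers are from the example.

def pick_convenient(estimates: list[float]) -> float:
    """the failure mode: treat estimates as discrete options and pick the comfortable one."""
    return min(estimates)


def pooled_estimate(estimates: list[float]) -> float:
    """equal-weight average, for peers judged about equally likely to be right."""
    return sum(estimates) / len(estimates)


estimates = [0.0, 0.10]  # "close to 0%" and "close to 10%"

print(pick_convenient(estimates))  # the categorical choice
print(pooled_estimate(estimates))  # the numerical average
```

the categorical choice throws away the other person's estimate entirely; the average keeps half of it, which here leaves a 5% risk on the table.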

3. deliberation ahead vs retroactive justification

someone says "well, i don't think the work i'm doing on AI capability is likely to kill everyone" or even "well, i think AI capability work is needed to do alignment work". that may be true, but how carefully did you arrive at that consideration?

did you sit down at a table with everybody, talk about what is safe and needed to do alignment work, and determine that AI capability work of the kind you're doing is the best course of action to pursue?

or are you already committed to AI capability work and are trying to retroactively justify it?

i know the former isn't the case because there was no big societal sitting down at a table with everyone about cosmic AI risk. most people (including AI capability devs) don't even meaningfully know about cosmic AI risk, let alone have deliberated on what to do about it.

this isn't to say that you're necessarily wrong; maybe by chance you happen to be right this time. but this is not how you arrive at truth, and you should be highly suspicious of such convenient retroactive justifications. and by "highly suspicious" i don't mean "think mildly about it while you keep gleefully working on capability"; i mean "seriously sit down and reconsider whether what you're doing is more likely helping to save the world, or hindering saving the world".

4. it's not a prisoner's dilemma

some people think of alignment as a coordination problem. "well, unfortunately everyone is in a rat race to do AI capability, because if they don't they get outcompeted by others!"

this is not how it works. such prisoner's dilemmas work because if your opponent defects, your outcome if you cooperate is worse than if you defect too. this is not the case here; fewer people working on AI capability is pretty much strictly less probability that we all die, because it's just fewer people trying (and thus fewer people likely to randomly create an AI that kills everyone). even if literally everyone except you is working on AI capability, you should still not work on it; working on it would still only make things worse.

"but at that point it only makes things negligibly worse!"

…and? what's that supposed to justify? is your goal to cause evil as long as you only cause very small amounts of evil? shouldn't your goal be to just generally try to cause good and not cause evil?
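
here is a toy model of that claim, as a hedged python sketch. it assumes each person working on capability has the same small, independent chance `p_each` of creating an AI that kills everyone; that independence assumption and the value 0.001 are mine, purely for illustration, and `doom_probability` and `marginal_risk` are hypothetical names.

```python
# hypothetical toy model: p_each and the independence assumption are illustrative only.

def doom_probability(n_workers: int, p_each: float = 0.001) -> float:
    """chance that at least one of n_workers independently causes catastrophe."""
    return 1.0 - (1.0 - p_each) ** n_workers


def marginal_risk(n_others: int, p_each: float = 0.001) -> float:
    """extra risk you add by joining when n_others are already working on capability."""
    return doom_probability(n_others + 1, p_each) - doom_probability(n_others, p_each)


for n_others in (0, 10, 1000):
    # the added risk shrinks as more people are already working, but never reaches zero
    print(n_others, marginal_risk(n_others))
```

in this model, joining always adds risk, no matter how many others are already working on capability. the added amount gets smaller as the crowd grows, which is the "negligibly worse" case, but it stays above zero; there is no number of defectors at which defecting yourself improves the outcome.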

5. we are utilitarian… right?

when situations akin to the trolley problem actually appear, it seems a lot of people are very reluctant to actually press the lever. "i was only LARPing as a utilitarian this whole time! pressing the lever makes me feel way too bad to do it!"

i understand this and worry that i am in that situation myself. i am not sure what to say about it, other than: if you believe utilitarianism is what is actually right, you should try to actually act utilitarianistically in the real world. you should actually press actual levers in trolley-problem-like situations in the real world, not just nod along that pressing the lever sure is the theoretical utilitarian optimum to the trolley problem and then keep living as a soup of deontology and virtue ethics.

i'll do my best as well.

a word of sympathy

i would love to work on AI capability. it sounds like great fun! i would love for everything to be fine; trust me, i really do.

sometimes, when we're mature adults who take things seriously, we have to actually consider consequences and update, and make hard choices. this can be kind of fun too, if you're willing to truly engage in it. i'm not arguing with AI capabilities people out of hate or condescension. i know it sucks; it's painful. i have cried a bunch these past months. but feelings are no excuse to risk killing everyone. we need to do what is right.

shut up and multiply.


Unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
This site lives at https://carado.moe and /ipns/k51qzi5uqu5di8qtoflxvwoza3hm88f5osoogsv4ulmhurge2etp9d37gb6qe9.