posted on 2023-03-01 — also cross-posted as a response to a post on lesswrong, see there for comments

on strong/general coherent agents

i expect the thing that kills us if we die, and the thing that saves us if we are saved, to be strong/general coherent agents (SGCA) which maximize expected utility. note that this is two separate claims; it could be that i believe the AI that kills us isn't SGCA, but the AI that saves us still has to be SGCA. i could see shifting to that latter viewpoint; i currently do not expect myself to shift to believing that the AI that saves us isn't SGCA.

to me, this totally makes sense in theory, to imagine something that just formulates plans-over-time and picks the argmax for some goal. the whole of instrumental convergence is coherent with that: if you give an agent a bunch of information about the world, and the ability to run eg linux commands, there is in fact an action that maximizes the amount of expected paperclips in the universe, and that action does typically entail recursively self-improving and taking over the world and (at least incidentally) destroying everything we value. the question is whether we will build such a thing any time soon.
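as a toy illustration of "formulate plans and pick the argmax", here is a minimal sketch of expected utility maximization over a tiny, explicitly enumerated set of plans. everything in it is made up for illustration: the plan names, the probabilities in `TOY_WORLD_MODEL`, and the utility function that just counts paperclips. a real agent would not get to enumerate its plans or read its world model off a table like this.

```python
from typing import Callable, Dict, Iterable

Plan = str
Outcome = int  # hypothetical outcome: number of paperclips produced

# made-up world model: for each plan, a probability distribution over outcomes
TOY_WORLD_MODEL: Dict[Plan, Dict[Outcome, float]] = {
    "do nothing": {0: 1.0},
    "build one factory": {100: 0.75, 0: 0.25},
    "acquire resources first": {10_000: 0.5, 0: 0.5},
}


def paperclip_utility(outcome: Outcome) -> float:
    """Assumed utility function: one unit of utility per paperclip."""
    return float(outcome)


def expected_utility(
    plan: Plan,
    world_model: Dict[Plan, Dict[Outcome, float]],
    utility: Callable[[Outcome], float],
) -> float:
    """Probability-weighted sum of utility over the outcomes of a plan."""
    return sum(prob * utility(outcome) for outcome, prob in world_model[plan].items())


def best_plan(
    plans: Iterable[Plan],
    world_model: Dict[Plan, Dict[Outcome, float]],
    utility: Callable[[Outcome], float],
) -> Plan:
    """Return the plan with the highest expected utility (the argmax)."""
    return max(plans, key=lambda plan: expected_utility(plan, world_model, utility))


if __name__ == "__main__":
    print(best_plan(TOY_WORLD_MODEL, TOY_WORLD_MODEL, paperclip_utility))
```

with these made-up numbers, the plan that grabs resources first wins even though it fails half the time, which is the instrumental convergence point in miniature.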

right now, we have some specialized agentic AIs: alphazero is pretty good at reliably winning at go; it doesn't "get distracted" with other stuff. to me, waiting for SGCA to happen is like waiting for a rocket to get to space in the rocket alignment problem: once the rocket is in space it's already too late. the whole point is that we have to figure this out before the first rocket gets to space, because we only get to shoot one rocket to space. one has to build an actual inside view understanding of agenticity, and figure out if we'll be able to build that or not. and, if we are, then we need to solve alignment before the first such thing is built — you can't just go "aha, i now see that SGCA can happen, so i'll align it!" because by then you're dead, or at least past its decisive strategic advantage.

i'm not sure how to convey my own inside view of why i think SGCA can happen, in part because it's capability exfohazardous. maybe one can learn from IEM or the late 2021 MIRI conversations? i don't know where i'd send someone to figure this out, because i think i largely derived it from the empty string myself. it does strongly seem to me that, while a single particular neural net might not be the first thing to be an SGCA, we can totally bootstrap SGCA from existing ML technology; it might just take a clever trick or two rather than being the completely direct solution of "oh you train it like this and then it becomes SGCA". recursive self-improvement is typically involved.

we also have some AIs, including sydney, which aren't SGCA. it might even be that SGCA is indeed somewhat unnatural for a lot of current deep learning capabilities. nevertheless, i believe such a thing is likely enough to be built that it's what it takes for us to die — maybe non-SGCA AI's impact on the economy would slowly disempower us over the course of 20~40 years, but in those worlds AI tech gets good enough that 5 years into it someone figures out the right clever trick to build (something that bootstraps to) SGCA and we die of agentic intelligence explosion very fast before we get to see the slow economic disempowerment. in addition, i believe that our best shot is to build an aligned SGCA.

why haven't animals or humans gotten to SGCA? well, what would getting from messy biological intelligences to SGCA look like? typically, it would look like one species taking over its environment while developing culture and industrial civilization, overcoming in various ways the cognitive biases that happened to be optimal in its ancestral environment, and eventually building more reliable hardware such as computers and using those to make AI capable of much more coherent and unbiased agenticity.

that's us. this is what it looks like to be the first species to get to SGCA. most animals are strongly optimized for their local environment, and don't have the capabilities to be above the civilization-building criticality threshold that would let them build industrial civilization and then SGCA AI. we are the first ones to get past that threshold; we're the first ones to fall into an evolutionary niche that lets us do that. this is what it looks like to be the biological bootstrap part of the ongoing intelligence explosion; if dogs could do that, then we'd simply observe being dogs in the industrialized dog civilization, trying to solve the problem of aligning AI to our civilized-dog values.

we're not quite SGCA ourselves because, turns out, the shortest path from ancestral-environment-optimized life to SGCA is to build a successor that is much closer to SGCA. if that successor is still not quite SGCA enough, then its own successor will probly be. this is what we're about to do, probly this decade, in industrial civilization. maybe if building computers was much harder, and brains were more reliable to the point that rational thinking was not a weird niche thing you have to work on, and we got an extra million years or two to evolutionarily adapt to industrialized society, then we'd become properly SGCA. it does not surprise me that that is not, in fact, the shortest path to SGCA.


unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor AI.