posted on 2022-08-17

alignment research space is potentially malign

alignment research leads to many very strange places. some of those could totally lead alignment researchers, with enough time or otherwise augmented thinking, or maybe even with the limited thinking time and capacity we have now, to stumble upon demonic forces we're too imperfect to avoid.

because of this, it isn't just inadequately implemented aligned AI which is vulnerable; inadequately implemented human cognition might be vulnerable as well. who knows what kind of traps we might fall into, and what the first highly memetic one could look like. and also, unlike AI, we can't implement better decision theory into ourselves.

in this landscape it is useful to focus on decision theory, and perhaps on plans which are willing to sacrifice some amount of certainty for the sake of expedience: not just because we're racing with capabilities, but now also to reduce the risk of encountering possibly-memetic demons. maybe we can design "computer-assisted research" software to help us systematically avoid those, or perhaps we can design plans where most of the aspects of alignment are automatically solved by software that bootstraps aligned superintelligent AI, rather than solved manually by humans. in the meantime, memetic/infohazard hygiene and containment policies would be good to develop.

(by the way, this is a good reason to think clippy might not even tile the universe with paperclips — it may fall prey to demons before/instead of implementing into itself the devices it needs to avoid them. alignment is not just the work of making the AI's values be our values — it's also the work of making the AI's values resilient and correctly pursued, as opposed to hijackable or value-driftable or subject to whatever other failure modes exist)


unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor AI.