while i think it doesn't solve that much on its own, i don't think building an AI box is particularly difficult.
i am not talking about giving an AI full dominion over a computer and then physically isolating that computer; physical limitations are hard to predict and unreliable. i am talking about mathematically isolating a piece of software, using known sandboxing techniques.
for example, consider an AI whose code is encoded as SKI calculus, running on a simple SKI calculus calculator. or an AI whose code is encoded into conway's game of life, running on a simple conway's game of life simulator. these programs can be defined in such a way that their only failure mode is stopping due to running out of resources.
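as a toy illustration of the SKI case, here is a minimal sketch of a resource-bounded SKI calculus evaluator (a hypothetical example of the kind of simple interpreter meant above, not any particular AI encoding); its only failure mode is exhausting its step budget:

```python
# terms are nested structures: "S", "K", "I", or a pair (f, x) for application.

class OutOfFuel(Exception):
    """the evaluator's only failure mode: the step budget ran out."""

def reduce_once(term):
    """perform one leftmost reduction step; return (new_term, changed?)."""
    if isinstance(term, tuple):
        f, x = term
        # (I a) -> a
        if f == "I":
            return x, True
        # ((K a) b) -> a
        if isinstance(f, tuple) and f[0] == "K":
            return f[1], True
        # (((S a) b) c) -> ((a c) (b c))
        if (isinstance(f, tuple) and isinstance(f[0], tuple)
                and f[0][0] == "S"):
            a, b, c = f[0][1], f[1], x
            return ((a, c), (b, c)), True
        # otherwise try to reduce inside the application, left side first
        new_f, changed = reduce_once(f)
        if changed:
            return (new_f, x), True
        new_x, changed = reduce_once(x)
        return (f, new_x), changed
    return term, False

def run(term, fuel=10_000):
    """evaluate to normal form, or raise OutOfFuel; nothing else can happen."""
    for _ in range(fuel):
        term, changed = reduce_once(term)
        if not changed:
            return term
    raise OutOfFuel

assert run((("K", "I"), "S")) == "I"  # ((K I) S) reduces to I
```

the point is that the interpreter's state space is exactly the term being rewritten plus a fuel counter; there is no channel by which the evaluated program can do anything other than rewrite terms or run out of fuel.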
remaining vulnerabilities involve things like rowhammer, or meltdown and spectre; but those shouldn't be too hard to contain, so long as the AI's level of interpretation is sufficiently abstracted away from the hardware that the configurations leading to these exploits simply don't occur. in fact, considering how easy it is to build very abstract turing-complete systems, designing a fully sandboxed, highly-abstracted-from-hardware turing-complete environment running a boxed AI should be doable with only a constant-factor slowdown. the information-theoretic limit of this would be something like "reverse" homomorphic encryption, where the inside of the encrypted computation is information-theoretically guaranteed to be unable to control how its process affects the hardware around it (as long as the encryption scheme holds, of course); though i am unsure whether such a scheme exists, or even makes sense to consider.
as long as the computation's output isn't used for anything, i posit that this is reasonably safe. the difficult part is designing output ports that are safe. however, even without that, we can do some cool stuff such as making a reinforcement-learning AI's reward function unhackable.
if the boxed AI is fully deterministic, then it has no notion of anything going on in the real world, like how fast time is passing or what hardware it's running on; all it has to go on is the structure of its boxed environment, which can be made simple enough that the AI guessing anything about us seems information-theoretically unlikely. for example, i would think that an AI living in a seemingly classical computation would be unlikely to guess that the outside world runs on quantum mechanics. giving the boxed AI input ports is theoretically safe, but significantly amplifies any risks due to insufficient output safety; if the AI does have access to outside information, it gets evidence as to what kind of world it is trying to affect.
what could a safe output port look like? we could use the output port to receive proofs of theorems, which are verifiable, for example. i believe how to turn this into pivotal acts is discussed in some of the first late-2021 MIRI conversations. however, we may prefer queries which verifiably have only one correct answer (like "what is the ordered list of prime factors of this number"; unlike "what is a proof of this theorem"); if there are multiple correct answers, then that's some degree of freedom the AI could use to affect us, especially if any human happens to glance upon the output data.
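the unique-answer idea can be sketched concretely. below is a hypothetical verifier for the prime-factorization query: by the fundamental theorem of arithmetic there is exactly one sorted list of primes multiplying to n, so any output that passes this check carries no information beyond "the AI solved the problem" (timing side channels aside, which this sketch doesn't address):

```python
def is_prime(p: int) -> bool:
    """trial-division primality check; fine for a sketch."""
    if p < 2:
        return False
    d = 2
    while d * d <= p:
        if p % d == 0:
            return False
        d += 1
    return True

def verify_factorization(n: int, factors: list) -> bool:
    """accept iff `factors` is exactly the sorted prime factorization of n."""
    if factors != sorted(factors):
        return False
    if not all(is_prime(p) for p in factors):
        return False
    product = 1
    for p in factors:
        product *= p
    return product == n

assert verify_factorization(12, [2, 2, 3])      # the unique correct answer
assert not verify_factorization(12, [3, 2, 2])  # wrong order: rejected
assert not verify_factorization(12, [2, 6])     # 6 isn't prime: rejected
```

contrast this with a proof checker: many distinct proofs of the same theorem pass verification, and the AI's choice among them is a channel. here the verifier admits exactly one accepted string per query, closing that channel.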