tammy's blog about
AI alignment, utopia, anthropics, and more;
when considering the mess that text encoding was before unicode (and notably UTF-8), one wouldn't be blamed for thinking that the problem of text encoding is basically solved. yet, there are many issues with unicode, some of which cannot be solved without discarding unicode entirely.
unicode is a character set with about 1.1 million codepoints (1,114,112, to be exact), of which currently about 144k are assigned to characters by the unicode consortium.
UTF-8 is by far the most common representation of unicode, where each codepoint is represented by a sequence of one to four bytes; notably, UTF-8 is compatible with ASCII: every valid ASCII sequence of bytes represents the same text when interpreted as UTF-8.
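to make the ASCII compatibility concrete, here is a small python sketch; the sample characters are just ones i picked for illustration.

```python
# ASCII text encodes to the exact same bytes under ASCII and under UTF-8
ascii_text = "hello"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# outside of ASCII, a single codepoint takes two to four bytes
samples = ["a", "é", "漢", "😀"]
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in byte_lengths.items():
    print(ch, n)
```

the byte count grows with the codepoint's value, which is why ASCII-heavy text stays compact in UTF-8.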
chinese and japanese use a wide collection of logographic characters (respectively hanzi and kanji) that no doubt have evolved throughout history in how people use them the same way every other piece of language has.
that is, until formal text encoding (including unicode) came along. by hard-assigning a fixed set of characters to codepoints, these standards leave users of those languages unable to create or even modify characters, even though the way kanji and hanzi work should allow combinations of radicals that don't currently exist, either to express new meanings or to simplify existing characters.
as a result, chinese and japanese are in effect partially dead languages in their written form.
one way unicode could go about this would be to encode those characters as geometric combinations of radicals, with maybe some extra bits of information to indicate various ways in which those radicals can combine.
that would be a lot of work, but it is theoretically feasible.
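as a rough sketch of what "combinations of radicals" could look like as data, here is a small python example. it borrows two of unicode's existing ideographic description characters (U+2FF0 and U+2FF1) as layout markers; those only describe a character's structure and don't by themselves give you a new encoded, renderable character. the `Compose` type, the `describe` function, and the made-up character are all hypothetical, not part of any standard.

```python
from dataclasses import dataclass
from typing import Union

# two of unicode's ideographic description characters, used here as layout tags
LEFT_TO_RIGHT = "\u2ff0"   # ⿰
ABOVE_TO_BELOW = "\u2ff1"  # ⿱


@dataclass(frozen=True)
class Compose:
    """hypothetical node: two components arranged according to a layout."""
    layout: str
    first: "Glyph"
    second: "Glyph"


# a glyph is either an atomic radical/character or a composition of glyphs
Glyph = Union[str, Compose]


def describe(glyph: Glyph) -> str:
    """flatten a composition tree into a prefix-notation description string."""
    if isinstance(glyph, str):
        return glyph
    return glyph.layout + describe(glyph.first) + describe(glyph.second)


# the existing character 好, described as 女 next to 子
good = Compose(LEFT_TO_RIGHT, "女", "子")
# a made-up character with no codepoint: 山 stacked above the composition for 好
novel = Compose(ABOVE_TO_BELOW, "山", good)

print(describe(good))
print(describe(novel))
```

the point is that `novel` is representable at all: nothing in this scheme requires anyone to have assigned it a codepoint first.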
emoji are images used as units of language, now commonplace in internet communication as you've no doubt noticed. beyond the original japanese emoji imported into unicode, however, people have started developing and using platforms that let users use their own custom images as emoji. unicode simply cannot solve this issue, and it is a critical one: language is now flexible enough that any small image file can be a piece of language, but unicode cannot expect to assign codepoints or even codepoint combinations to all of them.
another, even more long-term problem is future languages, be they evolutions of existing languages or [conlangs](https://en.wikipedia.org/wiki/Conlang).
one might feel like the latter problem simply cannot be solved except by allowing all communication to just embed images into text; yet, there is a much more efficient way to go about it. in an idea i'll call hashicode, raw pieces of text are a sequence of IPFS addresses, each followed by an arbitrary (but delimited) sequence of bytes. the addresses would point to sandboxable (such as in wasm; although maybe not, since it's bad) programs that can read the following bytes and then expose functions that can be called to query how to render said characters, but also which ones are whitespace, the writing direction, how to scale them, what category of character they fit in, etc.
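here is a minimal python sketch of what such a format could look like. everything concrete in it is an assumption made for illustration: a sha256 digest stands in for an IPFS address, payloads are delimited by a 4-byte big-endian length prefix, and a plain python class (`Utf8Program`) stands in for the sandboxed program that would actually be fetched by address.

```python
import hashlib
import struct
from typing import Dict, List, Tuple

ADDRESS_SIZE = 32  # assumption: a sha256 digest stands in for an IPFS address


def address_of(program_source: bytes) -> bytes:
    """content-address a program by hashing its bytes."""
    return hashlib.sha256(program_source).digest()


def encode(segments: List[Tuple[bytes, bytes]]) -> bytes:
    """serialize (address, payload) pairs, length-prefixing each payload."""
    out = bytearray()
    for address, payload in segments:
        assert len(address) == ADDRESS_SIZE
        out += address + struct.pack(">I", len(payload)) + payload
    return bytes(out)


def decode(data: bytes) -> List[Tuple[bytes, bytes]]:
    """parse the byte stream back into (address, payload) pairs."""
    segments = []
    offset = 0
    while offset < len(data):
        address = data[offset:offset + ADDRESS_SIZE]
        offset += ADDRESS_SIZE
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        segments.append((address, data[offset:offset + length]))
        offset += length
    return segments


class Utf8Program:
    """hypothetical stand-in for a sandboxed program resolved from an address."""
    direction = "ltr"

    def characters(self, payload: bytes) -> List[str]:
        return list(payload.decode("utf-8"))

    def is_whitespace(self, character: str) -> bool:
        return character.isspace()


# placeholder program bytes; a real system would hash the actual program
UTF8_ADDRESS = address_of(b"utf8-program-v0")
PROGRAMS: Dict[bytes, Utf8Program] = {UTF8_ADDRESS: Utf8Program()}

text = encode([(UTF8_ADDRESS, "hi there".encode("utf-8"))])
decoded = decode(text)
program = PROGRAMS[decoded[0][0]]
chars = program.characters(decoded[0][1])
```

a renderer would never interpret the payload bytes itself; it would only ask the program at that address questions like `is_whitespace` or `direction`.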
then, both in storage and in network communication, space can be saved by merging together identical addresses and storing only one copy of each used program (perhaps reference-counted).
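one way to picture that merging, as a hedged python sketch: repeated addresses get replaced by small indices into a table of unique addresses, and a counter tracks how many segments still reference each one. the function names and the 32-byte placeholder addresses are made up for the example.

```python
from collections import Counter
from typing import Dict, List, Tuple


def deduplicate(
    segments: List[Tuple[bytes, bytes]],
) -> Tuple[List[bytes], List[Tuple[int, bytes]]]:
    """replace each repeated address with an index into a table of unique ones."""
    table: List[bytes] = []
    index_of: Dict[bytes, int] = {}
    refs: List[Tuple[int, bytes]] = []
    for address, payload in segments:
        if address not in index_of:
            index_of[address] = len(table)
            table.append(address)
        refs.append((index_of[address], payload))
    return table, refs


def restore(
    table: List[bytes], refs: List[Tuple[int, bytes]]
) -> List[Tuple[bytes, bytes]]:
    """expand indices back into full addresses."""
    return [(table[index], payload) for index, payload in refs]


# placeholder addresses standing in for two different programs
latin = b"A" * 32
emoji = b"B" * 32
segments = [(latin, b"one"), (emoji, b"two"), (latin, b"three")]

table, refs = deduplicate(segments)
# how many segments reference each table entry; an entry (and its cached
# program) could be dropped once its count reaches zero
ref_counts = Counter(index for index, _ in refs)
```

the same table could back a local program cache, so that a program shared by many pieces of text is only stored once.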
it is not an easy solution, but it is elegant enough, and most importantly for a language encoding format, it can represent language people are using to communicate.
it also can survive the eventual end of the last global era in a way that a centralized authority like the unicode consortium can't.
unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor AI.