
posted on 2021-11-08

psi: a universal format for structured information

representing structured information in general is not a solved problem.

in this post i outline a format tentatively called "psi", proposing what i think is the best general solution to that problem.

rationale

there are some attempts at making a general-purpose information format (XML, JSON, CBOR), but psi aims to outshine them in several respects.

note that natural numbers can be used to reference even arrays of bytes: 0 can correspond to the empty sequence, the next 256 naturals can correspond to sequences of 1 byte, the next 65536 naturals can correspond to sequences of 2 bytes, etc.
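that correspondence is a bijective base-256 numbering, and it can be sketched in a few lines of python (the function names are mine, not part of any psi spec):

```python
def nat_to_bytes(n: int) -> bytes:
    """Map a natural number to a byte sequence, bijectively.

    0 -> b'' ; the next 256 naturals -> the one-byte sequences ;
    the next 65536 naturals -> the two-byte sequences ; etc.
    """
    length = 0
    block = 1  # number of sequences of the current length: 256**length
    while n >= block:
        n -= block       # skip past all sequences shorter than `length + 1`
        length += 1
        block *= 256
    return n.to_bytes(length, "big")

def bytes_to_nat(b: bytes) -> int:
    """Inverse of nat_to_bytes."""
    offset = sum(256 ** i for i in range(len(b)))  # count of all shorter sequences
    return offset + int.from_bytes(b, "big")
```

for example, `nat_to_bytes(0)` gives the empty sequence, `nat_to_bytes(257)` gives the first two-byte sequence `b'\x00\x00'`, and the two functions round-trip on every natural.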

in fact, natural numbers are the mathematically natural way to express items of a countable set; they're what such items naturally map to.

while binary formats can be hard to work with, the hope would be that tools and eventually operating systems come to support this format well enough that manipulating psi values would be no harder than manipulating usual plain text files, and in fact tools to manipulate structured objects can be a lot richer.

one specific use case i'd like to see psi used for is overcoming unicode; others include fact-sharing, proof manipulation, algebra, note-taking, sending around concepts, program representation, etc.

top-level grammar

at the top level, the units of information in psi are nodes, which are stateless values.

in BNF grammar, the syntax for psi nodes is pretty simple:

V ::= a | ( V+ )

that is, a node is either a symbol, or a sequence of at least one node; but this is far from saying everything about psi nodes.

one important aspect of them is that they can form any structure, not just hierarchies or directed acyclic graphs. when a node is in the ( V+ ) form (called a "physical" node), the elements forming its sequence (called "fields") can be any nodes, including the node itself or other nodes whose fields cycle back to it.
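as an illustration, here's a minimal python sketch of such a cycle, using mutable lists to stand in for physical nodes and a plain string for a virtual node (both representations are hypothetical, chosen only for the example; mutation is only used to tie the knot, and conceptually the result is an immutable value):

```python
# illustrative encodings: physical nodes as python lists, virtual nodes as strings
CONS = "cons"  # hypothetical virtual node naming a head-and-tail pair type

# an infinite alternating stream 0, 1, 0, 1, ... built from two nodes
# whose fields cycle back to each other
zeros = [CONS, 0, None]
ones = [CONS, 1, zeros]
zeros[2] = ones  # close the cycle: zeros' tail is ones, ones' tail is zeros
```

following the third field of either node forever walks the cycle, so the two nodes together denote an infinite structure with only finite data.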

the a form stands for a (countably) infinite set of "virtual" nodes, whose meaning comes from outside of the data being encoded.

together, these two possibilities allow psi to form very flexible structures, but also to reference externally meaningful ideas.

finally, as a convention, the first field of a physical node indicates the type of that node. so, for example, a fraction could be expressed by a node with three fields: a virtual node referring to the concept of a fraction, followed by the numerator and the denominator.
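in the same hypothetical python encoding as before (a string for the virtual node, a tuple for the physical node; purely illustrative), that fraction example looks like:

```python
FRACTION = "fraction"  # hypothetical virtual node for "the concept of a fraction"

# a physical node with three fields: the type first, then numerator and denominator
two_thirds = (FRACTION, 2, 3)
```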

middle-level structure

at the middle level, a psi node is represented by a pair of a module and an index.

a module can be either physical or virtual.

the index of a node indicates which node in the module it is.

physical modules must obey three constraints:

one great advantage of these limitations is that there is only one canonical way to construct any node, even in the presence of cycles. this makes it easy enough to test nodes for equality, and also to establish a consistent ordering between them or to hash them.

also, by holding a collection of nodes together with every physical module reachable by following the fields of those nodes, one holds exactly the information that can be reached from the nodes themselves: no node will be missing and, more importantly, no unreachable node will be held.

this provides manipulation of arbitrarily cyclic (immutable) structures without the need for garbage collection: since the modules form a directed acyclic graph between each other, simple reference counting can be used. this is useful not just for storing psi structures in memory, but also for storing them as IPFS Merkle DAGs, with the potential for convenient reuse of modules when constructing new structures.

virtual modules, on the other hand, are identified by their number of nodes and by a large randomly generated non-zero natural number serving as their ID. to represent a collection of external concepts, users are expected to generate a sequence of n random bits and use 2ⁿ + the number formed by those bits as the identifier. the 2ⁿ term makes it possible to tell, from the natural number alone, how many random bits it was generated with. the number of bits should be chosen as a function of the amount of computation expected to exist at the time of generation, to make collisions unlikely.

i'm not quite sure how large those randomly generated numbers should be; random UUIDs use 122 bits; i've seen some recommend using 160 bits, which seems reasonable to me.
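that generation scheme can be sketched in python, defaulting to 160 random bits (the function names are hypothetical, not part of any psi spec):

```python
import secrets

def fresh_virtual_module_id(bits: int = 160) -> int:
    """Generate a virtual-module ID: 2**bits plus `bits` random bits.

    The 2**bits offset records how many random bits were used, so that
    count can be recovered from the ID alone.
    """
    return (1 << bits) | secrets.randbits(bits)

def random_bit_count(module_id: int) -> int:
    """Recover how many random bits went into an ID."""
    # IDs lie in [2**bits, 2**(bits + 1)), so the bit length is bits + 1.
    return module_id.bit_length() - 1
```

because every ID lands in the half-open range [2ⁿ, 2ⁿ⁺¹), the bit count n is just the ID's bit length minus one, which is what makes the 2ⁿ offset self-describing.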

here are some examples:

i also think it's reasonable to reserve the following:

maybe virtual module 0 with any finite size should be considered to represent the values of modulo arithmetic of that size, and would also correspond to the unit value and booleans when that size is respectively 1 or 2. i'm not sure yet.

low-level structure

it would be cool to define compact bitwise binary formats for storage; and likewise, it could be cool to define standard memory layout structures so that arbitrary programs and libraries with shared memory can move around those objects freely without having to be aware of each other's API.

in practice, some liberties can be taken for efficiency: for example, physical modules can be stored right alongside their dependencies when those are small enough, to avoid creating too many references in storage. in memory, however, this isn't recommended.

syntactic convention

for convenience, i'm proposing a simple plain text notation for psi nodes:

a top level node is either of

in addition, inside of a node:

given this, and the convention of the first field of a node indicating its type, we can already establish:


unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitly mentioned, all content on this site was created by me; not by others nor AI.