representing structured information in general is not a solved problem.
in this post i outline a format tentatively called "psi", proposing what i think is the best general solution to that problem.
there are some attempts at making a general purpose information format (XML, JSON, CBOR) but psi outshines them in several respects:
note that natural numbers can be used to reference even arrays of bytes: 0 can correspond to the empty sequence, the next 256 naturals can correspond to sequences of 1 byte, the next 65536 naturals can correspond to sequences of 2 bytes, etc.
in fact, natural numbers are the mathematically natural ways to express items of a countable set; they're what those naturally map to.
while binary formats can be hard to work with, the hope would be that tools and eventually operating systems come to support this format well enough that manipulating psi values would be no harder than manipulating usual plain text files, and in fact tools to manipulate structured objects can be a lot richer.
one specific use case for which i'd like to see psi used is to overcome unicode; but others include fact-sharing, proof manipulation, algebra, note-taking, sending around concepts, program representation, etc
at the top level, the units of information in psi are nodes, which are stateless values.
in BNF grammar, the syntax for psi nodes is pretty simple:
V ::= a | ( V+ )
that is, a node is either a symbol, or a sequence of at least one node; but this is far from saying everything about psi nodes.
one important aspect of them is that they can form any structure, not just hierarchies or directed acyclic graph. when a node is in
V+ form (called a "physical" node), the elements forming its sequence (called "fields") can be any node, including the node itself or other nodes whose fields cycle back to this node.
a form stands for a (countably) infinite set of "virtual" nodes, whose meaning comes from outside of the data being encoded.
together, these two possibilities allow psi to form very flexible structures, but also to reference externally meaningful ideas.
finally, as a convention, the first field of a physical node by convention indicates the type of that node. so, for example, a fraction could be expressed by a node with three fields: a virtual node referring to the concept of a fraction, followed by the numerator and the denominator.
at the middle level, a psi node is represented by a pair of a module and an index.
a module can be either:
the index of a node indicates which node in the module it is.
physical modules must obey three constraints:
one great advantage of these limitations is that there is only one canonical way to construct any node, even considering cycles. this can make testing nodes for equality, but also establishing a consistent ordering between them or hashing them, easy enough.
also, by holding only a collection of nodes and every physical module that is referenced by following the fields of any of the physical modules, one holds exactly the amount of information that could be referenced by following the nodes themselves. no nodes will be missing and, more importantly, no unreachable node will be held.
this provides arbitrarily cyclical (immutable) structure manipulation, without need for a garbage collection: since the modules form a directed acyclic graph between each other, simple reference counters can be used. this is not just useful to store psi structures in memory, but also to store psi structures as IPFS Merkle DAGs with the potential for convenient reuse of modules when constructing new structures.
virtual modules on the other hand are identified by their amount of nodes, and by a large randomly-generated non-zero natural number serving as their ID. to represent a collection of external concepts, users are expected to generate a sequence of n random bits, and use 2ⁿ + the number formed by those digits, as the identifier. the 2ⁿ allows one to tell from a natural number how many random bits it has been generated with. the number of bits to generate should be chosen as a function of the expected amount of computation existing at the time of generation, to ensure collisions are avoided.
here are some examples:
2595793418667508392727470919165783019666420633976with 128 nodes, represents the set of all 128 ASCII codepoints
1757914673057156744813439560490372169016067640867with 1,111,998 nodes represents the set of Unicode characters, in order of increasing codepoint values (or maybe private use areas should not be counted?)
i also think it's reasonable to reserve the following:
0with size 1 consists of the unit
0with size 2 represents the logic values false and true, assigned respectively to indices 0 and 1
0with an infinite size represents the set of natural numbers, each mapped to itself as index.
maybe virtual module
0 with any finite size should be considered to represent the values of modulo arithmetic of that size, and would also correspond to the unit value and booleans when that size is respectively 1 or 2. i'm not sure yet.
it would be cool to define compact bitwise binary formats for storage; and likewise, it could be cool to define standard memory layout structures so that arbitrary programs and libraries with shared memory can move around those objects freely without having to be aware of each other's API.
in practice, some liberties can be taken for efficiency: for example, physical modules can be stored right alongside their dependencies when those are small enough, to avoid creating too many references in storage. in memory, however, this isn't recommended.
for convenience, i'm proposing a simple plain text notation for psi nodes:
a top level node is either of
(x1 x2 x3…)where
#93418#128#97is value 97 in virtual module 93418 of size 128
#93418#97is value 97 in infinite virtual module 93418
#93418is syntactic sugar for
[…]xall alias the node to the name
x. the node being aliased can be a virtual node, but then the aliasing name must be on the left. in addition, if
#s are present between the name and the node being aliased, they indicate a local (as opposed to global) scope, "climbing back" as many nodes as there are
#s, minus one; so
x#(a)is accessible only in the current node and its sub-nodes, and
x##(a)is accessible in the parent node and its sub-nodes.
xrefers to either an alias or, if none, an external node
in addition, inside of a node:
#means the node being listed itself,
##its parent (in the syntactic hierarchy),
###its parent's parent, etc
[x1 x2 x3…]inside a node doesn't actually add that node as a field, but instead allows that node to float freely; such a node must contain at least one
given this, and the convention of the first field of a node indicating its type, we can already establish:
(#)is "type": the type of types
((#))is a type, with no other information being provided
(((#)) x1 x2 x3…)could perhaps be a simple collection of values
(t (t (t (t ####))))is a circular linked list which, as indicated above, reduces to
(t #0#1 (t #0#2 (t #0#1 (t #0#2 ####))))reduces to
(t #0#1 (t #0#2 ##))and
(a (b) (b) (c ##) (c ##))reduces to
(a x(b) x y(c ##) y)