tammy's blog about

AI alignment,
utopia,
anthropics,
and more;

representing structured information in general is not a solved problem.

in this post i outline a format tentatively called "psi", proposing what i think is the best general solution to that problem.

there are some attempts at making a general purpose information format (XML, JSON, CBOR) but psi outshines them in several respects:

- psi is designed to be very simple
- given a structure of modules, new values can be created that reference values deep inside other modules easily (to annotate or simply reuse pieces of another psi structure)
- the format isn't burdened with either a scarce amount of privileged first-class types, nor a central authority tasked with assigning small integer IDs to specific meanings; long random IDs give everyone the same ability to declare concepts or collections thereof
- it is designed to be easily transmitted in a variety of substrates, and binary encodings are encouraged
- it isn't bogged down in specifying endianness or specific integer sizes (typically 8, 16, 32, and 64 bits); instead, arbitrarily large natural numbers are used for everything, and specific platforms can have their own limitations in the size of indices and modules they support.

note that natural numbers can be used to reference even arrays of bytes: 0 can correspond to the empty sequence, the next 256 naturals can correspond to sequences of 1 byte, the next 65536 naturals can correspond to sequences of 2 bytes, etc.

in fact, natural numbers are the mathematically natural ways to express items of a countable set; they're what those naturally map to.

while binary formats can be hard to work with, the hope would be that tools and eventually operating systems come to support this format well enough that manipulating psi values would be no harder than manipulating usual plain text files, and in fact tools to manipulate structured objects can be a lot richer.

one specific use case for which i'd like to see psi used is to overcome unicode; but others include fact-sharing, proof manipulation, algebra, note-taking, sending around concepts, program representation, etc

at the top level, the units of information in psi are *nodes*, which are stateless values.

in BNF grammar, the syntax for psi nodes is pretty simple:

`V ::= a | ( V+ )`

that is, a node is either a symbol, or a sequence of at least one node; but this is far from saying everything about psi nodes.

one important aspect of them is that they can form any structure, not just hierarchies or directed acyclic graph. when a node is in `V+`

form (called a "physical" node), the elements forming its sequence (called "fields") can be any node, including the node itself or other nodes whose fields cycle back to this node.

the `a`

form stands for a (countably) infinite set of "virtual" nodes, whose meaning comes from outside of the data being encoded.

together, these two possibilities allow psi to form very flexible structures, but also to reference externally meaningful ideas.

finally, as a convention, the first field of a physical node by convention indicates the type of that node. so, for example, a fraction could be expressed by a node with three fields: a virtual node referring to the concept of a fraction, followed by the numerator and the denominator.

at the middle level, a psi node is represented by a pair of a **module** and an **index**.

a module can be either:

- a "virtual module" containing either a finite non-zero amount, or a countably infinite amount, of virtual nodes
- a "physical module" containing a finite collection of physical nodes

the index of a node indicates which node in the module it is.

physical modules must obey three constraints:

- they must be the result of graph condensation: if we draw up the directed graph from nodes to other nodes they reference through fields, then every node is a strongly connected component, while the set of physical modules themselves form a directed acyclic graph.
- nodes inside them must be "unified" such that if exploring two nodes yields the same observable structure, these two nodes must be considered equal and referenced by the same index. a result of this is that for example cycles where every entry point is isomorphic to any other are "collapsed" into a single node. (the reason for this is that, if nodes weren't unified in this way, there could be weird effects where two people creating the same structure could end up disagreeing on which node is equal to which based on implementation details; reducing equality to observational equality is just simpler)
- finally, after these two steps, the nodes can be sorted using a standardized comparison method (not explained here, but not hard to design).

one great advantage of these limitations is that there is only one canonical way to construct any node, even considering cycles. this can make testing nodes for equality, but also establishing a consistent ordering between them or hashing them, easy enough.

also, by holding only a collection of nodes and every physical module that is referenced by following the fields of any of the physical *modules*, one holds exactly the amount of information that could be referenced by following the nodes themselves. no nodes will be missing and, more importantly, no unreachable node will be held.

this provides arbitrarily cyclical (immutable) structure manipulation, without need for a garbage collection: since the modules form a directed acyclic graph between each other, simple reference counters can be used. this is not just useful to store psi structures in memory, but also to store psi structures as IPFS Merkle DAGs with the potential for convenient reuse of modules when constructing new structures.

virtual modules on the other hand are identified by their amount of nodes, and by a large randomly-generated non-zero natural number serving as their ID. to represent a collection of external concepts, users are expected to generate a sequence of n random bits, and use 2ⁿ + the number formed by those digits, as the identifier. the 2ⁿ allows one to tell from a natural number how many random bits it has been generated with. the number of bits to generate should be chosen as a function of the expected amount of computation existing at the time of generation, to ensure collisions are avoided.

i'm not quite sure how large those randomly generated numbers should be; random UUIDs use 122 bits; i've seen some recommend using 160 bits, which seems reasonable to me.

here are some examples:

- let's say the virtual module number
`2595793418667508392727470919165783019666420633976`

with 128 nodes, represents the set of all 128 ASCII codepoints - in the same way, let's say the number
`1757914673057156744813439560490372169016067640867`

with 1,111,998 nodes represents the set of Unicode characters, in order of increasing codepoint values (or maybe private use areas should not be counted?)

i also think it's reasonable to reserve the following:

- virtual module
`0`

with size 1 consists of the unit - virtual module
`0`

with size 2 represents the logic values false and true, assigned respectively to indices 0 and 1 - virtual module
`0`

with an infinite size represents the set of natural numbers, each mapped to itself as index.

maybe virtual module `0`

with any finite size should be considered to represent the values of modulo arithmetic of that size, and would also correspond to the unit value and booleans when that size is respectively 1 or 2. i'm not sure yet.

it would be cool to define compact bitwise binary formats for storage; and likewise, it could be cool to define standard *memory layout structures* so that arbitrary programs and libraries with shared memory can move around those objects freely without having to be aware of each other's API.

in practice, some liberties can be taken for efficiency: for example, physical modules can be stored right alongside their dependencies when those are small enough, to avoid creating too many references in storage. in memory, however, this isn't recommended.

for convenience, i'm proposing a simple plain text notation for psi nodes:

a top level node is either of

`(x1 x2 x3…)`

where`xi`

are nodes`#93418#128#97`

is value 97 in virtual module 93418 of size 128`#93418#97`

is value 97 in infinite virtual module 93418`#93418`

is syntactic sugar for`#93418#1#0`

`x(…)`

,`(…)x`

,`x[…]`

,`[…]x`

all alias the node to the name`x`

. the node being aliased can be a virtual node, but then the aliasing name*must*be on the left. in addition, if`#`

s are present between the name and the node being aliased, they indicate a local (as opposed to global) scope, "climbing back" as many nodes as there are`#`

s, minus one; so`x(a)`

is global,`x#(a)`

is accessible only in the current node and its sub-nodes, and`x##(a)`

is accessible in the parent node and its sub-nodes.- a standalone name like
`x`

refers to either an alias or, if none, an external node

in addition, inside of a node:

`#`

means the node being listed itself,`##`

its parent (in the syntactic hierarchy),`###`

its parent's parent, etc`[x1 x2 x3…]`

inside a node doesn't actually add that node as a field, but instead allows that node to float freely; such a node*must*contain at least one`#`

field

given this, and the convention of the first field of a node indicating its type, we can already establish:

`(#)`

is "type": the type of types`((#))`

is a type, with no other information being provided`(((#)) x1 x2 x3…)`

could perhaps be a simple collection of values`(t (t (t (t ####))))`

is a circular linked list which, as indicated above, reduces to`(t #)`

- likewise,
`(t #0#1 (t #0#2 (t #0#1 (t #0#2 ####))))`

reduces to`(t #0#1 (t #0#2 ##))`

and`(a (b) (b) (c ##) (c ##))`

reduces to`(a x(b) x y(c ##) y)`

unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.

unless explicitely mentioned, all content on this site was created by me; not by others nor AI.