avatar

posted on 2020-12-29

A canonical and efficient byte-encoding for ints

the internet has a bunch of efficient byte-encodings for fixed-length integers, from LEB128 to arguably UTF-8; but none of them (except the trivial encoding of always encoding an integer as its fixed byte sizeof) seem to be canonical.

what i mean by canonical is that every integer has only one possible representation, and every representation means a single integer.

after some work, i've devised a mildly convoluted but, as far as i can tell, truly canonical encoding for int16, int32, and int64.

the formula i rely on is that 2ⁿ = 1 + (2¹ + … + 2ⁿ¯¹) = 2ⁱ + (2ⁱ + … 2ⁿ¯¹); so, to cover the entire, say, 2³² space of int32, i need to fit neatly into bytes variable bit patterns of length i through 31 (and have two patterns for i).

here are the bit patterns for int16 and int32 (int64, being larger, is at the bottom of the page).

each line mentions the number of variable bits you can check that the sum of two to the power of each of those account for every possible value; for example, for int16, 2⁷ + (2⁷ + … + 2¹⁵) = 2¹⁶

int16:
   7: 0.......
*  7: 10000000 0.......
   8: 10000001 ........
   9: 1000001. ........
  10: 100001.. ........
  11: 10001... ........
  12: 1001.... ........
  13: 101..... ........
  14: 11...... ........
* 15: 10000000 1....... ........

int32:
   6: 00......
*  6: 01000000 00......
*  7: 10000000 0.......
   8: 01000001 ........
   9: 0100001. ........
  10: 010001.. ........
  11: 01001... ........
  12: 0101.... ........
  13: 011..... ........
* 14: 01000000 01...... ........
* 15: 10000000 1....... ........
  16: 10000001 ........ ........
  17: 1000001. ........ ........
  18: 100001.. ........ ........
  19: 10001... ........ ........
  20: 1001.... ........ ........
  21: 101..... ........ ........
* 22: 01000000 10...... ........(×2)
* 23: 11000000 0....... ........(×2)
  24: 11000001 ........ ........(×2)
  25: 1100001. ........ ........(×2)
  26: 110001.. ........ ........(×2)
  27: 11001... ........ ........(×2)
  28: 1101.... ........ ........(×2)
  29: 111..... ........ ........(×2)
* 30: 01000000 11...... ........(×3)
* 31: 11000000 1....... ........(×3)

with K = 1 for int64, K = 2 for int32, K = 3 for int64:

when the lower 8-K bits of the first byte are 0 (a pattern indicated with a * next to the line), then the upper K bits of the first byte and the following 1 to 3 bits of the second byte (depending on the value of K) determine the amount of extra number-encoding bytes, while the remaining bits the second byte are the initial bits of the number.

when the lower 8-K bits are not 0, then the K upper bits indicate the number of number-encoding bytes after the first byte, and the bits after the first 1 in the lower 8-K bits are the initial bits of the number.

my explanation is probably not very clear, but the pattern should be visible if you look at it enough; just keep in mind that lines tagged with a * work different from those not tagged that way, and that the first K bits are a special tag.

to efficiently encode a signed integer (optimizing for values near 0), encode:

below is the bit patterns for int64:

int64:
   5: 000.....
*  5: 00100000 000.....
*  6: 01000000 00......
*  7: 10000000 0.......
   8: 00100001 ........
   9: 0010001. ........
  10: 001001.. ........
  11: 00101... ........
  12: 0011.... ........
* 13: 00100000 001..... ........
* 14: 01000000 01...... ........
* 15: 10000000 1....... ........
  16: 01000001 ........ ........
  17: 0100001. ........ ........
  18: 010001.. ........ ........
  19: 01001... ........ ........
  20: 0101.... ........ ........
* 21: 00100000 010..... ........(×2)
* 22: 01000000 10...... ........(×2)
* 23: 10100000 0....... ........(×2)
  24: 01100001 ........ ........(×2)
  25: 0110001. ........ ........(×2)
  26: 011001.. ........ ........(×2)
  27: 01101... ........ ........(×2)
  28: 0111.... ........ ........(×2)
* 29: 00100000 011..... ........(×3)
* 30: 01000000 11...... ........(×3)
* 31: 10100000 1....... ........(×3)
  32: 10000001 ........ ........(×3)
  33: 1000001. ........ ........(×3)
  34: 100001.. ........ ........(×3)
  35: 10001... ........ ........(×3)
  36: 1001.... ........ ........(×3)
* 37: 00100000 100..... ........(×4)
* 38: 01100000 00...... ........(×4)
* 39: 11000000 0....... ........(×4)
  40: 10100001 ........ ........(×4)
  41: 1010001. ........ ........(×4)
  42: 101001.. ........ ........(×4)
  43: 10101... ........ ........(×4)
  44: 1011.... ........ ........(×4)
* 45: 00100000 101..... ........(×5)
* 46: 01100000 01...... ........(×5)
* 47: 11000000 1....... ........(×5)
  48: 11000001 ........ ........(×5)
  49: 1100001. ........ ........(×5)
  50: 110001.. ........ ........(×5)
  51: 11001... ........ ........(×5)
  52: 1101.... ........ ........(×5)
* 53: 00100000 110..... ........(×6)
* 54: 01100000 10...... ........(×6)
* 55: 11100000 0....... ........(×6)
  56: 11100001 ........ ........(×6)
  57: 1110001. ........ ........(×6)
  58: 111001.. ........ ........(×6)
  59: 11101... ........ ........(×6)
  60: 1111.... ........ ........(×6)
* 61: 00100000 111..... ........(×7)
* 62: 01100000 11...... ........(×7)
* 63: 11100000 1....... ........(×7)

posted on 2020-12-29

CC_ -1 License unless otherwise specified on individual pages, all posts on this website are licensed under the CC_-1 license.
unless explicitely mentioned, all content on this site was created by me; not by others nor AI.