Graut/Serialized S-expressions

From HalfgeekKB
Jump to navigation Jump to search
This is a reworking of a similar scheme discussed in Cifl/Protocifl/Name-encoded_trees is one possibility for a byte-string representation. That encoding is designed specifically for S-expressions with a list at the root and only byte strings and other lists as elements.

Synopsis

For various purposes, it will be useful to convert an S-expression into an unambiguous string of bytes.

A simple base encoding provides for the serialization of three kinds of element:

  • A list, which is a sequence of zero or more other elements
  • An atom, which in this instance is a string that contains zero or more bytes and where the value 0x00 is a valid byte and not a terminator
  • A deser (or deserialization operator), which is a cue for a deserializer to treat the following list specially.

An abstract alphabet

The serialization uses an alphabet of 261 possible symbols. A symbol is any byte or one of the following non-byte values:

  • An atom terminator, ‡ in this description
  • A list starter, ↓ in this description
  • An list terminator, ↑ in this description
  • An abbreviation for ↑↓, ↕ in this description
  • A deser operator, € in this description

Note that these symbols are abstract and not intended to have any association with their Unicode values.

Canonical serialization/deserialization

Consider the list

(alpha #(+ 4 8) mu #(= 1 5) omega)

We do not wish to serialize unresolved function calls, so those calls are forced. The result is something like

(
  alpha
  #( $((SYSTEM number) "parse-canonical") "12" "1" )
  mu
  #( $((SYSTEM boolean) "parse-canonical") "false" )
  omega
)


The values "alpha", "mu", and "omega" evaluate to atoms. The other two items are neither lists nor atoms but extended data type values—for the types (SYSTEM number) and (SYSTEM boolean), respectively.

Consider the possibility that each type has a "parse-canonical" function (this name may or may not be fictional) that accepts a sequence of zero or more simple values, where a simple value is an atom or a list of zero or more simple values. The function ideally rejects as invalid any input that is not exactly of a unique form (for example, accepting "false" but rejecting "FALSE").

The human-readable serialization of an extended data type value is a call to its type's "parse-canonical" function. Unfortunately, this only substitutes more unresolved function calls for the previous ones (and adds a layer of variable resolution on top).

For the machine-readable serialization, we might introduce a fictional sigil, which we'll call "¢", that cannot represent function calls in general but can represent a type's "parse-canonical" call specifically. Then, the above might be represented

(
  alpha
  ¢( (SYSTEM number) "12" "1" )
  mu
  ¢( (SYSTEM boolean) "false" )
  omega
)

This is sufficient for us to proceed.

Serialization

This pseudocode covers initial serialization.

function serialize_list(stream, list)
  output ↓ to stream
  for element in list
    if element is a list
      serialize_list(stream, element)
    else if element is an atom
      output bytes of element to stream
      output ‡ to stream
    else # element is extended data type value
      output € to stream
      let t = list expressing data type name of element
      # e.g. t = ["SYSTEM", "number"]
      let p = list expressing deser parameters for element
      # e.g. p = ["12", "1"]
      let d = copy of p with t prepended as a single value
      # e.g. d = [["SYSTEM", "number"], "12", "1"]
      serialize_list(stream, d)
  output ↑ to stream

interface deser_marker
  property parameters is an appendable list

function deserialize_list(stream)
  let stack be a mutable stack
  push [] onto stack

  while stream can be read
    let current_atom = ""
    read symbol from stream

    while symbol is a byte
      append symbol to current_atom
      read symbol from stream

    if symbol is ‡
      let t = item at top of stack
      append current_atom to t
    else if symbol is ↓
      let l = []
      let t = item at top of stack
      append l to t
      push l onto stack
    else if symbol is ↑
      pop stack
      assert that stack is not empty (otherwise, too many ascents)
    else if symbol is €
      read symbol from stream; assert that it is ↓
      let d = new deser_marker
      let t = item at top of stack
      append d to t
      push d.parameters onto stack

  pop stack
  let all = top of stack
  pop stack
  assert that stack is empty
  assert that all contains exactly one element
  let l = all[0]
  assert that l is a list
  return l
  # Traverse result to resolve deser markers as needed    

Now, consider this list:

(hello #(int 12) fizz #(false) (a "" () b c))

As before, the first thing to do is resolve the function calls, inserting the bogus ¢ sigil as needed:

(hello ¢((SYSTEM number) "12" "1") fizz ¢((SYSTEM boolean) "false") (a "" () b c))

The following is equivalent; some of the implicit quotes have been added:

("hello" ¢((SYSTEM number) "12" "1") "fizz" ¢((SYSTEM boolean) "false") ("a" "" () "b" "c"))

Serialization would go something like:

 
  # The root
  ( "hello" ¢ ( ( SYSTEM  number ) "12" "1" ) "fizz" ¢ ( ( SYSTEM  boolean ) "false" ) ( "a" "" ( ) "b" "c" ) )
  # Convert byte strings to terminated form
  (  hello‡ ¢ ( ( SYSTEM‡ number‡ ) 12‡  1‡ )  fizz‡ ¢ ( ( SYSTEM‡ boolean‡ ) false‡ ) (  a‡  ‡ ( )  b‡  c‡ ) )
  # Convert ascents and descents
  ↓  hello‡ ¢ ↓ ↓ SYSTEM‡ number‡ ↑ 12‡  1‡ ↑  fizz‡ ¢ ↓ ↓ SYSTEM‡ boolean‡ ↑ false‡ ↑ ↓  a‡  ‡ ↓ ↑  b‡  c‡ ↑ ↑
  # Convert the construction indicator
  ↓  hello‡ € ↓ ↓ SYSTEM‡ number‡ ↑ 12‡  1‡ ↑  fizz‡ € ↓ ↓ SYSTEM‡ boolean‡ ↑ false‡ ↑ ↓  a‡  ‡ ↓ ↑  b‡  c‡ ↑ ↑
  # The same displayed without spaces
  ↓hello‡€↓↓SYSTEM‡number‡↑12‡1‡↑fizz‡€↓↓SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑

Compression

function compress_serialized_list(symbols)
  with symbols
    # deser implicitly descends twice
    replace all instances of €↓↓ with €
    # a non-empty byte string implicitly ends at the next non-byte
    remove all instances of ‡ that immediately follow a byte and immediately precede ↑, ↓, or €
    # the root is implicitly a list
    remove one ↓ at beginning
    # the list is implicitly fully closed
    remove all ↑ at end
    # adjacent lists are partly abbreviated
    replace ↑↓ with ↕
  return symbols

 
# (changes are called out with angle brackets)

# Replace €↓↓ with €
# (constructor implicitly starts with two descents)
↓hello‡<€↓↓>SYSTEM‡number‡↑12‡1‡↑fizz‡<€↓↓>SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑
↓hello‡<€  >SYSTEM‡number‡↑12‡1‡↑fizz‡<€  >SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑
↓hello‡€SYSTEM‡number‡↑12‡1‡↑fizz‡€SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑

# Wherever ‡ follows a byte and precedes [↑↓€], omit it as implicit.
# (Note that it must actually follow a byte—this doesn't apply with an empty string.)
↓hello<‡>€SYSTEM‡number<‡>↑12‡1<‡>↑fizz<‡>€SYSTEM‡boolean<‡>↑false<‡>↑↓a‡‡↓↑b‡c<‡>↑↑
↓hello< >€SYSTEM‡number< >↑12‡1< >↑fizz< >€SYSTEM‡boolean< >↑false< >↑↓a‡‡↓↑b‡c< >↑↑
↓hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c↑↑

# Remove initial ↓ and final ↑+
# (root is implicitly a list and is implicitly closed at any depth at the end)
<↓>hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c<↑↑>
< >hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c<  >
hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c

# Replace ↑↓ with ↕
hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false<↑↓>a‡‡↓↑b‡c
hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false<↕ >a‡‡↓↑b‡c
hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↕a‡‡↓↑b‡c

To encode exclusively with characters in the set [A-Za-z0-9_], this algorithm is provided:

Assign these values to the elements of the alphabet:

  • A byte as its numeric value (0 .. 255)
  • ‡ as 357 (i.e., 260 + 'a')
  • ↓ as 358
  • ↑ as 359
  • ↕ as 360
  • € as 361

Replace any value outside [A-Za-z_] with its long form, where the long form is a digit [0-9] followed by an uppercase letter [A-Z] and is determined as follows:

  • The value cannot be in [A-Za-z_]; these cannot be encoded in long form.
  • Add 250 - '0'; i.e., add 202. (This maps '0' to the coded value 250.)
  • Take this value modulo 260.
  • Let the digit index and the letter index be the value modulo 10 and the value integer-divided by 10, respectively.
  • The digit and the letter are '0' + the digit index and 'A' + the letter index, respectively.

Because of the second step, the digits [0-9] are encoded as themselves followed by 'Z' (i.e. '0Z' .. '9Z').

The values assigned to the special symbols would be the long forms of lowercase letters if those were allowed as input.

This result can be compressed slightly: Replace ([0-9])Z with \1 unless it is followed by [A-Z]. A digit in the input that is not followed by [A-Z] can simply imply the 'Z'.


Naturally, decompression reverses compression:

  • Replace all ↕ with ↑↓.
  • Add a ↓ at the beginning.
  • Add a ↑ at the end.
    • This is needed for the following step to work in all cases.
  • Insert a ‡ after any byte value that is followed by ↑, ↓, or €.
  • Replace all € with €↓↓.
  • Finally, add an additional ↑ at the end for each remaining open list.