Graut/Serialized S-expressions
- This is a reworking of a similar scheme discussed in Cifl/Protocifl/Name-encoded_trees is one possibility for a byte-string representation. That encoding is designed specifically for S-expressions with a list at the root and only byte strings and other lists as elements.
Contents
Synopsis
For various purposes, it will be useful to convert an S-expression into an unambiguous string of bytes.
A simple base encoding provides for the serialization of three kinds of element:
- A list, which is a sequence of zero or more other elements
- An atom, which in this instance is a string that contains zero or more bytes and where the value 0x00 is a valid byte and not a terminator
- A deser (or deserialization operator), which is a cue for a deserializer to treat the following list specially.
An abstract alphabet
The serialization uses an alphabet of 261 possible symbols. A symbol is any byte or one of the following non-byte values:
- An atom terminator, ‡ in this description
- A list starter, ↓ in this description
- An list terminator, ↑ in this description
- An abbreviation for ↑↓, ↕ in this description
- A deser operator, € in this description
Note that these symbols are abstract and not intended to have any association with their Unicode values.
Canonical serialization/deserialization
Consider the list
(alpha #(+ 4 8) mu #(= 1 5) omega)
We do not wish to serialize unresolved function calls, so those calls are forced. The result is something like
( alpha #( $((SYSTEM number) "parse-canonical") "12" "1" ) mu #( $((SYSTEM boolean) "parse-canonical") "false" ) omega )
The values "alpha", "mu", and "omega" evaluate to atoms. The other two items are neither lists nor atoms but extended data type values—for the types (SYSTEM number) and (SYSTEM boolean), respectively.
Consider the possibility that each type has a "parse-canonical" function (this name may or may not be fictional) that accepts a sequence of zero or more simple values, where a simple value is an atom or a list of zero or more simple values. The function ideally rejects as invalid any input that is not exactly of a unique form (for example, accepting "false" but rejecting "FALSE").
The human-readable serialization of an extended data type value is a call to its type's "parse-canonical" function. Unfortunately, this only substitutes more unresolved function calls for the previous ones (and adds a layer of variable resolution on top).
For the machine-readable serialization, we might introduce a fictional sigil, which we'll call "¢", that cannot represent function calls in general but can represent a type's "parse-canonical" call specifically. Then, the above might be represented
( alpha ¢( (SYSTEM number) "12" "1" ) mu ¢( (SYSTEM boolean) "false" ) omega )
This is sufficient for us to proceed.
Serialization
This pseudocode covers initial serialization.
function serialize_list(stream, list) output ↓ to stream for element in list if element is a list serialize_list(stream, element) else if element is an atom output bytes of element to stream output ‡ to stream else # element is extended data type value output € to stream let t = list expressing data type name of element # e.g. t = ["SYSTEM", "number"] let p = list expressing deser parameters for element # e.g. p = ["12", "1"] let d = copy of p with t prepended as a single value # e.g. d = [["SYSTEM", "number"], "12", "1"] serialize_list(stream, d) output ↑ to stream interface deser_marker property parameters is an appendable list function deserialize_list(stream) let stack be a mutable stack push [] onto stack while stream can be read let current_atom = "" read symbol from stream while symbol is a byte append symbol to current_atom read symbol from stream if symbol is ‡ let t = item at top of stack append current_atom to t else if symbol is ↓ let l = [] let t = item at top of stack append l to t push l onto stack else if symbol is ↑ pop stack assert that stack is not empty (otherwise, too many ascents) else if symbol is € read symbol from stream; assert that it is ↓ let d = new deser_marker let t = item at top of stack append d to t push d.parameters onto stack pop stack let all = top of stack pop stack assert that stack is empty assert that all contains exactly one element let l = all[0] assert that l is a list return l # Traverse result to resolve deser markers as needed
Now, consider this list:
(hello #(int 12) fizz #(false) (a "" () b c))
As before, the first thing to do is resolve the function calls, inserting the bogus ¢ sigil as needed:
(hello ¢((SYSTEM number) "12" "1") fizz ¢((SYSTEM boolean) "false") (a "" () b c))
The following is equivalent; some of the implicit quotes have been added:
("hello" ¢((SYSTEM number) "12" "1") "fizz" ¢((SYSTEM boolean) "false") ("a" "" () "b" "c"))
Serialization would go something like:
# The root ( "hello" ¢ ( ( SYSTEM number ) "12" "1" ) "fizz" ¢ ( ( SYSTEM boolean ) "false" ) ( "a" "" ( ) "b" "c" ) ) # Convert byte strings to terminated form ( hello‡ ¢ ( ( SYSTEM‡ number‡ ) 12‡ 1‡ ) fizz‡ ¢ ( ( SYSTEM‡ boolean‡ ) false‡ ) ( a‡ ‡ ( ) b‡ c‡ ) ) # Convert ascents and descents ↓ hello‡ ¢ ↓ ↓ SYSTEM‡ number‡ ↑ 12‡ 1‡ ↑ fizz‡ ¢ ↓ ↓ SYSTEM‡ boolean‡ ↑ false‡ ↑ ↓ a‡ ‡ ↓ ↑ b‡ c‡ ↑ ↑ # Convert the construction indicator ↓ hello‡ € ↓ ↓ SYSTEM‡ number‡ ↑ 12‡ 1‡ ↑ fizz‡ € ↓ ↓ SYSTEM‡ boolean‡ ↑ false‡ ↑ ↓ a‡ ‡ ↓ ↑ b‡ c‡ ↑ ↑ # The same displayed without spaces ↓hello‡€↓↓SYSTEM‡number‡↑12‡1‡↑fizz‡€↓↓SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑
Compression
function compress_serialized_list(symbols) with symbols # deser implicitly descends twice replace all instances of €↓↓ with € # a non-empty byte string implicitly ends at the next non-byte remove all instances of ‡ that immediately follow a byte and immediately precede ↑, ↓, or € # the root is implicitly a list remove one ↓ at beginning # the list is implicitly fully closed remove all ↑ at end # adjacent lists are partly abbreviated replace ↑↓ with ↕ return symbols
# (changes are called out with angle brackets) # Replace €↓↓ with € # (constructor implicitly starts with two descents) ↓hello‡<€↓↓>SYSTEM‡number‡↑12‡1‡↑fizz‡<€↓↓>SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑ ↓hello‡<€ >SYSTEM‡number‡↑12‡1‡↑fizz‡<€ >SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑ ↓hello‡€SYSTEM‡number‡↑12‡1‡↑fizz‡€SYSTEM‡boolean‡↑false‡↑↓a‡‡↓↑b‡c‡↑↑ # Wherever ‡ follows a byte and precedes [↑↓€], omit it as implicit. # (Note that it must actually follow a byte—this doesn't apply with an empty string.) ↓hello<‡>€SYSTEM‡number<‡>↑12‡1<‡>↑fizz<‡>€SYSTEM‡boolean<‡>↑false<‡>↑↓a‡‡↓↑b‡c<‡>↑↑ ↓hello< >€SYSTEM‡number< >↑12‡1< >↑fizz< >€SYSTEM‡boolean< >↑false< >↑↓a‡‡↓↑b‡c< >↑↑ ↓hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c↑↑ # Remove initial ↓ and final ↑+ # (root is implicitly a list and is implicitly closed at any depth at the end) <↓>hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c<↑↑> < >hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c< > hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↑↓a‡‡↓↑b‡c # Replace ↑↓ with ↕ hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false<↑↓>a‡‡↓↑b‡c hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false<↕ >a‡‡↓↑b‡c hello€SYSTEM‡number↑12‡1↑fizz€SYSTEM‡boolean↑false↕a‡‡↓↑b‡c
To encode exclusively with characters in the set [A-Za-z0-9_], this algorithm is provided:
Assign these values to the elements of the alphabet:
- A byte as its numeric value (0 .. 255)
- ‡ as 357 (i.e., 260 + 'a')
- ↓ as 358
- ↑ as 359
- ↕ as 360
- € as 361
Replace any value outside [A-Za-z_] with its long form, where the long form is a digit [0-9] followed by an uppercase letter [A-Z] and is determined as follows:
- The value cannot be in [A-Za-z_]; these cannot be encoded in long form.
- Add 250 - '0'; i.e., add 202. (This maps '0' to the coded value 250.)
- Take this value modulo 260.
- Let the digit index and the letter index be the value modulo 10 and the value integer-divided by 10, respectively.
- The digit and the letter are '0' + the digit index and 'A' + the letter index, respectively.
Because of the second step, the digits [0-9] are encoded as themselves followed by 'Z' (i.e. '0Z' .. '9Z').
The values assigned to the special symbols would be the long forms of lowercase letters if those were allowed as input.
This result can be compressed slightly: Replace ([0-9])Z with \1 unless it is followed by [A-Z]. A digit in the input that is not followed by [A-Z] can simply imply the 'Z'.
Naturally, decompression reverses compression:
- Replace all ↕ with ↑↓.
- Add a ↓ at the beginning.
- Add a ↑ at the end.
- This is needed for the following step to work in all cases.
- Insert a ‡ after any byte value that is followed by ↑, ↓, or €.
- Replace all € with €↓↓.
- Finally, add an additional ↑ at the end for each remaining open list.