Add Serialisation doc.

2020-02-10 17:52:22 +00:00 · 2020-02-10 17:52:22 +00:00 · c42d12b7d9
commit c42d12b7d9
--- a/README.md
+++ b/README.md
@ -18,6 +18,12 @@ updating your local source. Now detects and builds for Pyboard D. See [docs](./f
 [Easy installation](./PICOWEB.md) guide. Simplify installing this on
 MicroPython hardware platforms under official MicroPython firmware.

+# Serialisation
+
+[A discussion](./SERIALISATION.md) of the need for serialisation and of the
+relative characteristics of four libraries available to MicroPython. Includes a
+tutorial on a Protocol Buffer library.
+
 # SSD1306

 A means of rendering multiple larger fonts to the SSD1306 OLED display. The
--- a/SERIALISATION.md
+++ b/SERIALISATION.md
@ -0,0 +1,420 @@
+# Serialisation
+
+These notes are a discussion of the serialisation libraries available to
+MicroPython plus a tutorial on the use of a library supporting Google Protocol
+Buffers (here abbreviated to `protobuf`). The aim is not to replace official
+documentation but to illustrate the relative merits and drawbacks of the
+various approaches.
+
+##### [Main readme](./README.md)
+
+# 1. The problem
+
+The need for serialisation arises whenever data must be stored on disk or
+communicated over an interface such as a socket, a UART, I2C or SPI. All of
+these require the data to be presented as a linear sequence of bytes. The
+problem is how to convert an arbitrary Python object to such a sequence, and
+how subsequently to restore the object.
+
+I am aware of four ways of achieving this, each with its own advantages and
+drawbacks. In two cases the encoded strings comprise ASCII characters; in the
+other two they are binary (bytes can take all possible values).
+
+ 1. ujson (ASCII, official)
+ 2. pickle (ASCII, official)
+ 3. ustruct (binary, official)
+ 4. protobuf [binary, unofficial](https://github.com/dogtopus/minipb)
+
+The first two are self-describing: the format includes a definition of its
+structure. This means that the decoding process can re-create the object in the
+absence of information on its structure, which may therefore change at runtime.
+Further, `ujson` and `pickle` produce human-readable byte sequences which aid
+debugging. The drawback is inefficiency: the byte sequences are relatively
+long. They are also variable length, so the receiving process must be provided
+with a means to determine when a complete string has been received.
+
+The `ustruct` and `protobuf` solutions are binary formats: the byte sequences
+comprise binary data which is neither human readable nor self-describing.
+Binary sequences require that the receiver has information on their structure
+in order to decode them. In the case of `ustruct`, sequences are of a fixed
+length which can be determined from the structure. `protobuf` sequences are
+variable length, which requires the handling discussed below.
+
+The benefit of binary sequences is efficiency: sequence length is closer to the
+information-theoretic minimum, compared to the ASCII options.
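As a rough illustration of the size difference, the sketch below encodes the same three integers as JSON text and as a packed binary record. The values are arbitrary, and the import fallbacks are an assumption added only so that the snippet also runs under CPython.

```python
try:
    import ujson as json
    import ustruct as struct
except ImportError:  # Fall back to the CPython module names
    import json
    import struct

values = [1000, 2000, 3000]
as_json = json.dumps(values)  # Human readable text
as_struct = struct.pack('<iii', *values)  # Three 32-bit little-endian ints
print('JSON:', len(as_json), 'bytes', as_json)
print('struct:', len(as_struct), 'bytes')
```

The binary record is always 12 bytes here, while the JSON string grows with the number of digits in each value.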
+
+# 2. ujson and pickle
+
+These are very similar. `ujson` is documented
+[here](http://docs.micropython.org/en/latest/library/ujson.html). `pickle` has
+identical methods so this doc may be used for both.
+
+The advantage of `ujson` is that JSON strings can be accepted by CPython and by
+other languages. The drawback is that only a subset of Python object types can
+be converted to legal JSON strings; this is a limitation of the 
+[JSON specification](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf).
+
+The advantage of `pickle` is that it will accept any Python object except for
+instances of user defined classes. The extremely simple source may be found in
+[the official library](https://github.com/micropython/micropython-lib/tree/master/pickle).
+The strings produced are incompatible with CPython's `pickle`, but can be
+decoded in CPython by using the MicroPython decoder. There is a
+[bug](https://github.com/micropython/micropython/issues/2280) in the
+MicroPython implementation when running under MicroPython. A workaround
+consists of never encoding short strings which change repeatedly.
+
+## 2.1 Usage examples
+
+These may be copy-pasted to the MicroPython REPL.  
+Pickle:  
+```python
+import pickle
+data = {1:'test', 2:1.414, 3: [11, 12, 13]}
+s = pickle.dumps(data)
+print('Human readable data:', s)
+v = pickle.loads(s)
+print('Decoded data (partial):', v[3])
+```
+JSON. Note that dictionary keys must be strings:  
+```python
+import ujson
+data = {'1':'test', '2':1.414, '3': [11, 12, 13]}
+s = ujson.dumps(data)
+print('Human readable data:', s)
+v = ujson.loads(s)
+print('Decoded data (partial):', v['3'])
+```
+
+## 2.2 Strings are variable length
+
+In real applications the data, and hence the string length, vary at runtime.
+The receiving process needs to know when a complete string has been received or
+read from a file. In practice `ujson` and `pickle` do not include newline
+characters in encoded strings. If the data being encoded includes a newline, it
+is escaped in the string:
+```python
+import ujson
+data = {'1':'test\nmore', '2':1.414, '3': [11, 12, 13]}
+s = ujson.dumps(data)
+print('Human readable data:', s)
+v = ujson.loads(s)
+print('Decoded data (partial):', v['1'])
+```
+If this is pasted at the REPL you will observe that the human readable data
+does not have a line break (while the decoded data does). Output:
+```
+Human readable data: {"2": 1.414, "3": [11, 12, 13], "1": "test\nmore"}
+Decoded data (partial): test
+more
+```
+Consequently encoded strings may be separated with `'\n'` before saving and
+reading may be done using `readline` methods.
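A minimal sketch of that approach follows. It writes one JSON record per line and reads the records back with `readline`. An `io.StringIO` object stands in for a file or interface here; that choice, and the sample records, are assumptions made to keep the snippet self-contained.

```python
import io
try:
    import ujson as json
except ImportError:  # CPython
    import json

records = [{'id': 1, 'msg': 'line one\nline two'},
           {'id': 2, 'msg': 'short'}]

stream = io.StringIO()  # Stand-in for a text file
for rec in records:
    stream.write(json.dumps(rec))  # Embedded newlines are escaped
    stream.write('\n')  # Record delimiter

stream.seek(0)
decoded = []
while True:
    line = stream.readline()
    if not line:  # Empty string signals end of data
        break
    decoded.append(json.loads(line))
print(decoded)
```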
+
+# 3. ustruct
+
+This is documented
+[here](http://docs.micropython.org/en/latest/library/ustruct.html). The binary
+format is efficient, but the format of a sequence cannot change at runtime and
+must be "known" to the decoding process. Records are of fixed length. If data
+is to be stored in a binary random access file, the fixed record size means
+that the offset of a given record may readily be calculated.
+
+Write a 100 record file. Each record comprises three 32-bit integers:  
+```python
+import ustruct
+fmt = 'iii'  # Record format: 3 signed ints
+rlen = ustruct.calcsize(fmt)  # Record length
+buf = bytearray(rlen)
+with open('myfile', 'wb') as f:
+    for x in range(100):
+        y = x * x
+        z = x * 10
+        ustruct.pack_into(fmt, buf, 0, x, y, z)
+        f.write(buf)
+```
+Read record no. 10 from that file:  
+```python
+import ustruct
+fmt = 'iii'
+rlen = ustruct.calcsize(fmt)  # Record length
+buf = bytearray(rlen)
+rnum = 10  # Record no.
+with open('myfile', 'rb') as f:
+    f.seek(rnum * rlen)
+    f.readinto(buf)
+    result = ustruct.unpack_from(fmt, buf)
+print(result)
+```
+Owing to the fixed record length, integers must be constrained to fit the
+length declared in the format string.
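Rather than rely on what the packing code does with an out-of-range value, the range can be checked before packing. The sketch below does this for the `'iii'` record used above. `pack_record` is a hypothetical helper, and the explicit little-endian format `'<iii'` is an assumption chosen so that the record size is the same on every platform.

```python
try:
    import ustruct as struct
except ImportError:  # CPython
    import struct

INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1

def pack_record(x, y, z):
    # Hypothetical helper: refuse values that cannot be held in 32 bits
    for n in (x, y, z):
        if not INT32_MIN <= n <= INT32_MAX:
            raise ValueError('%d does not fit a 32-bit signed int' % n)
    return struct.pack('<iii', x, y, z)

rec = pack_record(1, -2, INT32_MAX)
print(len(rec))
try:
    pack_record(1, 2, 1 << 31)  # One more than INT32_MAX
except ValueError as e:
    print('Rejected:', e)
```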
+
+Binary formats cannot use delimiters as any delimiter character may be present
+in the data - however the fixed length of `ustruct` records means that this is
+not a problem.
+
+For performance oriented applications, `ustruct` is the only serialisation
+approach which can be used in a non-allocating fashion, by using pre-allocated
+buffers as in the above example.
+
+## 3.1 Strings
+
+In `ustruct` the `s` data type is normally prefixed by a length (defaulting to
+1). This ensures that records are of fixed size, but is potentially inefficient
+as shorter strings will still occupy the same amount of space. Longer strings
+will silently be truncated. Shorter strings are padded with zeros.
+
+```python
+import ustruct
+fmt = 'ii30s'
+rlen = ustruct.calcsize(fmt)  # Record length
+buf = bytearray(rlen)
+ustruct.pack_into(fmt, buf, 0, 11, 22, 'the quick brown fox')
+ustruct.unpack_from(fmt, buf)
+ustruct.pack_into(fmt, buf, 0, 11, 22, 'rats')
+ustruct.unpack_from(fmt, buf)  # Packed with zeros
+ustruct.pack_into(fmt, buf, 0, 11, 22, 'the quick brown fox jumps over the lazy dog')
+ustruct.unpack_from(fmt, buf)  # Truncation
+```
+Output:
+```python
+(11, 22, b'the quick brown fox\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
+(11, 22, b'rats\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
+(11, 22, b'the quick brown fox jumps over')
+```
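In practice the padding usually needs to be removed on decoding, and silent truncation is best avoided on encoding. The following sketch shows one way to do both. `pack_text` and `unpack_text` are hypothetical helpers; the approach assumes that the text never legitimately ends in a zero byte.

```python
try:
    import ustruct as struct
except ImportError:  # CPython
    import struct

fmt = 'ii30s'
SLEN = 30  # Must match the count in fmt

def pack_text(a, b, text):
    raw = text.encode()  # UTF-8 bytes
    if len(raw) > SLEN:
        raise ValueError('string too long for record')
    return struct.pack(fmt, a, b, raw)

def unpack_text(buf):
    a, b, raw = struct.unpack(fmt, buf)
    # Strip the zero padding before converting back to str
    return a, b, raw.rstrip(b'\x00').decode()

rec = pack_text(11, 22, 'rats')
print(unpack_text(rec))
```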
+
+# 4. Protocol Buffers
+
+This is a [Google standard](https://developers.google.com/protocol-buffers/)
+described in [this Wikipedia article](https://en.wikipedia.org/wiki/Protocol_Buffers).
+The aim is to provide a language independent, efficient, binary data interface.
+Records are variable length, and strings and integers of arbitrary size may be
+accommodated. The
+[implementation compatible with MicroPython](https://github.com/dogtopus/minipb)
+is a "micro" implementation: `.proto` files are not supported. However the data
+format aims to be a subset of the Google standard and claims compatibility with
+other platforms and languages.
+
+The principal benefit to developers using only CPython/MicroPython is its
+efficient support for fields whose length varies at runtime. To my knowledge it
+is the sole solution for encoding such data in a compact binary format.
+
+The following notes should be read in conjunction with the official docs. The
+notes aim to reduce the learning curve which I found a little challenging.
+
+In normal use the object transmitted by `minipb` will be a `dict` with entries
+having various predefined data types. Entries may be objects of variable length
+including strings, lists and other `dict` instances. The structure of the
+`dict` is defined using a `schema`. Sender and receiver share the `schema` with
+each script using it to instantiate the `Wire` class. The `Wire` instance is
+then repeatedly invoked to encode or decode the data.
+
+The `schema` is a `tuple` defining the structure of the data `dict`. Each
+element declares a key and its data type in an inner `tuple`. Elements of this
+inner `tuple` are strings, with element 0 defining the field's key. Subsequent
+elements define the field's data type; in most cases the data type is defined
+by a single string.
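To make the shape of a `schema` concrete, the sketch below defines one alongside a `dict` intended for it, then checks that every key named in the schema is present. The field names are arbitrary and `missing_keys` is a hypothetical helper, not part of `minipb`; the snippet deliberately does not import the library.

```python
# Each inner tuple is (key, data type)
schema = (('name', 'U'),   # UTF8 string
          ('count', 'z'),  # Signed integer
          ('ok', 'b'),)    # Boolean

data = {'name': 'sensor', 'count': 3}  # 'ok' has been forgotten

def missing_keys(schema, data):
    # Hypothetical helper: list schema keys absent from the dict
    return [entry[0] for entry in schema if entry[0] not in data]

print(missing_keys(schema, data))
```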
+
+## 4.1 Installation
+
+The library comprises a single file `minipb.py`. It has a dependency, the
+`logging` module `logging.py` which may be found in
+[micropython-lib](https://github.com/micropython/micropython-lib/tree/master/logging).
+On RAM constrained platforms `minipb.py` may be cross-compiled or frozen as
+bytecode for even lower RAM consumption.
+
+## 4.2 Data types
+
+These are listed in
+[the docs](https://github.com/dogtopus/minipb/wiki/Schema-Representations).
+Many of these are intended to maximise compatibility with the native data types
+of other languages. Where data will only be accessed by CPython or MicroPython,
+a subset may be used which maps onto Python data types:
+
+ 1. 'U' A UTF8 encoded string.
+ 2. 'a' A `bytes` object.
+ 3. 'b' A `bool`.
+ 4. 'f' A `float`: a 32-bit float, the usual MicroPython default.
+ 5. 'z' An `int`: a signed arbitrary length integer. Efficiently encoded with
+ an ingenious algorithm.
+ 6. 'd' A double precision 64-bit float. The default on Pyboard D SF6. Also on
+ other platforms with special firmware builds.
+ 7. 'X' An empty field.
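The integer encoding can be sketched as follows, on the assumption that `'z'` corresponds to the ZigZag plus base-128 varint scheme which the Protocol Buffers encoding documentation describes for `sint` fields. This illustrates that scheme and is not code taken from `minipb`; the function names are hypothetical.

```python
def zigzag(n):
    # Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
    return (n << 1) if n >= 0 else ((-n << 1) - 1)

def varint(u):
    # Emit 7 bits per byte, least significant group first.
    # The top bit of each byte flags that more bytes follow.
    out = bytearray()
    while True:
        b = u & 0x7F
        u >>= 7
        if u:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

for n in (0, -1, 1, 150, -100000):
    print(n, varint(zigzag(n)))
```

Small values of either sign occupy a single byte, and the length grows only as the magnitude requires.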
+
+### 4.2.1 Required and Optional fields
+
+If a field is prefixed with `*` it is a `required` field, otherwise it is
+optional. The field must still exist in the data: the only difference is that
+a `required` field cannot be set to `None`. Optional fields can be useful,
+notably for boolean types which can then represent three states.
+
+## 4.3 Application design
+
+The following is a minimal example which can be pasted at the REPL:
+```python
+import minipb
+
+schema = (('value', 'z'),)  # Dict will hold a single integer
+w = minipb.Wire(schema)
+
+data = {'value': 0}
+data['value'] = 150
+tx = w.encode(data)
+rx = w.decode(tx)  # received data
+print(rx)
+```
+This example glosses over the fact that in a real application the data will
+change and the length of the transmitted string `tx` will vary. The receiving
+process needs to know the length of each string. Note that a consequence of the
+binary format is that delimiters cannot be used. The length of each record must
+be established and made available to the receiver. In the case where data is
+being saved to a binary file, the file will need an index. Where data is to
+be transmitted over an interface each string should be prepended with a fixed
+length "size" field. The following example illustrates this.
+
+## 4.4 Transmitter/Receiver example
+
+These examples can't be cut and pasted at the REPL as they assume `send(buf)`
+and `receive(n)` functions which access the interface: `send` transmits a
+`bytes` object and `receive` returns `n` bytes.
+
+Sender example:
+```python
+import minipb
+schema = (('value', 'z'),
+          ('float', 'f'),
+          ('signed', 'z'),)
+w = minipb.Wire(schema)
+# Create a dict to hold the data
+data = {'value': 0,
+        'float': 0.0,
+        'signed' : 0,}
+while True:
+    # Update values then encode and transmit them, e.g.
+    # data['signed'] = get_signed_value()
+    tx = w.encode(data)
+    # Data lengths may change on each iteration
+    # here we encode the length in a single byte
+    dlen = len(tx).to_bytes(1, 'little')
+    send(dlen)
+    send(tx)
+```
+Receiver example:
+```python
+import minipb
+# schema must match transmitter. Typically both would import this.
+schema = (('value', 'z'),
+          ('float', 'f'),
+          ('signed', 'z'),)
+
+w = minipb.Wire(schema)
+while True:
+    # Data length stored in 1 byte: convert it to an int
+    dlen = int.from_bytes(receive(1), 'little')
+    data = receive(dlen)  # Retrieve actual data
+    rx = w.decode(data)
+    # Do something with the received dict
+```
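A single length byte cannot describe a record longer than 255 bytes. The sketch below shows the same framing idea with a two-byte length field. It uses an `io.BytesIO` object as a stand-in for the interface and arbitrary `bytes` payloads in place of encoded `minipb` records. `frame` and `read_frame` are hypothetical helpers.

```python
import io

def frame(payload):
    # Prepend a 2-byte little-endian length (records up to 65535 bytes)
    return len(payload).to_bytes(2, 'little') + payload

def read_frame(stream):
    # Return one payload, or None when no complete header remains
    hdr = stream.read(2)
    if len(hdr) < 2:
        return None
    n = int.from_bytes(hdr, 'little')
    return stream.read(n)

link = io.BytesIO()  # Stand-in for a UART, socket or binary file
for msg in (b'\x08\x96\x01', b'', b'x' * 300):
    link.write(frame(msg))

link.seek(0)
received = []
while True:
    msg = read_frame(link)
    if msg is None:
        break
    received.append(msg)
print([len(m) for m in received])
```

On a real interface a read may return fewer bytes than requested, so a production receiver would loop until the full record has arrived.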
+
+## 4.5 Repeating fields
+
+This feature enables variable length lists to be encoded. List elements must
+all be of the same (declared) data type. In this example the `value` and `txt`
+fields are variable length lists denoted by the `'+'` prefix. The `value` field
+holds a list of `int` values and `txt` holds strings:  
+```python
+import minipb
+schema = (('value', '+z'),
+          ('float', 'f'),
+          ('txt', '+U'),
+          )
+w = minipb.Wire(schema)
+
+data = {'value': [150, 123, 456],
+        'float': 1.23,
+        'txt' : ['abc', 'def', 'ghi'],
+        }
+tx = w.encode(data)
+rx = w.decode(tx)
+print(rx)
+data['txt'][1] = 'the quick brown fox'  # Strings have variable length
+data['txt'].append('the end')  # List has variable length
+data['value'].append(999)  # Variable length
+tx = w.encode(data)
+rx = w.decode(tx)
+print(rx)
+```
+
+### 4.5.1 Packed repeating fields
+
+The author of `minipb` [does not recommend](https://github.com/dogtopus/minipb/issues/6)
+their use. Their purpose appears to be in the context of fixed-length fields
+which are outside the scope of pure Python programming.
+
+## 4.6 Message fields (nested dicts)
+
+The concept of message fields is a Protocol Buffer notion. In MicroPython
+terminology a message field contains a `dict` whose contents are defined by
+another schema. This enables nested dictionaries whose entries may be any valid
+`protobuf` data type.
+
+This is illustrated below. The example extends this by making the field a
+variable length list of `dict` objects (with the `'+['` specifier):
+```python
+import minipb
+# Schema for the nested dictionary instances
+nested_schema = (('str2', 'U'),
+                 ('num2', 'z'),)
+# Outer schema
+schema = (('number', 'z'),
+          ('string', 'U'),
+          ('nested', '+[', nested_schema, ']'),
+          ('num', 'z'),)
+w = minipb.Wire(schema)
+
+data = {
+    'number': 123,
+    'string': 'test',
+    'nested': [{'str2': 'string','num2': 888,},
+               {'str2': 'another_string', 'num2': 12345,}, ],
+    'num' : 42
+}
+tx = w.encode(data)
+rx = w.decode(tx)
+print(rx)
+print(rx['nested'][0]['str2'])  # Access inner dict instances
+print(rx['nested'][1]['str2'])
+# Appending to the nested list of dicts
+data['nested'].append({'str2': 'rats', 'num2':999})
+tx = w.encode(data)
+rx = w.decode(tx)
+print(rx)
+print(rx['nested'][2]['str2'])  # Access inner dict instances
+```
+
+### 4.6.1 Recursion
+
+This is surely overkill in most MicroPython applications, but for the sake of
+completeness message fields can be recursive:
+```python
+import minipb
+inner_schema = (('str2', 'U'),
+                ('num2', 'z'),)
+
+nested_schema = (('inner', '+[', inner_schema, ']'),)
+
+schema = (('number', 'z'),
+          ('string', 'U'),
+          ('nested', '[', nested_schema, ']'),
+          ('num', 'z'),)
+
+w = minipb.Wire(schema)
+
+data = {
+    'number': 123,
+    'string': 'test',
+    'nested': {'inner': ({'str2': 'string', 'num2': 888,},
+                         {'str2': 'another_string', 'num2': 12345,},),},
+    'num': 42
+}
+tx = w.encode(data)
+rx = w.decode(tx)
+print(rx)
+print(rx['nested']['inner'][0]['str2'])
+```