2018-11-30 10:34:35 +00:00
# Resilient MicroPython WiFi code

2018-12-10 09:37:04 +00:00

This document is intended as a general design guide. A specific solution for
many IOT applications may be found [here](https://github.com/peterhinch/micropython-iot.git).

The following is based on experience with the ESP8266. It aims to show how
to design responsive bidirectional networking applications which are resilient:
they recover from WiFi and server outages and are capable of long term running
without crashing.

It is possible to write resilient code for ESP8266, but little existing code
takes account of the properties of wireless links and the limitations of the
hardware. On bare metal, in the absence of an OS, it is necessary to detect
outages and initiate recovery to ensure that consistent program state is
maintained and to avoid crashes and `LmacRxBlk` errors.

Radio links are inherently unreliable. They can be disrupted by sporadic RF
interference, especially near the limits of range. A mobile device such as a
robot can move slowly out of range and then back in again. The access point
(AP) can suffer an outage, as can the application code at the other end of the
link. An application intended for long term running on a WiFi connected device
should be able to recover from such events. Brief outages are common. In a
house whose WiFi is reliable as experienced on normal devices, outages occur
at a rate of around 20 per day.

The brute-force approach of a hardware watchdog timer has merit for recovering
from crashes, but the use of a hard reset implies the loss of program state. A
hardware or software watchdog does not remove the need to perform continuous
monitoring of connectivity. In the event of an outage, code may continue to
run, feeding the watchdog; when the outage ends the ESP8266 will reconnect but
the application will be in an arbitrary state. Further, sockets may be left
open, leading to `LmacRxBlk` errors and crashes.

# 1. Abstract

Many applications keep sockets open for long periods during which connectivity
may temporarily be lost. The socket may raise an exception but this is not
guaranteed: in cases of WiFi outage, loss of connectivity cannot be determined
from the socket state.

Detecting an outage is vital to ensure sockets are closed and to enable code at
both endpoints to initiate recovery; also to avoid crashes caused by writing
to a socket whose counterpart is unavailable.

It seems that the only sure way to detect an outage is for each endpoint
regularly to send data, and for the receiving endpoint to implement a read
timeout.

Failure correctly to detect and recover from WiFi disruption is a major cause
of unreliability in ESP8266 applications.

A demo is provided of a system where multiple ESP8266 clients communicate with
a wired server with low latency full duplex links. This has run for extended
periods with multiple clients without issue. The demo is intended to illustrate
the minimum requirements for a resilient system.

# 2. Hardware

There are numerous poor quality ESP8266 boards. There can also be issues caused
by inadequate power supplies. I have found the following to be bomb-proof:
1. [Adafruit Feather Huzzah](https://www.adafruit.com/product/2821)
2. [Adafruit Huzzah](https://www.adafruit.com/product/2471)
3. [WeMos D1 Mini](https://wiki.wemos.cc/products:d1:d1_mini) My testing was
   on an earlier version with the metal cased ESP8266.

# 3. Introduction

I became aware of the issue when running the official umqtt clients on an
ESP8266. Despite being one room away from the AP the connection seldom stayed
up for more than an hour or two. This in a house where WiFi as perceived by
PCs and other devices is rock-solid. Subsequent tests using the code in this
repo have demonstrated that brief outages are frequent.

I developed a [resilient MQTT driver](https://github.com/peterhinch/micropython-mqtt.git)
which is capable of recovering from WiFi outages. This is rather complex, in
part because of the requirements of MQTT.

The demo code in this repo aims to establish the minimum requirements for a
resilient bidirectional link between an application on a wired server and a
client on an ESP8266. If a loss of connectivity occurs for any reason,
communication pauses for the duration, resuming when the link is restored.

# 4. Application design

The two problems which must be solved are detection of an outage and ensuring
that, when the outage ends, both endpoint applications can resume without loss
of program state.

While an ESP8266 can detect a local loss of WiFi connectivity, detection of
link deterioration or of failure of the remote endpoint is more difficult.

To enable a WiFi device to cope with outages there are three approaches of
increasing sophistication.

1. Brief connection: the device code runs an infinite loop. It periodically
   waits for WiFi availability, connects to the remote, does its job and
   disconnects. The hope is that WiFi failure during the brief period of
   connection is unlikely. Program state is maintained. Advantage: outage
   detection is avoided. Drawbacks: unlikely is not impossible, and the device
   cannot respond quickly to data from the remote.
2. Hard reset: this means detecting in code an outage of WiFi or of the
   remote and triggering a hard reset, which implies a loss of program state.
3. Resilient connection: this is the approach discussed here, where an outage
   is detected. The code on each endpoint recovers when connectivity resumes.
   Program state after recovery is consistent.

In the first two options the remote endpoint loops: it waits for a connection,
acquires the data, then closes the connection.

## 4.1 Outage detection

At low level communication is via sockets linking two endpoints. In the case
under discussion the endpoints are on physically separate hardware, at least
one device being connected by WiFi. Each endpoint has a socket instance, with
both sharing a port. If one endpoint closes its socket, the other gets an
exception which should be handled appropriately, especially by closing its
socket.

Based on experience with the ESP8266, WiFi failures seldom cause exceptions to
be thrown. Consider a nonblocking socket performing reads from a device. In an
outage the socket will behave in the same way as during periods when it waits
for data to arrive. During an outage, writes to a nonblocking socket will
proceed normally until the ESP8266 buffers fill, provoking the dreaded
`LmacRxBlk:1` messages.

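
The point about reads can be reproduced on any platform. The sketch below is an
illustration for CPython rather than ESP8266 code: it uses
`socket.socketpair()` as a stand-in for the two endpoints, and the helper name
`poll_read` is hypothetical. A nonblocking read with nothing pending gives the
same result whether the peer is idle or unreachable, so the socket alone cannot
tell the two cases apart.

```python
import socket

def poll_read(sock, nbytes=64):
    """Return received bytes, or None when nothing is waiting.

    None is returned both when the peer is merely idle and when the link
    is down: the two cases look identical at the socket.
    """
    try:
        return sock.recv(nbytes)
    except BlockingIOError:
        # No data pending. BlockingIOError is the CPython spelling; other
        # platforms may report the same condition as a plain OSError.
        return None

a, b = socket.socketpair()   # connected pair standing in for the two endpoints
a.setblocking(False)

idle = poll_read(a)          # peer has sent nothing: None
b.send(b"hello\n")
got = poll_read(a)           # data is now waiting
a.close()
b.close()
```
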
The `isconnected()` method is inadequate for detecting outages as it is a
property of the interface rather than the link. If two WiFi devices are
communicating, one may lose `isconnected()` owing to local radio conditions. If
the other end tried to assess connectivity with `isconnected()` it would
incorrectly conclude that there was no problem. Further, the method is unable
to detect outages caused by program failure on the remote endpoint.

The only reliable way to detect loss of connectivity appears to be by means of
timeouts, in particular on socket reads. To keep a link open a minimum interval
between data writes must be enforced. The endpoint performing the read times
the interval between successful reads: if this exceeds a threshold the link is
presumed to have died and a recovery process is initiated.

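
A minimal sketch of the read-timeout idea follows. The class name `LinkMonitor`
and its methods are hypothetical and are not taken from the demo. The clock is
injected so that the logic can be exercised without waiting; on MicroPython
`time.ticks_ms()` with `time.ticks_diff()` would be the natural replacement for
`time.monotonic()`.

```python
import time

class LinkMonitor:
    """Track the interval since the last successful read."""

    def __init__(self, timeout_ms, clock=time.monotonic):
        self.timeout_ms = timeout_ms
        self._clock = clock          # callable returning seconds
        self._last = clock()

    def feed(self):
        # Call after every successful read, keepalives included.
        self._last = self._clock()

    def expired(self):
        # True once the silence has lasted longer than the threshold:
        # the link is then presumed dead and recovery should start.
        return (self._clock() - self._last) * 1000 > self.timeout_ms
```
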
This implies that WiFi applications which only send data cannot reliably deal
with outages: to create a resilient link both ends need to wait on a read while
checking for a timeout. A device whose network connection is via WiFi can
sometimes get early notification of an outage with `isconnected()` but this is
only an adjunct to the read timeout.

When a wireless device detects an outage it should ensure that the other end of
the link also detects it so that sockets may be closed and connectivity may be
restored when the WiFi recovers. This means that it should avoid sending data
for a period greater than the timeout period.

A further requirement for ESP8266 is to limit the amount of data put into a
socket while the remote endpoint is down: excessive data quantities can provoke
`LmacRxBlk` errors. I have not quantified this, but in general if N packets are
sent in each timeout interval there will be a maximum permissible size for a
packet. The timeout interval will therefore be constrained by the maximum
throughput required.

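
The relationship can be written down as simple arithmetic. In the sketch below
`budget` is a hypothetical figure for the number of bytes the ESP8266 can
safely buffer during an undetected outage. It has not been measured and must be
found by experiment; the function names are likewise illustrative only.

```python
def max_packet_size(budget, packets_per_timeout):
    """Largest packet (bytes) such that N packets fit within the budget."""
    if packets_per_timeout < 1:
        raise ValueError("need at least one packet per timeout interval")
    return budget // packets_per_timeout

def max_timeout_ms(budget, bytes_per_second):
    """Longest timeout (ms) for which the required throughput stays in budget."""
    if bytes_per_second <= 0:
        raise ValueError("throughput must be positive")
    return budget * 1000 // bytes_per_second
```

With an assumed budget of 2000 bytes, four packets per timeout interval would
limit each packet to 500 bytes, and a required throughput of 1000 bytes/s would
limit the timeout to 2000ms.
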
## 4.2 Timeout value

The demo uses timeouts measured in seconds, enabling prompt recovery from
outages. The assumption is that all devices share a local network. If the
server is on the internet longer timeouts will be required.

To preserve reliability the amount of data sent during the timeout period must
be controlled. If connectivity is lost immediately after a keepalive, the loss
will be undetected until the timeout has elapsed. Any data sent during that
period will be buffered by the ESP8266 vendor code. Too much will lead to
`LmacRxBlk` and probable crashes. What constitutes "too much" is moot:
experimentation is required.

## 4.3 Recovery

The demo system employs the following procedure for recovering from outages.
The wirelessly connected client behaves as follows.

All coroutines accessing the interface are cancelled, and all open sockets are
closed: this is essential to avoid `LmacRxBlk:1` messages and crashes. The WiFi
connection is downed.

The client then periodically attempts to reconnect to WiFi. On success it
checks that local WiFi connectivity remains good for a period of double the
timeout. During this period no attempt is made to send or receive data. This
ensures that the remote device will also detect an outage and close its
sockets. The procedure also establishes confidence that the WiFi as seen by the
client is stable.

At the end of this period the client attempts to re-establish the connection,
repeating the recovery procedure on failure. The server responds to the loss of
connectivity by closing the connection and the sockets. It responds to the
reconnection as per a new connection.

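
The waiting phase of the client procedure might be sketched as below. This is
not the demo code: `wait_stable` and `recover` are hypothetical names, and
`is_connected` is assumed to be a callable supplied by the application (on an
ESP8266 it could wrap `network.WLAN(network.STA_IF).isconnected()`). Cancelling
coroutines, closing sockets and downing the interface are assumed to have been
done before `recover` is awaited.

```python
import asyncio

async def wait_stable(is_connected, period_ms, poll_ms=100):
    """Return True if is_connected() stays true for period_ms, else False."""
    elapsed = 0
    while elapsed < period_ms:
        if not is_connected():
            return False             # WiFi dropped again: not yet stable
        await asyncio.sleep(poll_ms / 1000)
        elapsed += poll_ms
    return True

async def recover(is_connected, timeout_ms, retry_ms=1000):
    """Return once WiFi has been continuously up for double the timeout."""
    while True:
        while not is_connected():    # periodically check for reconnection
            await asyncio.sleep(retry_ms / 1000)
        # No data is sent or received during this period, so the remote
        # endpoint is certain to time out and close its sockets.
        if await wait_stable(is_connected, 2 * timeout_ms):
            return
```
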
# 5. Demo system

This demo is of a minimal system based on nonblocking sockets. It is responsive
in that each endpoint can respond immediately to a packet from its counterpart.
WiFi connected clients can run indefinitely near the limit of wireless range;
they automatically recover from outages of the WiFi and of the remote endpoint.

The application scenario is of multiple wirelessly connected clients, each
communicating with its own application object running on a wired server.
Communication is asynchronous and full-duplex (i.e. communication is
bidirectional and can be initiated asynchronously by either end of the link).

A data packet is a '\n' terminated line of text. Blank lines are reserved for
keepalive packets. The demo application uses JSON to serialise and exchange
arbitrary Python objects.

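
The packet format can be illustrated with a few lines of Python. The function
names below are hypothetical and the code is a sketch of the framing described
above, not an extract from the demo.

```python
import json

KEEPALIVE = b"\n"    # a blank line carries no payload

def encode(obj):
    """Serialise a Python object to a single newline-terminated line."""
    # json.dumps() escapes newlines inside strings, so the only raw
    # newline in the result is the terminator added here.
    return json.dumps(obj).encode() + b"\n"

def is_keepalive(line):
    """True for a blank line, which only resets the peer's read timeout."""
    return line.strip() == b""

def decode(line):
    """Recover the object from a data line (not a keepalive)."""
    return json.loads(line.decode())
```
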
The demo comprises the following files:
1. `server.py` A server for the MicroPython Unix build on a wired network
   connection.
2. `application.py` Server-side application demo.
3. `client_w.py` A client for ESP8266.
4. `client_id.py` Each client must have a unique ID provided by this file.
   Also holds server IP and port number.
5. `primitives.py` A stripped down version of
   [asyn.py](https://github.com/peterhinch/micropython-async/blob/master/asyn.py).
   This is used by server and client. The aim is RAM saving on ESP8266.

## 5.1 The client

The principal purpose of the demo is to expose the client code. A more usable
version could be written where the boilerplate code was separated from the
application code, and I will do this. This version deliberately lays bare its
workings for study.

It is started by instantiating a `Client` object. The constructor assumes
that the ESP8266 will auto-connect to an existing network. It starts a `run()`
coroutine which executes an infinite loop, initially waiting for a WiFi
connection. It then launches `reader` and `writer` coroutines. The `writer`
coro periodically sends a JSON encoded list, and the remote endpoint does
likewise.

The client's `readline()` function times out after 1.5 seconds, raising an
`OSError`. If this occurs, the `reader` coro terminates, clearing the `.rok`
(reader OK) flag. This causes `run()` to terminate the `writer` (and
`_keepalive`) coros. When all coros have died, `run()` downs the WiFi for
double the timeout period to ensure that the remote will detect and respond to
the outage. The loop then repeats, attempting again to establish a connection.

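
A read with a timeout can be sketched as follows. This is an assumption-laden
illustration written for CPython's `asyncio`, not the client's actual
`readline()`: the signature, the polling interval and the per-line timing are
my own choices, with only the 1.5 second default and the `OSError` taken from
the description above.

```python
import asyncio
import time

async def readline(sock, timeout_ms=1500, poll_ms=10):
    """Read one newline-terminated line from a nonblocking socket.

    Raises OSError if no complete line arrives within timeout_ms.
    """
    deadline = time.monotonic() + timeout_ms / 1000
    line = b""
    while not line.endswith(b"\n"):
        if time.monotonic() > deadline:
            raise OSError("read timeout")
        try:
            chunk = sock.recv(1)     # one byte at a time: never read past the newline
        except BlockingIOError:
            chunk = None             # nothing waiting yet
        if chunk == b"":
            raise OSError("connection closed by peer")
        if chunk:
            line += chunk
        else:
            await asyncio.sleep(poll_ms / 1000)   # yield to other coroutines
    return line
```
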
The `writer` coro has similar logic ensuring that if it encounters an error the
other coros will be terminated.

Both `reader` and `writer` start by instantiating a socket and connecting to
the appropriate port. The socket is set to nonblocking and the unique client ID
(retrieved from `client_id.py`) is sent to the server. This enables the server
to associate a connection with a specific client.

When `writer` has connected to the server it starts the `_keepalive` method:
this sends a blank line at a rate guaranteed to ensure that at least one will
be sent every timeout interval.

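
A keepalive task of this kind might look like the sketch below. It is written
for CPython's `asyncio` and is not the demo's `_keepalive` method: `send` is
assumed to be any callable accepting bytes, and the choice of two thirds of the
timeout as the interval is an assumed safety margin rather than a figure from
the demo.

```python
import asyncio

async def keepalive(send, timeout_ms):
    """Send a blank line often enough that one falls in every timeout interval."""
    interval = timeout_ms / 1000 * 2 / 3   # assumed margin below the timeout
    while True:
        send(b"\n")                        # blank line: keepalive packet
        await asyncio.sleep(interval)
```

The task runs until it is cancelled, which fits the recovery procedure in which
all coroutines using the interface are cancelled when an outage is detected.
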
The server also sends a blank line periodically. This serves to reset the
timeout on the `readline()` method of the client, preventing a timeout from
occurring. Thus outage detection is effectively transparent: client and server
applications can send data at any rate.

## 5.2 The application

In this demo the application is assumed to reside on the server machine. This
enables a substantial simplification with the timeout, keepalive and error
handling being devolved to the server.

In this demo up to 4 clients with IDs 1-4 are each served by an instance of
`App`. However they could equally be served by different application classes.
When an `App` is instantiated the `start()` coro runs, which waits for the
server to establish a connection with the correct client. It retrieves the
connection, starts `reader` and `writer` coros, and quits.

The `reader` and `writer` coros need take no account of link status. They
communicate using server `readline()` and `write()` methods which will pause
for the duration of any outage.

## 5.3 The server

The application starts the server by launching the `server.run` coro. The
argument defines the read timeout value, which should be the same as that on
the client. The value (in ms) determines the keepalive rate and the minimum
downtime of an outage.

The `server.run` coro runs forever awaiting incoming connections. When one
occurs a socket is created and a line containing the client ID is read. If the
client ID is not in the dictionary of clients (`Connection.connections`) a
`Connection` is instantiated for that client and placed in the dictionary. On
subsequent connections of that client, the `Connection` will be retrieved from
the dictionary. This is done by classmethod `Connection.go`, which assigns the
socket to that `Connection` instance.

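
The registry pattern described above can be sketched as follows. The names
`Connection`, `connections` and `go` come from the description, but the
signature and body are assumptions made for illustration and will differ from
the real `server.py`.

```python
class Connection:
    connections = {}    # client ID -> Connection, shared by the whole class

    def __init__(self, client_id):
        self.client_id = client_id
        self.sock = None

    @classmethod
    def go(cls, client_id, sock):
        """Attach sock to the Connection for client_id, creating it if new."""
        conn = cls.connections.get(client_id)
        if conn is None:
            # First connection from this client: create and register it.
            conn = cls(client_id)
            cls.connections[client_id] = conn
        # On a reconnection the existing instance simply gets the new socket.
        conn.sock = sock
        return conn
```

Because the instance outlives any one socket, application code holding a
`Connection` is unaffected when a client reconnects after an outage.
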
The server provides `readline` and `write` methods. In the event of an outage
they will pause for the duration. Message transmission is not guaranteed: if an
outage occurs after transmission has commenced, the message will be lost.

In testing through hundreds of outages, no instances of corrupted or partial
messages occurred. Presumably TCP/IP ensures this.