micropython-samples/resilient
Peter Hinch b0bd198c70 Add picoweb installation guide. 2019-05-14 16:48:38 +01:00
README.md resilient: add link to micropython-iot. 2018-12-10 09:37:04 +00:00
application.py resilient directory added. 2018-11-30 10:34:35 +00:00
client_id.py Add picoweb installation guide. 2019-05-14 16:48:38 +01:00
client_w.py client_w remove redundant writer arg. 2018-12-01 09:42:55 +00:00
primitives.py primitives.py Add NamedTask because of possible failure mode. 2018-12-02 13:18:02 +00:00
server.py resilient directory added. 2018-11-30 10:34:35 +00:00

README.md

Resilient MicroPython WiFi code

This document is intended as a general design guide. A specific solution for many IOT applications may be found here.

The following is based on experience with the ESP8266. It aims to show how to design responsive bidirectional networking applications which are resilient: they recover from WiFi and server outages and are capable of long term running without crashing.

It is possible to write resilient code for ESP8266, but little existing code takes account of the properties of wireless links and the limitations of the hardware. On bare metal, in the absence of an OS, it is necessary to detect outages and initiate recovery to ensure that consistent program state is maintained and to avoid crashes and LmacRxBlk errors.

Radio links are inherently unreliable. They can be disrupted by sporadic RF interference, especially near the limits of range. A mobile device such as a robot can move slowly out of range and then back in again. The access point (AP) can suffer an outage as can the application code at the other end of the link. An application intended for long term running on a WiFi connected device should be able to recover from such events. Brief outages are common. In a house whose WiFi is reliable as experienced on normal devices, outages occur at a rate of around 20 per day.

The brute-force approach of a hardware watchdog timer has merit for recovering from crashes, but the use of a hard reset implies the loss of program state. A hardware or software watchdog does not remove the need to perform continuous monitoring of connectivity. In the event of an outage, code may continue to run, feeding the watchdog; when the outage ends the ESP8266 will reconnect but the application will be in an arbitrary state. Further, sockets may be left open, leading to LmacRxBlk errors and crashes.

1. Abstract

Many applications keep sockets open for long periods during which connectivity may temporarily be lost. The socket may raise an exception but this is not guaranteed: in cases of WiFi outage, loss of connectivity cannot be determined from the socket state.

Detecting an outage is vital to ensure sockets are closed and to enable code at both endpoints to initiate recovery; also to avoid crashes caused by writing to a socket whose counterpart is unavailable.

It seems that the only sure way to detect an outage is for each endpoint regularly to send data, and for the receiving endpoint to implement a read timeout.

Failure correctly to detect and recover from WiFi disruption is a major cause of unreliability in ESP8266 applications.

A demo is provided of a system where multiple ESP8266 clients communicate with a wired server with low latency full duplex links. This has run for extended periods with multiple clients without issue. The demo is intended to illustrate the minimum requirements for a resilient system.

2. Hardware

There are numerous poor quality ESP8266 boards. There can also be issues caused by inadequate power supplies. I have found the following to be bomb-proof:

  1. Adafruit Feather Huzzah
  2. Adafruit Huzzah
  3. WeMos D1 Mini. My testing was on an earlier version with the metal cased ESP8266.

3. Introduction

I became aware of the issue when running the official umqtt clients on an ESP8266. Despite being one room away from the AP the connection seldom stayed up for more than an hour or two. This in a house where WiFi as perceived by PCs and other devices is rock-solid. Subsequent tests using the code in this repo have demonstrated that brief outages are frequent.

I developed a resilient MQTT driver which is capable of recovering from WiFi outages. This is rather complex, in part because of the requirements of MQTT.

The demo code in this repo aims to establish the minimum requirements for a resilient bidirectional link between an application on a wired server and a client on an ESP8266. If a loss of connectivity occurs for any reason, communication pauses for the duration, resuming when the link is restored.

4. Application design

The two problems which must be solved are detection of an outage and ensuring that, when the outage ends, both endpoint applications can resume without loss of program state.

While an ESP8266 can detect a local loss of WiFi connectivity, detection of link deterioration or of failure of the remote endpoint is more difficult.

To enable a WiFi device to cope with outages there are three approaches of increasing sophistication.

  1. Brief connection: the device code runs an infinite loop. It periodically waits for WiFi availability, connects to the remote, does its job and disconnects. The hope is that WiFi failure during the brief period of connection is unlikely. Program state is maintained. Advantage: outage detection is avoided. Drawbacks: unlikely is not impossible, and the device cannot respond quickly to data from the remote.
  2. Hard reset: this implies detecting in code an outage of WiFi or of the remote and triggering a hard reset. This implies a loss of program state.
  3. Resilient connection. This is the approach discussed here, where an outage is detected. The code on each endpoint recovers when connectivity resumes. Program state after recovery is consistent.

In the first two options the remote endpoint loops: it waits for a connection, acquires the data, then closes the connection.

4.1 Outage detection

At low level, communication is via sockets linking two endpoints. In the case under discussion the endpoints are on physically separate hardware, at least one device being connected by WiFi. Each endpoint has a socket instance with both sharing a port. If one endpoint closes its socket, the other gets an exception which should be handled appropriately, in particular by closing its own socket.

Based on experience with the ESP8266, WiFi failures seldom cause exceptions to be thrown. Consider a nonblocking socket performing reads from a device. In an outage the socket will behave in the same way as during periods when it waits for data to arrive. During an outage, writes to a nonblocking socket will proceed normally until the ESP8266 buffers fill, provoking the dreaded LmacRxBlk:1 messages.

The isconnected() method is inadequate for detecting outages as it is a property of the interface rather than the link. If two WiFi devices are communicating, one may lose isconnected() owing to local radio conditions. If the other end tried to assess connectivity with isconnected() it would incorrectly conclude that there was no problem. Further, the method is unable to detect outages caused by program failure on the remote endpoint.

The only reliable way to detect loss of connectivity appears to be by means of timeouts, in particular on socket reads. To keep a link open, a minimum interval between data writes must be enforced. The endpoint performing the read times the interval between successful reads: if this exceeds a threshold the link is presumed to have died and a recovery process is initiated.
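The read-timeout rule can be expressed as a small helper. This is a minimal sketch rather than the demo's code: the class name `LinkMonitor` and its methods are hypothetical, and it uses CPython's `time.monotonic` where MicroPython code would more likely use `ticks_ms()` and `ticks_diff()`.

```python
import time


class LinkMonitor:
    """Times the interval since the last successful read (hypothetical helper)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_read = clock()

    def data_received(self):
        # Call after every successful read, keepalives included.
        self._last_read = self._clock()

    def link_dead(self):
        # True once nothing has arrived for longer than the timeout.
        return self._clock() - self._last_read > self.timeout_s
```

A reader loop would call `data_received()` whenever bytes arrive and start recovery as soon as `link_dead()` returns True.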

This implies that WiFi applications which only send data cannot reliably deal with outages: to create a resilient link both ends need to wait on a read while checking for a timeout. A device whose network connection is via WiFi can sometimes get early notification of an outage with isconnected() but this is only an adjunct to the read timeout.

When a wireless device detects an outage it should ensure that the other end of the link also detects it, so that sockets may be closed and connectivity may be restored when the WiFi recovers. This means that it must avoid sending data for a period greater than the timeout period.

A further requirement for ESP8266 is to limit the amount of data put into a socket while the remote endpoint is down: excessive data quantities can provoke LmacRxBlk errors. I have not quantified this, but in general if N packets are sent in each timeout interval there will be a maximum permissible size for a packet. The timeout interval will therefore be constrained by the maximum throughput required.
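The relationship between packet count and packet size is simple arithmetic. In the sketch below the function name and the `buffer_budget_bytes` parameter are assumptions: the safe amount of buffered data has not been measured, so the budget is a placeholder to be found by experiment.

```python
def max_packet_size(buffer_budget_bytes, packets_per_timeout):
    """Largest packet that keeps one timeout interval's traffic within budget.

    buffer_budget_bytes is a hypothetical, experimentally determined limit on
    the data that may be queued while the remote endpoint is unreachable.
    """
    if packets_per_timeout < 1:
        raise ValueError("at least one packet (the keepalive) is sent per interval")
    return buffer_budget_bytes // packets_per_timeout
```

With a purely illustrative budget of 2000 bytes and 5 packets per timeout interval, each packet would be limited to 400 bytes. Halving the timeout at the same packet rate would allow larger packets, which is the trade-off between throughput and timeout described above.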

4.2 Timeout value

The demo uses timeouts measured in seconds, enabling prompt recovery from outages. The assumption is that all devices share a local network. If the server is on the internet longer timeouts will be required.

To preserve reliability the amount of data sent during the timeout period must be controlled. If connectivity is lost immediately after a keepalive, the loss will be undetected until the timeout has elapsed. Any data sent during that period will be buffered by the ESP8266 vendor code. Too much will lead to LmacRxBlk and probable crashes. What constitutes "excessive" is moot: experimentation is required.

4.3 Recovery

The demo system employs the following procedure for recovering from outages. The wirelessly connected client behaves as follows.

All coroutines accessing the interface are cancelled, and all open sockets are closed: this is essential to avoid LmacRxBlk:1 messages and crashes. The WiFi connection is downed.

The client then periodically attempts to reconnect to WiFi. On success it checks that local WiFi connectivity remains good for a period of double the timeout. During this period no attempt is made to send or receive data. This ensures that the remote device will also detect an outage and close its sockets. The procedure also establishes confidence that the WiFi as seen by the client is stable.

At the end of this period the client attempts to re-establish the connection, repeating the recovery procedure on failure. The server responds to the loss of connectivity by closing the connection and the sockets. It responds to the reconnection as per a new connection.
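The client-side steps above can be sketched as a single coroutine. This is an outline under stated assumptions, not the demo's implementation: `recover`, its parameters and `poll_s` are hypothetical names, and `wifi` stands for any object offering `connect()`, `disconnect()` and `isconnected()` (a real station interface may need credentials passed to `connect()`).

```python
import asyncio


async def recover(wifi, tasks, sockets, timeout_s, poll_s=0.1):
    """Hypothetical recovery routine following the procedure in the text."""
    # 1. Cancel every coroutine that accesses the interface.
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    # 2. Close all open sockets.
    for sock in sockets:
        sock.close()
    # 3. Down the WiFi connection.
    wifi.disconnect()
    # 4. Reconnect, then require the link to stay up for double the timeout.
    #    No data is sent or received during this period.
    while True:
        wifi.connect()
        while not wifi.isconnected():
            await asyncio.sleep(poll_s)
        elapsed = 0.0
        stable = True
        while elapsed < 2 * timeout_s:
            await asyncio.sleep(poll_s)
            elapsed += poll_s
            if not wifi.isconnected():
                stable = False        # dropped again: restart the check
                break
        if stable:
            return                    # caller may now reopen its sockets
```

Once `recover()` returns, the caller attempts to re-establish the connection to the server and runs the routine again if that attempt fails.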

5. Demo system

This demo is of a minimal system based on nonblocking sockets. It is responsive in that each endpoint can respond immediately to a packet from its counterpart. WiFi connected clients can run indefinitely near the limit of wireless range; they automatically recover from outages of the WiFi and of the remote endpoint.

The application scenario is of multiple wirelessly connected clients, each communicating with its own application object running on a wired server. Communication is asynchronous and full-duplex (i.e. communication is bidirectional and can be initiated asynchronously by either end of the link).

A data packet is a '\n' terminated line of text. Blank lines are reserved for keepalive packets. The demo application uses JSON to serialise and exchange arbitrary Python objects.
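The framing can be illustrated in a few lines. The function names here are hypothetical and the sketch uses CPython's `json` module. Because JSON escapes newlines inside strings, an encoded object never contains a raw `'\n'`, so the terminator is unambiguous.

```python
import json

KEEPALIVE = b"\n"      # a blank line carries no payload


def encode(obj):
    """Serialise a Python object as one newline-terminated line."""
    return json.dumps(obj).encode() + b"\n"


def is_keepalive(line):
    return line == KEEPALIVE


def decode(line):
    """Recover the object from a received line (keepalives are rejected)."""
    if is_keepalive(line):
        raise ValueError("keepalive carries no payload")
    return json.loads(line.decode())
```

A receiver would test each incoming line with `is_keepalive()` first and pass only the remaining lines to `decode()`.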

The demo comprises the following files:

  1. server.py A server for MicroPython Unix build on a wired network connection.
  2. application.py Server-side application demo.
  3. client_w.py A client for ESP8266.
  4. client_id.py Each client must have a unique ID provided by this file. Also holds server IP and port number.
  5. primitives.py A stripped down version of asyn.py. This is used by server and client. The aim is RAM saving on ESP8266.

5.1 The client

The principal purpose of the demo is to expose the client code. A more usable version could be written where the boilerplate code was separated from the application code, and I will do this. This version deliberately lays bare its workings for study.

It is started by instantiating a Client object. The constructor assumes that the ESP8266 will auto-connect to an existing network. It starts a run() coroutine which executes an infinite loop, initially waiting for a WiFi connection. It then launches reader and writer coroutines. The writer coro periodically sends a JSON encoded list, and the remote endpoint does likewise.

The client's readline() function times out after 1.5 seconds, issuing an OSError. If this occurs, the reader coro terminates clearing the .rok (reader OK) flag. This causes the run() to terminate the writer (and _keepalive) coros. When all coros have died, run() downs the WiFi for double the timeout period to ensure that the remote will detect and respond to the outage. The loop then repeats, attempting again to establish a connection.
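A read with timeout on a nonblocking socket might look like the following. It is a sketch written for CPython, not a copy of the client code: MicroPython's nonblocking sockets signal "no data yet" differently, so the `except` clause would need adapting, and the one-byte reads are chosen for simplicity rather than speed.

```python
import asyncio
import socket
import time


async def readline(sock, timeout_s=1.5):
    """Return one non-blank line from a nonblocking socket, or raise OSError."""
    line = b""
    start = time.monotonic()
    while True:
        if line.endswith(b"\n"):
            if len(line) > 1:
                return line
            line = b""                   # blank line: a keepalive
            start = time.monotonic()     # which resets the timeout
        try:
            chunk = sock.recv(1)         # one byte: never read past "\n"
        except BlockingIOError:
            chunk = None                 # nothing available yet
        if chunk == b"":
            raise OSError("connection closed by peer")
        if chunk:
            line += chunk
        elif time.monotonic() - start > timeout_s:
            raise OSError("read timeout")
        else:
            await asyncio.sleep(0.01)    # let other coroutines run
```

The caller treats any `OSError` as an outage: the reader terminates and the recovery sequence begins.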

The writer coro has similar logic ensuring that if it encounters an error the other coros will be terminated.

Both reader and writer start by instantiating a socket and connecting to the appropriate port. The socket is set to nonblocking and the unique client ID (retrieved from client_id.py) is sent to the server. This enables the server to associate a connection with a specific client.

When writer has connected to the server it starts the _keepalive method: this sends a blank line at a rate guaranteed to ensure that at least one will be sent every timeout interval.

The server also sends a blank line periodically. This serves to reset the timeout on the client's readline() method, preventing a timeout from occurring. Thus outage detection is effectively transparent: client and server applications can send data at any rate.
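A keepalive coroutine along these lines would satisfy the requirement at either endpoint. The sketch is an assumption-laden outline: `write` is a hypothetical awaitable that sends bytes over the link, and the `fraction` of two thirds is an illustrative safety margin, not a value taken from the demo.

```python
import asyncio


async def keepalive(write, timeout_s, fraction=2 / 3):
    """Send a blank line often enough that one arrives in every timeout interval."""
    while True:
        await write(b"\n")                          # blank line = keepalive
        await asyncio.sleep(timeout_s * fraction)   # strictly less than timeout
```

The task is started once the writer has connected and is cancelled as part of the recovery procedure.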

5.2 The application

In this demo the application is assumed to reside on the server machine. This enables a substantial simplification with the timeout, keepalive and error handling being devolved to the server.

In this demo up to 4 clients with IDs 1-4 are each served by an instance of App. However, they could equally be served by different application classes. When an App is instantiated the start() coro runs, which waits for the server to establish a connection with the correct client. It retrieves the connection, starts reader and writer coros, and quits.

The reader and writer coros need take no account of link status. They communicate using server readline() and write() methods which will pause for the duration of any outage.

5.3 The server

The application starts the server by launching the server.run coro. The argument defines the read timeout value which should be the same as that on the client. The value (in ms) determines the keepalive rate and the minimum downtime of an outage.

The server.run coro runs forever awaiting incoming connections. When one occurs a socket is created and a line containing the client ID is read. If the client ID is not in the dictionary of clients (Connection.connections) a Connection is instantiated for that client and placed in the dictionary. On subsequent connections of that client, the Connection will be retrieved from the dictionary. This is done by classmethod Connection.go, which assigns the socket to that Connection instance.
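The registry behaviour can be sketched as follows. Only the names `Connection.connections` and `Connection.go` come from the description above. The signature of `go`, the constructor and the `sock` attribute are assumptions made for illustration.

```python
class Connection:
    connections = {}          # maps client ID -> Connection instance

    def __init__(self, client_id):
        self.client_id = client_id
        self.sock = None      # assigned each time the client (re)connects

    @classmethod
    def go(cls, client_id, sock):
        # First contact from this client: create and register a Connection.
        if client_id not in cls.connections:
            cls.connections[client_id] = cls(client_id)
        # New or returning client: attach the fresh socket.
        conn = cls.connections[client_id]
        conn.sock = sock
        return conn
```

Keeping one long-lived object per client ID means that application code holding a reference to a `Connection` survives an outage and simply sees a new socket once the client reconnects.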

The server provides readline and write methods. In the event of an outage they will pause for the duration. Message transmission is not guaranteed: if an outage occurs after transmission has commenced, the message will be lost.

In testing through hundreds of outages, no instances of corrupted or partial messages occurred. Presumably TCP/IP ensures this.