micropython/docs/develop/qstr.rst

.. _qstr:

MicroPython string interning
============================

MicroPython uses `string interning`_ to save both RAM and ROM.  This avoids
having to store duplicate copies of the same string.  Primarily, this applies to
identifiers in your code, as something like a function or variable name is very
likely to appear in multiple places in the code.  In MicroPython an interned
string is called a QSTR (uniQue STRing).

A QSTR value (with type ``qstr``) is a index into a linked list of QSTR pools.
QSTRs store their length and a hash of their contents for fast comparison during
the de-duplication process.  All bytecode operations that work with strings use
a QSTR argument.

Compile-time QSTR generation
----------------------------

In the MicroPython C code, any strings that should be interned in the final
firmware are written as ``MP_QSTR_Foo``.  At compile time this will evaluate to
a ``qstr`` value that points to the index of ``"Foo"`` in the QSTR pool.

A multi-step process in the ``Makefile`` makes this work.  In summary this
process has three parts:

1. Find all ``MP_QSTR_Foo`` tokens in the code.

2. Generate a static QSTR pool containing all the string data (including lengths
   and hashes).

3. Replace all ``MP_QSTR_Foo`` (via the preprocessor) with their corresponding
   index.

``MP_QSTR_Foo`` tokens are searched for in two sources:

1. All files referenced in ``$(SRC_QSTR)``.  This is all C code (i.e. ``py``,
   ``extmod``, ``ports/stm32``) but not including third-party code such as
   ``lib``.

2. Additional ``$(QSTR_GLOBAL_DEPENDENCIES)`` (which includes ``mpconfig*.h``).

*Note:* ``frozen_mpy.c`` (generated by mpy-tool.py) has its own QSTR generation
and pool.

Some additional strings that can't be expressed using the ``MP_QSTR_Foo`` syntax
(e.g. they contain non-alphanumeric characters) are explicitly provided in
``qstrdefs.h`` and ``qstrdefsport.h`` via the ``$(QSTR_DEFS)`` variable.

Processing happens in the following stages:

1. ``qstr.i.last`` is the concatenation of putting every single input file
   through the C pre-processor.  This means that any conditionally disabled code
   will be removed, and macros expanded.  This means we don't add strings to the
   pool that won't be used in the final firmware.  Because at this stage (thanks
   to the ``NO_QSTR`` macro added by ``QSTR_GEN_CFLAGS``) there is no
   definition for ``MP_QSTR_Foo`` it passes through this stage unaffected.  This
   file also includes comments from the preprocessor that include line number
   information.  Note that this step only uses files that have changed, which
   means that ``qstr.i.last`` will only contain data from files that have
   changed since the last compile.

2. ``qstr.split`` is an empty file created after running ``makeqstrdefs.py split``
   on qstr.i.last. It's just used as a dependency to indicate that the step ran.
   This script outputs one file per input C file,  ``genhdr/qstr/...file.c.qstr``,
   which contains only the matched QSTRs. Each QSTR is printed as ``Q(Foo)``.
   This step is necessary to combine the existing files with the new data
   generated from the incremental update in ``qstr.i.last``.

3. ``qstrdefs.collected.h`` is the output of concatenating ``genhdr/qstr/*``
   using ``makeqstrdefs.py cat``.  This is now the full set of ``MP_QSTR_Foo``'s
   found in the code, now formatted as ``Q(Foo)``, one-per-line, with duplicates.
   This file is only updated if the set of qstrs has changed.  A hash of the QSTR
   data is written to another file (``qstrdefs.collected.h.hash``) which allows
   it to track changes across builds.

4. Generate an enumeration, each entry of which maps a ``MP_QSTR_Foo`` to it's corresponding index.
   It concatenates ``qstrdefs.collected.h`` with ``qstrdefs*.h``, then it transforms
   each line from ``Q(Foo)`` to ``"Q(Foo)"`` so they pass through the preprocessor
   unchanged.  Then the preprocessor is used to deal with any conditional
   compilation in ``qstrdefs*.h``.  Then the transformation is undone back to
   ``Q(Foo)``, and saved as ``qstrdefs.preprocessed.h``.

5. ``qstrdefs.generated.h`` is the output of ``makeqstrdata.py``.  For each
   ``Q(Foo)`` in qstrdefs.preprocessed.h (plus some extra hard-coded ones), it outputs
   ``QDEF(MP_QSTR_Foo, (const byte*)"hash" "Foo")``.

Then in the main compile, two things happen with ``qstrdefs.generated.h``:

1. In qstr.h, each QDEF becomes an entry in an enum, which makes ``MP_QSTR_Foo``
   available to code and equal to the index of that string in the QSTR table.

2. In qstr.c, the actual QSTR data table is generated as elements of the
   ``mp_qstr_const_pool->qstrs``.

.. _`string interning`: https://en.wikipedia.org/wiki/String_interning

Run-time QSTR generation
------------------------

Additional QSTR pools can be created at runtime so that strings can be added to
them. For example, the code::

  foo[x] = 3

Will need to create a QSTR for the value of ``x`` so it can be used by the
"load attr" bytecode.

Also, when compiling Python code, identifiers and literals need to have QSTRs
created.  Note: only literals shorter than 10 characters become QSTRs.  This is
because a regular string on the heap always takes up a minimum of 16 bytes (one
GC block), whereas QSTRs allow them to be packed more efficiently into the pool.

QSTR pools (and the underlying "chunks" that store the string data) are allocated
on-demand on the heap with a minimum size.