Vladimir Sedach

Have Emacs - Will Hack

November 16, 2010

Character encoding is about algorithms, not data structures

Topic: Software engineering

Both SBCL and Clozure Common Lisp represent characters using 4 bytes. There has already been a lot of discussion about that choice. This post tries to offer more insight into the trade-off involved and how you can apply it to your Lisp applications.
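
You can see where the four bytes go at the REPL. On SBCL, for instance (a sketch; exact behavior varies by implementation), CHAR-CODE-LIMIT covers every Unicode code point, which takes 21 bits, so a general CHARACTER vector stores each element in a 32-bit cell:

    CL-USER> char-code-limit
    1114112
    CL-USER> (char-code #\GREEK_SMALL_LETTER_LAMDA)
    955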

The one thing most people seem to agree on is that UTF-16 has a lot of problems (both CLISP and CMUCL use UCS-2, the fixed-width encoding from which UTF-16 is derived, as their internal character representation).
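
Part of the problem is that UTF-16 only looks fixed-width: any code point above #xFFFF has to be split into a surrogate pair, so UTF-16 is really a variable-length encoding too. A minimal sketch of that split (the function name is mine):

    ;; Standard UTF-16 surrogate-pair arithmetic; the function name is illustrative.
    (defun surrogate-pair (code-point)
      (assert (<= #x10000 code-point #x10FFFF))
      (let ((c (- code-point #x10000)))
        (values (+ #xD800 (ldb (byte 10 10) c))    ; leading (high) surrogate
                (+ #xDC00 (ldb (byte 10 0) c)))))  ; trailing (low) surrogate

    ;; (surrogate-pair #x1D11E) => #xD834, #xDD1E   ; MUSICAL SYMBOL G CLEF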

The important difference between UTF-32 on the one hand and UTF-8 and UTF-16 on the other is not primarily a question of string size, but of algorithms.

Variable-length encodings work perfectly fine for stream algorithms. But string algorithms are written on the assumption of constant-time random access and of being able to overwrite parts of a string in place without consing. Neither assumption can be satisfied by a variable-length encoding, and most string algorithms would not run with acceptable performance on top of one.
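
To make the random-access point concrete, here is roughly what "give me character N" has to do when the string is stored as UTF-8 octets (the function name is mine):

    ;; Byte offset of the Nth character (0-based) in a UTF-8 octet vector.
    ;; Continuation bytes look like 10xxxxxx; every other byte starts a character.
    ;; This is O(length) per lookup -- there is no way to jump straight to character N.
    (defun utf8-char-offset (octets n)
      (loop with chars-seen = -1
            for i from 0 below (length octets)
            unless (= (logand (aref octets i) #xC0) #x80)
              do (incf chars-seen)
            when (= chars-seen n)
              return i
            finally (error "only ~D characters in the string" (1+ chars-seen))))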

What about immutable strings? Random access is still not constant-time for variable-length encodings, and all the mutator operations are gone. In effect, most immutable-string algorithms end up being stream algorithms that cons a lot.

Chances are, you are already using stream algorithms on your strings even if you are not aware of it. Any kind of search over a string really treats that string as a stream. And if you are doing string concatenation, you are really treating your strings as though they were immutable; consider using with-output-to-string to cut down on consing and to simplify your code.
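
For example, building a result out of pieces with repeated CONCATENATE conses a fresh intermediate string at every step, while the stream version writes everything once (a sketch; the function names are mine):

    ;; Conses an intermediate string on every iteration:
    (defun join-concat (pieces)
      (let ((result ""))
        (dolist (piece pieces result)
          (setf result (concatenate 'string result piece)))))

    ;; The same thing as a stream algorithm: one pass, one result string.
    (defun join-stream (pieces)
      (with-output-to-string (out)
        (dolist (piece pieces)
          (write-string piece out))))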

One thing that is special about UTF-8 is that it is the de facto character encoding standard of the web. When you read UTF-8 into a Lisp string, what you are really doing is decoding the octets and copying each character into a freshly allocated string.
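
That decode-and-copy step is easy to see. Assuming the babel library for the conversion (flexi-streams works similarly), the call below allocates a brand-new 4-bytes-per-character Lisp string and decodes every octet into it:

    ;; "hello" as the UTF-8 octets a socket would hand you.
    (defparameter *request-body*
      (make-array 5 :element-type '(unsigned-byte 8)
                    :initial-contents '(104 101 108 108 111)))

    (babel:octets-to-string *request-body* :encoding :utf-8) ; => "hello", freshly consed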

Most web application patterns boil down to searching for and extracting some string from the HTTP request, and then either using that string as a retrieval key in a database or splicing it together with other strings to form the reply. All of these actions can be modeled in terms of streams.

One of the things that makes John Fremlin's tpd2 fast is that it dispenses with the step of decoding and copying the incoming UTF-8 data into a Lisp string (antiweb does the same thing). Using compiler macros and the cl-irregsexp library, all of the searching and templating is done directly on byte arrays holding the UTF-8-encoded data (the compiler macros convert Lisp string literals into UTF-8 byte arrays at compile time). The result is a UTF-8 byte array that can be sent straight back to the web browser, bypassing another character-encoding and copying step.
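
A much-simplified illustration of the idea (not tpd2's actual code): keep everything as UTF-8 octets, pre-encode the literals you search for, and extract and splice at the byte level. Here babel:string-to-octets stands in for what tpd2's compiler macros do to string literals at compile time, and the parameter-extraction function is hypothetical:

    (defparameter *name-key*
      (babel:string-to-octets "name=" :encoding :utf-8))

    (defun extract-name (request-octets)
      "Return the octets following \"name=\" up to the next #\\& (byte 38), or NIL."
      (let ((start (search *name-key* request-octets)))
        (when start
          (let* ((value-start (+ start (length *name-key*)))
                 (value-end (or (position 38 request-octets :start value-start)
                                (length request-octets))))
            ;; Still UTF-8 octets: ready to be spliced into the reply and written
            ;; to the socket without ever being decoded into a Lisp string.
            (subseq request-octets value-start value-end)))))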

I read somewhere that the Exokernel operating system allowed this concept to be pushed even further down the software stack by letting applications send back pre-formatted TCP/IP packets.

In addition to skipping the overhead of copying and re-encoding characters, working on UTF-8 byte sequences directly means you can use algorithms that exploit a small alphabet (there are only 256 possible byte values) to achieve greater performance; the Boyer–Moore–Horspool string-searching algorithm is one example.
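
Here is a sketch of Boyer–Moore–Horspool specialized to octet vectors; the bad-character shift table needs only 256 entries when it is indexed by byte value, which would hardly be practical if it had to cover all 1,114,112 possible code points:

    ;; Boyer-Moore-Horspool search over octet vectors.
    ;; Returns the position of the first match, or NIL.
    (defun horspool-search (needle haystack)
      (let* ((m (length needle))
             (n (length haystack))
             (shift (make-array 256 :initial-element m)))
        (when (zerop m)
          (return-from horspool-search 0))
        ;; Bytes that occur in the needle (except its last byte) allow smaller skips.
        (loop for i from 0 below (1- m)
              do (setf (aref shift (aref needle i)) (- m 1 i)))
        (loop with pos = 0
              while (<= pos (- n m))
              do (if (mismatch needle haystack :start2 pos :end2 (+ pos m))
                     (incf pos (aref shift (aref haystack (+ pos m -1))))
                     (return pos)))))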