
Strings and Unicode: bytes, code points, and graphemes across eight languages

What does length even mean? Eight languages, three answers - bytes, code points, and graphemes.


Ask eight languages how long the string café is and you can get three different numbers. The string is four user-perceived characters. But if the é is a precomposed code point (U+00E9) it is four code points and five UTF-8 bytes; if it is e followed by a combining acute accent (U+0301) it is five code points and six bytes. Every honest answer to "how long is this string?" first has to answer "length in what unit?" - bytes, Unicode scalar values (code points), or grapheme clusters (the things humans call characters). The languages below stake out strikingly different positions on that question.

Bytes by default: Rust and Go

Rust and Go both store text as UTF-8 bytes and both make that physical layout visible, but they expose it differently. A Rust String / &str is guaranteed valid UTF-8. len() returns the byte count, and slicing takes byte ranges that must fall on character boundaries: cut through the middle of a multi-byte sequence and the program panics. (Indexing by a single integer, as in s[0], does not compile at all.) To get code points you iterate explicitly:

fn main() {
    let s = "café";                      // precomposed é (U+00E9)
    assert_eq!(s.len(), 5);              // bytes
    assert_eq!(s.chars().count(), 4);    // char = one Unicode scalar value
}

A Rust char is exactly one scalar value (4 bytes wide), never a grapheme. Clusters live in the unicode-segmentation crate, deliberately outside std.

Go is looser. A string is an immutable sequence of bytes that is not required to be valid UTF-8; len(s) is bytes, and for i, r := range s decodes UTF-8 and yields each rune (Go's name for a code point) along with its starting byte offset. Indexing with s[i] gives you raw bytes; iterating with range gives you runes. An invalid byte decodes to U+FFFD and the loop advances by one byte rather than erroring, which is convenient and occasionally a footgun.
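Go's two views of a string can be mimicked in a few lines of Python. This is a rough analogue, not Go code: the helper name range_runes is hypothetical, and it assumes the range behavior described above, where an offset that starts no valid UTF-8 sequence yields U+FFFD and advances by a single byte.

```python
def range_runes(data: bytes):
    """Yield (byte_offset, code_point) pairs, mimicking Go's `for i, r := range s`."""
    i = 0
    while i < len(data):
        for width in (1, 2, 3, 4):          # a UTF-8 sequence is 1 to 4 bytes long
            try:
                ch = data[i:i + width].decode("utf-8")
            except UnicodeDecodeError:
                continue                     # truncated or invalid: try a longer slice
            yield i, ch
            i += width
            break
        else:
            # No valid sequence starts here: emit U+FFFD and skip a single byte.
            yield i, "\ufffd"
            i += 1


raw = "café".encode("utf-8")
print(list(raw))                     # index-style view: five raw byte values
print(list(range_runes(raw)))        # range-style view: four (offset, code point) pairs
print(list(range_runes(b"a\xffb")))  # the stray 0xFF byte comes back as U+FFFD
```

The offsets in the second line jump by the encoded width of each code point, which is why a byte index and a rune index are not interchangeable.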

Code points as the unit: Python

Python 3 picks the code point as its atom and hides the storage. A str is a sequence of Unicode code points (lone surrogates are permitted, so not strictly scalar values); CPython uses a flexible internal representation (Latin-1, UCS-2, or UCS-4, chosen per string by its widest code point) but you never see it. len counts code points, and you cross into bytes only by explicit .encode():

import unicodedata

s = "café"                              # 4 code points (precomposed é)
d = unicodedata.normalize("NFD", s)     # 5 code points
print(len(s), len(d))                   # 4 5
print(len(s.encode("utf-8")))           # 5 bytes
print(len("🇫🇷"))                        # 2  (two regional indicators)

That last line is the recurring lesson: Python's "character" is a code point, so a flag emoji is two. Grapheme clustering needs a third-party library such as regex (whose \X matches one grapheme cluster) or grapheme. Indexing is O(1) by code point because each string's buffer uses a single fixed width per code point, which is why CPython does not use variable-width UTF-8 as its primary representation.
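To make the gap between code points and graphemes concrete without a third-party dependency, here is a deliberately rough sketch using only the standard library. The function names are hypothetical, and this is not an implementation of the Unicode segmentation rules (UAX #29): it only attaches combining marks to their base and pairs regional indicators, so ZWJ emoji sequences, variation selectors, and Hangul jamo are still miscounted.

```python
import unicodedata


def is_regional_indicator(ch: str) -> bool:
    """True for U+1F1E6..U+1F1FF, the letters that flag emoji are built from."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def rough_graphemes(s: str) -> list:
    """Approximate grapheme clusters; NOT a full UAX #29 implementation."""
    clusters = []
    for ch in s:
        prev = clusters[-1] if clusters else ""
        if prev and unicodedata.combining(ch):
            clusters[-1] += ch               # attach a combining mark to its base
        elif len(prev) == 1 and is_regional_indicator(prev) and is_regional_indicator(ch):
            clusters[-1] += ch               # pair two regional indicators into one flag
        else:
            clusters.append(ch)
    return clusters


decomposed = "cafe\u0301"                     # e + combining acute accent
flag = "\U0001F1EB\U0001F1F7"                 # regional indicators F + R
print(len(decomposed), len(rough_graphemes(decomposed)))   # 5 4
print(len(flag), len(rough_graphemes(flag)))               # 2 1
```

For real text, a library that tracks the current Unicode segmentation data is the safer choice; the sketch only shows why len and "what the user sees" diverge.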

Bytes that learn to be characters: Perl and Raku

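Perl is the cautionary tale and the redemption arc. A Perl string holds bytes until something decodes it: use utf8 tells Perl the source file is UTF-8, so its literals become character strings, while data read from files or sockets needs an explicit decode or an :encoding layer. length always counts the string's elements, which are bytes in an undecoded string and code points in a decoded one. The same literal can be four or five long depending on whether it was decoded: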

use v5.10;                      # enables say
use utf8;                       # source literals are decoded as UTF-8
my $s = "café";                 # 4 characters
say length $s;                  # 4
my $flag = "🇫🇷";
say length $flag;               # 2 code points
my $g = () = $flag =~ /\X/g;    # count \X matches in list context
say $g;                         # 1 grapheme cluster, via \X

Crucially Perl's regex engine knows about graphemes: \X matches one extended grapheme cluster, so it counts the flag as one. The trap is forgetting to decode on input or encode on output, which silently mixes byte and character strings - "there is more than one way to do it" cuts both ways here.

Raku (the language formerly entangled with Perl) takes the boldest stance of all: its Str is a sequence of graphemes by default. Strings are normalized to NFC, and any cluster that still spans several code points is represented internally as a single synthetic code point (NFG, Normalization Form Grapheme). "café".chars is 4 whether the accent is precomposed or combining, because both normalize to the same grapheme. You drop to code points with .codes (counted after that normalization) and to bytes with .encode. Of the eight languages here, Raku is the only one whose default unit matches human intuition out of the box, at the cost of doing real Unicode work on every string.

say "café".chars;     # 4 graphemes (NFC-normalized, accent-independent)
say "🇫🇷".chars;       # 1 grapheme
say "café".codes;     # 4 code points: the Str is stored NFC-normalized
say "café".NFD.codes; # 5 code points after explicit decomposition
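The normalization step that makes the two spellings indistinguishable can be sketched in Python. This is an analogue of the idea, not a description of Raku's internals: Python leaves normalization to the caller, and the variable names here are illustrative.

```python
import unicodedata

precomposed = "caf\u00e9"        # é as the single code point U+00E9
decomposed = "cafe\u0301"        # e followed by combining acute accent U+0301

print(precomposed == decomposed)                      # False: different code points
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed, len(nfc))                   # True 4
```

A language that normalizes on input pays this cost once per string and in exchange never has to explain why two visually identical strings compare unequal.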

The typed and the functional: Haskell and OCaml

Haskell's String is the historical embarrassment: type String = [Char], a lazy linked list of code points. A Char is a Unicode code point, so length counts code points, but the list representation is slow and memory-heavy. Real code uses Data.Text (code-point-oriented, UTF-8 internally since text-2.0) for human text or Data.ByteString for bytes. The type system keeps the two from mixing without an explicit encodeUtf8 / decodeUtf8, which is Haskell's whole ethos - the byte/character distinction is a type distinction:

import qualified Data.Text as T

main :: IO ()
main = print (T.length (T.pack "café"))   -- 4 (code points, precomposed é)

Neither base nor text does grapheme segmentation; that is library territory.

OCaml is the most byte-honest of the typed languages. Its string is simply an immutable byte sequence, and String.length is bytes, full stop: String.length "café" is 5. The standard library historically did almost no Unicode at all - correct code-point work means a library (uutf for decoding, uucp for properties, uuseg for grapheme segmentation). OCaml does not pretend your bytes are characters; it makes you ask.

Guji: code points, verified

Guji (in-house, v0.1-alpha as of 2026-06-19) sits near Python's position: its Str iterates as code points, and split("") walks scalar values rather than bytes. Compiled and statically typed, it runs the same on the reference interpreter and the native AOT backend - the output below is byte-identical from both guji-fresh file.guji and guji-fresh build -o out file.guji:

sub main() {
    $precomposed = "café"          # e-acute as one code point
    $decomposed  = "café"          # "cafe" + combining accent U+0301
    $pc = $precomposed.split("").count()
    $dc = $decomposed.split("").count()
    print("precomposed code points: $pc")   # 4
    print("decomposed code points:  $dc")    # 5

    @cps = "Aé→世".split("")
    print("code points: @cps")               # [A, é, →, 世]

    match "世界=42" ~~ /(?<k>\w+)=(?<v>\d+)/ {
        Some($m) {
            $k = $m<k>.unwrap_or("?")
            $v = $m<v>.unwrap_or("?")
            print("key=$k value=$v")          # key=世界 value=42
        }
        None { print("no match") }
    }
}

The split returns 4 for the precomposed é and 5 for the combining form, confirming guji counts code points, not graphemes. The match shows that its first-class regex (the ~~ operator returning Option[Match], with named captures read via $m<name>) handles non-ASCII text: \w+ matches the CJK run 世界. One sharp edge worth knowing at this alpha: \u{...} escapes are not yet interpreted in string literals, so multi-byte characters must be typed directly into the source (itself UTF-8). Grapheme-cluster length is not in the standard library yet - which puts guji in good company with most of this list.

The takeaway

Three positions emerge. The byte camp (OCaml, plus Rust and Go for their default len) is fast and honest but offloads Unicode onto you. The code-point camp (Python, Haskell's Text, guji, and Perl once its strings are decoded) matches the Unicode data model and handles most real text correctly, while still counting flag emoji and combining sequences as several "characters." Of these eight, only Raku defaults to graphemes - the unit humans actually mean. No layer is wrong; they answer different questions. The bug is almost always reaching for length without first deciding which of the three you are counting.