Strings and Unicode: bytes, code points, and graphemes across eight languages
What does length even mean? Eight languages, three answers - bytes, code points, and graphemes.
Ask eight languages how long the string café is and you can get three different
numbers. The string is four user-perceived characters. But if the é is a
precomposed code point (U+00E9) it is four code points and five UTF-8 bytes; if
it is e followed by a combining acute accent (U+0301) it is five code points and
six bytes. Every honest answer to "how long is this string?" first has to answer
"length in what unit?" - bytes, Unicode code points (scalar values, once surrogates are excluded), or grapheme
clusters (the things humans call characters). The languages below stake out
strikingly different positions on that question.
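The arithmetic above can be checked with a short Python sketch. The helper name measure is hypothetical, and its grapheme count is only an approximation that folds combining marks into the preceding character; it is not the full UAX #29 segmentation algorithm and will miscount emoji sequences such as flags.

```python
import unicodedata

def measure(s: str) -> tuple[int, int, int]:
    """Return (utf8_bytes, code_points, approx_graphemes) for s."""
    utf8_bytes = len(s.encode("utf-8"))
    code_points = len(s)
    # Approximation: a code point with a non-zero canonical combining class
    # attaches to the preceding character instead of starting a new cluster.
    approx_graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return utf8_bytes, code_points, approx_graphemes

precomposed = "caf\u00e9"    # é as one code point
decomposed = "cafe\u0301"    # e + combining acute accent

print(measure(precomposed))  # (5, 4, 4)
print(measure(decomposed))   # (6, 5, 4)
```

Writing the two forms with escapes keeps the example independent of how an editor normalizes the source file.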
Bytes by default: Rust and Go
Rust and Go both store text as UTF-8 bytes and both make that physical layout
visible, but they expose it differently. A Rust String / &str is guaranteed
valid UTF-8. len() returns the byte count. Integer indexing (s[0]) does not
compile; slicing by byte range (&s[0..3]) is allowed only on character
boundaries - slice through the middle of a multi-byte sequence and you panic.
To get code points you iterate explicitly:
let s = "café"; // precomposed
assert_eq!(s.len(), 5); // bytes
assert_eq!(s.chars().count(), 4); // char = one Unicode scalar value
A Rust char is exactly one scalar value (4 bytes wide), never a grapheme.
Clusters live in the unicode-segmentation crate, deliberately outside std.
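The boundary rule follows from the UTF-8 encoding itself: continuation bytes always have the bit pattern 10xxxxxx, so a byte offset is a character boundary exactly when it does not land on one. The Python sketch below mirrors that check on raw bytes. The helper is illustrative (Rust's str has a method of the same name), and it assumes the input is valid UTF-8.

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """True if index falls between two encoded code points in valid UTF-8."""
    if index == 0 or index == len(data):
        return True
    if index < 0 or index > len(data):
        return False
    # Continuation bytes look like 0b10xxxxxx; a boundary never lands on one.
    return (data[index] & 0xC0) != 0x80

data = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9', 5 bytes
boundaries = [i for i in range(len(data) + 1) if is_char_boundary(data, i)]
print(boundaries)  # [0, 1, 2, 3, 5]
```

Offset 4 is missing from the list because it points at the second byte of é, which is where a Rust slice would panic.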
Go is looser. A string is an immutable sequence of bytes that need not be valid
UTF-8; len(s) is bytes, and for i, r := range s decodes UTF-8 and yields each
rune (Go's name for a code point) along with its byte offset. Indexing with
s[i] gives you raw bytes; iterating with range gives you runes. Invalid bytes
decode to U+FFFD rather than erroring, which is convenient and occasionally a
footgun.
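To make the range behaviour concrete, here is a rough Python emulation. The generator runes_with_offsets is a hypothetical helper that pairs each code point with the byte offset at which its UTF-8 encoding would start. The lenient decode at the end shows the U+FFFD substitution for a single stray byte only; for longer malformed sequences Python's replacement granularity may differ from Go's, so treat it as an analogy.

```python
def runes_with_offsets(s: str):
    """Yield (byte_offset, code_point) pairs, like Go's `for i, r := range s`."""
    offset = 0
    for ch in s:
        yield offset, ch
        offset += len(ch.encode("utf-8"))  # advance by the encoded width

pairs = list(runes_with_offsets("h\u00e9!"))
print(pairs)  # [(0, 'h'), (1, 'é'), (3, '!')]

# Lenient decoding: an invalid byte becomes U+FFFD instead of raising.
lenient = b"caf\xff".decode("utf-8", errors="replace")
print(lenient == "caf\ufffd")  # True
```

The jump from offset 1 to offset 3 is the two-byte é; that gap is what distinguishes iterating runes from iterating bytes.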
Code points as the unit: Python
Python 3 picks the code point as its atom and hides the storage. A str is a
sequence of Unicode code points (lone surrogates are allowed, so not strictly
scalar values); CPython uses a flexible internal representation (Latin-1,
UCS-2, or UCS-4, chosen per string) but you never see it. len counts code
points, and you cross into bytes only by explicit .encode():
s = "café" # 4 code points
import unicodedata
d = unicodedata.normalize("NFD", s) # 5 code points
print(len(s), len(d)) # 4 5
print(len(s.encode("utf-8"))) # 5 bytes
print(len("🇫🇷")) # 2 (two regional indicators)
That last line is the recurring lesson: Python's "character" is a code point, so a
flag emoji is two. Grapheme clustering needs a third-party library such as
regex or grapheme. Indexing is O(1) by code point because each string's buffer
is fixed width, which is why CPython does not use UTF-8 as its primary storage.
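The flexible representation can be modelled in a few lines. The function below is a hypothetical sketch of the width rule described in PEP 393 (the widest code point in the string sets the width for the whole string); it does not inspect CPython's actual memory layout.

```python
def pep393_width(s: str) -> int:
    """Bytes per code point CPython would pick for s under PEP 393 (a model)."""
    highest = max(map(ord, s), default=0)
    if highest < 0x100:
        return 1   # fits in Latin-1
    if highest < 0x10000:
        return 2   # fits in the Basic Multilingual Plane
    return 4       # needs the supplementary planes

samples = ["cafe", "caf\u00e9", "\u4e16\u754c", "\U0001F1EB\U0001F1F7"]
for text in samples:
    print(len(text), pep393_width(text))
# 4 1
# 4 1
# 2 2
# 2 4
```

Whatever width is chosen, every code point in that string occupies the same number of bytes, which is what makes indexing by code point constant time.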
Bytes that learn to be characters: Perl and Raku
Perl is the cautionary tale and the redemption arc. A Perl scalar holds bytes
unless you tell it otherwise; use utf8 marks the source as UTF-8 so literals are
decoded into characters, and length counts whatever the string holds, decoded
characters or raw bytes. The same literal can be four or five long depending on
whether it was decoded:
use v5.10; # enables say
use utf8; # source literals are UTF-8
my $s = "café"; # 4 characters
say length $s; # 4
my $flag = "🇫🇷";
say length $flag; # 2 code points
my $g = () = $flag =~ /\X/g;
say $g; # 1 grapheme cluster, via \X
Crucially Perl's regex engine knows about graphemes: \X matches one extended
grapheme cluster, so it counts the flag as one. The trap is forgetting to decode
on input or encode on output, which silently mixes byte and character strings -
"there is more than one way to do it" cuts both ways here.
Raku (the language formerly entangled with Perl) takes the boldest stance of
all: its Str is a sequence of graphemes by default, normalized to NFC and
abstracted as opaque NFG synthetic code points. "café".chars is 4 whether the
accent is precomposed or combining, because both normalize to the same grapheme.
You drop to code points with .codes and to bytes with .encode. Raku is the
only language in this survey where the default unit matches human intuition out
of the box, at the cost of doing real Unicode work on every string.
say "café".chars; # 4 graphemes (NFC-normalized, accent-independent)
say "🇫🇷".chars; # 1 grapheme
say "café".codes; # 4 either way: NFC recomposes a combining accent
say "café".NFD.codes; # 5 code points once explicitly decomposed
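NFC alone does not give grapheme semantics, which is why Raku layers synthetic code points on top of it. The Python sketch below uses the standard unicodedata module to show both halves: the two spellings of café collapse to the same NFC string, but a base letter with no precomposed form (q plus a combining acute is used here as an example) stays two code points after normalization.

```python
import unicodedata

precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

# NFC recomposes e + U+0301 into the single code point U+00E9.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == precomposed, len(nfc))   # True 4

# q + combining acute has no precomposed code point, so NFC cannot shrink it.
q_acute = unicodedata.normalize("NFC", "q\u0301")
print(len(q_acute))                   # 2
```

A grapheme-first string type has to count that second case as one character, so it needs a representation beyond plain NFC.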
The typed and the functional: Haskell and OCaml
Haskell's String is the historical embarrassment: type String = [Char], a
lazy linked list of code points. A Char is a Unicode code point, so length
counts code points, but the list representation is slow and memory-heavy. Real
code uses Data.Text (code-point-oriented, UTF-8 internally since text-2.0) for
human text or Data.ByteString for bytes. The type system keeps the two from
mixing without an explicit encodeUtf8 / decodeUtf8, which is Haskell's whole
ethos - the byte/character distinction is a type distinction:
import qualified Data.Text as T
T.length (T.pack "café") -- 4 (code points)
Neither base nor text does grapheme segmentation; that is library territory.
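For comparison, Python draws the same line between text and bytes, though it enforces it at run time rather than at compile time. In this sketch can_concat is a hypothetical helper; encode and decode play the roles of encodeUtf8 and decodeUtf8.

```python
def can_concat(a, b) -> bool:
    """Report whether a + b is accepted, without letting the error escape."""
    try:
        a + b
        return True
    except TypeError:
        return False

text = "caf\u00e9"
raw = text.encode("utf-8")                       # str -> bytes

print(can_concat(text, raw))                     # False: str and bytes do not mix
print(can_concat(text, raw.decode("utf-8")))     # True: decode first
print(len(text), len(raw))                       # 4 5
```

The difference from Haskell is when the mistake surfaces: there the mixed expression is rejected by the type checker before the program runs.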
OCaml is the most byte-honest of the typed languages. Its string is simply an
immutable byte sequence, and String.length is bytes, full stop: in a UTF-8
source file with a precomposed é, String.length "café" is 5. The standard
library historically did almost no
Unicode at all - correct code-point work means a library (uutf for decoding,
uucp for properties, uuseg for grapheme segmentation). OCaml does not pretend
your bytes are characters; it makes you ask.
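What a decoding library has to do with those bytes can be sketched in Python. The function decode_utf8 below is a hypothetical, minimal decoder: it reads the lead byte to find the sequence width, then appends six payload bits per continuation byte. It assumes well-formed input and performs no validation, which a real decoder must.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode well-formed UTF-8 into code points; no validation is attempted."""
    code_points = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:            # 0xxxxxxx: 1 byte
            width, value = 1, lead
        elif lead < 0xE0:          # 110xxxxx: 2 bytes
            width, value = 2, lead & 0x1F
        elif lead < 0xF0:          # 1110xxxx: 3 bytes
            width, value = 3, lead & 0x0F
        else:                      # 11110xxx: 4 bytes
            width, value = 4, lead & 0x07
        for byte in data[i + 1 : i + width]:
            value = (value << 6) | (byte & 0x3F)   # append 6 payload bits
        code_points.append(value)
        i += width
    return code_points

raw = "A\u00e9\u2192\u4e16".encode("utf-8")
print(len(raw))                              # 9
print([hex(cp) for cp in decode_utf8(raw)])  # ['0x41', '0xe9', '0x2192', '0x4e16']
```

Nine bytes become four code points; the byte length is the only number a plain byte-string API can report without this step.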
Guji: code points, verified
Guji (in-house, v0.1-alpha as of 2026-06-19) sits near Python's position: its
Str iterates as code points, and split("") walks scalar values rather than
bytes. Compiled and statically typed, it runs the same on the reference
interpreter and the native AOT backend - the output below is byte-identical from
both guji-fresh file.guji and guji-fresh build -o out file.guji:
sub main() {
$precomposed = "café" # e-acute as one code point
$decomposed = "café" # "cafe" + combining accent U+0301
$pc = $precomposed.split("").count()
$dc = $decomposed.split("").count()
print("precomposed code points: $pc") # 4
print("decomposed code points: $dc") # 5
@cps = "Aé→世".split("")
print("code points: @cps") # [A, é, →, 世]
match "世界=42" ~~ /(?<k>\w+)=(?<v>\d+)/ {
Some($m) {
$k = $m<k>.unwrap_or("?")
$v = $m<v>.unwrap_or("?")
print("key=$k value=$v") # key=世界 value=42
}
None { print("no match") }
}
}
The split returns 4 for the precomposed é and 5 for the combining form,
confirming Guji counts code points, not graphemes. The match shows that its
first-class regex (the ~~ operator returning Option[Match], with named captures
read via $m<name>) handles non-ASCII text, with \w matching the CJK key. One
sharp edge worth knowing at this alpha: \u{...} escapes are not yet interpreted
in string literals, so multi-byte characters must be typed directly into the
source (itself UTF-8). Grapheme-cluster length is not in the standard library
yet - which puts Guji in good company with most of this list.
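For readers without a Guji toolchain, the same named-capture match can be reproduced with Python's re module, where \w is Unicode-aware for str patterns by default. This is a Python comparison only and says nothing about how Guji's regex engine is implemented.

```python
import re

# Same pattern shape as the Guji example: a word key, "=", a numeric value.
match = re.fullmatch(r"(?P<k>\w+)=(?P<v>\d+)", "\u4e16\u754c=42")
if match:
    print(f"key={match['k']} value={match['v']}")   # key=世界 value=42
else:
    print("no match")
```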
The takeaway
Three positions emerge. The byte camp (OCaml, plus Rust and Go for their default
len) is fast and honest but offloads Unicode onto you. The code-point camp
(Python, Haskell's Text, Guji, and Perl on decoded strings) matches the Unicode
data model and handles most real text correctly, while still miscounting emoji
and combining sequences as multiple "characters." Only Raku defaults to
graphemes - the unit humans actually mean. No layer is wrong; they answer
different questions. The bug is almost always reaching for length without first
deciding which of the three you are counting.