Text Processing Through the Ages: Perl, Raku, Python, and guji
Four decades of teaching computers to read: from Perl's sigils and regexes to Raku's grammars, Python's batteries, and guji's first-class parsing.
Every programming language eventually has to answer the same humble question: how do you pull meaning out of a pile of text? Log lines, CSV files, email headers, HTML, half-broken data dumps — the world arrives as strings, and someone has to turn it into structure. The tools each language reaches for tell you a great deal about its values. Here is that story across four languages, ending with one you have probably never heard of.
Perl: the text-processing language
When Larry Wall released Perl 1.0 to the comp.sources.misc Usenet newsgroup
on December 18, 1987, he was building the missing rung between the shell and C.
awk and sed were too small; C was too big. Perl glued them together and made
regular expressions a first-class part of the syntax, not a library you imported.
That single decision shaped a generation of system administration. A whole
program could be a perl -e one-liner. Here is a word-frequency counter, the kind
of thing Perl was born to do:
use v5.10;    # enables say
my %count;
$count{$_}++ for split /\s+/, "the cat the dog the bird";
say "$_=$count{$_}" for sort keys %count;
# prints bird=1, cat=1, dog=1, the=3 (one per line, keys sorted alphabetically)
Three sigils wear their types on their sleeves ($ scalar, @ array, %
hash), and /\s+/ is just there, no import re required. Wall, a trained
linguist, designed Perl to feel like a natural language: redundant, expressive,
forgiving, with idioms for every dialect of programmer. The famous motto is
TMTOWTDI — "There's More Than One Way To Do It." But even Wall knew the danger
of that freedom. As he put it:
"Although the Perl Slogan is There's More Than One Way to Do It, I hesitate to make 10 ways to do something." — Larry Wall
That tension — expressive power versus too many ways — is the thread that runs through everything below.
Raku: grammars as a language feature
Perl 6 was announced in 2000 as a clean-slate redesign. It took so long, and diverged so far, that in October 2019 Larry Wall blessed renaming it Raku, formally splitting it from Perl 5 as a sister language. (The first stable spec, 6.c "Christmas", had shipped on December 25, 2015 via the Rakudo implementation.)
Raku kept the sigils but made them invariant — @array[0] stays @, no more
shimmering between $ and @ on access — and then it did something genuinely
new: it promoted regular expressions into full grammars. A Raku grammar is a
class whose methods are named rules, composable and recursive, capable of parsing
real languages:
grammar Email {
    token TOP    { <user> '@' <domain> }
    token user   { \w+ }
    token domain { \w+ '.' \w+ }
}
say Email.parse('ada@example.com')<user>; # 「ada」
This was a quiet revolution. Most languages treat parsing as a heavyweight
ceremony — bring in a parser generator, write a .y file, run a build step. Raku
said: parsing is text processing, and text processing is the language's job.
The cost was complexity; Raku is enormous, and that ambition is exactly why it took
fifteen years to stabilize.
Python: text processing with batteries included
Guido van Rossum first released Python in February 1991, and it grew up
with the opposite philosophy from Perl. Where Perl said TMTOWTDI, the Zen of
Python (printed by import this) says "There should be one-- and preferably only
one --obvious way to do it." Regexes are deliberately not in the syntax; they
live in the re module, visible and explicit.
import re
from collections import Counter
text = "the cat the dog the bird"
counts = Counter(re.findall(r"\w+", text))
print(counts.most_common(1)) # [('the', 3)]
Python's bet was readability and a deep standard library — "batteries included."
You reach for str.split, re, csv, json, collections.Counter. It is not
as terse as Perl and has no native grammars like Raku, but it is obvious, and a
team can maintain it. For two decades that trade-off won the data-wrangling world,
and today most text munging that used to be a Perl one-liner is a Python script.
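Python's nearest answer to the Raku Email grammar above is a regex with named groups rather than a grammar. A minimal sketch, assuming the same simplified address shape (word characters, one dot in the domain); the names EMAIL and parse_email are illustrative and not part of any library:

```python
import re

# Simplified address shape, mirroring the Raku grammar: \w+ @ \w+ . \w+
EMAIL = re.compile(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)")


def parse_email(text):
    """Return a dict of named captures, or None if text is not an address."""
    m = EMAIL.fullmatch(text)
    return m.groupdict() if m else None


print(parse_email("ada@example.com"))  # {'user': 'ada', 'domain': 'example.com'}
print(parse_email("not an address"))   # None
```

The captures come back as flat strings keyed by name. There is no nesting or rule reuse unless you build it yourself, which is the gap a native grammar fills.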
guji: text and grammars as primitives
Which brings us to guji — a young, statically typed, functional-first language
that treats text processing as its signature capability. guji studied this whole
lineage and made a pointed set of choices. It keeps Perl's sigils but, like Raku,
makes them invariant and part of the name: $x is always scalar, @xs always
a list, %m always a map. It adopts Python's creed almost verbatim — "one obvious
way" — and pairs it with immutable-by-default bindings.
The Perl word-counter above becomes a loop over immutable values. Maps themselves never mutate: set returns a new map, and the mut binding %counts is simply rebound to it on each pass:
sub main(): Int {
    $text = "the cat the dog the bird"
    @words = $text.split(" ")
    mut %counts = {}
    for $w in @words {
        $n = %counts.get($w).unwrap_or(0)
        %counts = %counts.set($w, $n + 1)
    }
    print(%counts.get("the").unwrap_or(0)) # 3
    0
}
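The "a modified map is a new map" discipline can be approximated in Python. This is a sketch of the idea, not of guji's implementation: the helper with_count is hypothetical, and copying the whole dict on each step is only for illustration (a real persistent map would presumably share structure between versions).

```python
from functools import reduce
from types import MappingProxyType


def with_count(counts, word):
    """Return a NEW read-only mapping with word's count incremented."""
    updated = {**counts, word: counts.get(word, 0) + 1}
    return MappingProxyType(updated)


words = "the cat the dog the bird".split(" ")
empty = MappingProxyType({})
counts = reduce(with_count, words, empty)

print(counts["the"])  # 3
print(len(empty))     # 0, the starting map was never modified
```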
Like Perl, guji makes regexes first-class literals with a dedicated match
operator ~~ — but it returns a typed Option[Match], so absence is in the type
system rather than a magic undef:
match 'ada@example.com' ~~ /(?<user>\w+)@(?<host>\w+)/ {
    Some($m) {
        $u = $m<user>.unwrap_or('?')
        print("user $u") # user ada
    }
    None { print("no match") }
}
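For contrast, Python's re.search signals absence by returning None, which a type checker sees as Optional[re.Match]. A rough sketch of the same branch-on-presence logic; the function name describe_user is an assumption made for this example:

```python
import re
from typing import Optional

PATTERN = re.compile(r"(?P<user>\w+)@(?P<host>\w+)")


def describe_user(text: str) -> str:
    m: Optional[re.Match] = PATTERN.search(text)
    if m is None:  # plays the role of the None arm
        return "no match"
    return f"user {m.group('user')}"  # plays the role of the Some arm


print(describe_user("ada@example.com"))  # user ada
print(describe_user("no at sign here"))  # no match
```

Nothing at runtime forces the caller to test for None before calling m.group. Forgetting the check fails only when a non-matching input arrives, which is the kind of error an Option type is meant to surface earlier.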
And like Raku — this is guji's headline feature — PEG grammars are a built-in type, not a library, with the same syntax growing out of regex:
grammar Email {
    rule TOP     { <user> '@' <domain> }
    token user   { \w+ }
    token domain { \w+ '.' \w+ }
}
# Email.parse('ada@example.com') yields Some(Bush);
# $b<user>.text is "ada"
A guji grammar parses to a Bush — a typed parse tree whose captures are
sub-nodes, not flat strings — and because production names are known at compile
time, $b<user> is checked before the program ever runs. Where Raku's grammars are
powerful but dynamic, guji's are powerful and statically verified.
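To make "captures are sub-nodes, not flat strings" concrete, here is a small Python sketch. It is an analogy, not guji's Bush type: Node, EmailTree, and parse_email are names invented for this example, and any static checking comes from an external type checker such as mypy rather than from the language itself.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """One capture: the matched text plus where it sits in the input."""
    text: str
    start: int
    end: int


@dataclass(frozen=True)
class EmailTree:
    """Parse result with one typed field per production."""
    user: Node
    domain: Node


_EMAIL = re.compile(r"(?P<user>\w+)@(?P<domain>\w+\.\w+)")


def parse_email(text: str) -> Optional[EmailTree]:
    m = _EMAIL.fullmatch(text)
    if m is None:
        return None
    return EmailTree(
        user=Node(m.group("user"), *m.span("user")),
        domain=Node(m.group("domain"), *m.span("domain")),
    )


tree = parse_email("ada@example.com")
assert tree is not None
print(tree.user.text)     # ada
print(tree.domain.start)  # 4
```

Because the fields are declared on the class, a misspelling such as tree.usr can be flagged by a type checker before the program runs, which is roughly the property claimed for guji's grammars.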
The functional-first style means transformations compose left-to-right through a single mechanism — the dot — with no separate pipeline operator:
@nums.filter({ $_ % 2 == 0 }).map({ $_ * 2 }).sum() # 24, for [1..6]
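As a quick check of the arithmetic, here is the same filter, map, and sum written in Python, assuming [1..6] means the integers 1 through 6 inclusive:

```python
nums = range(1, 7)  # 1, 2, 3, 4, 5, 6

# Keep the evens (2, 4, 6), double them (4, 8, 12), then add them up.
total = sum(n * 2 for n in nums if n % 2 == 0)

print(total)  # 24
```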
guji even borrows from outside the Perl family: its concurrency is Go-style
(hatch for tasks, channels for communication), with the twist that every value
crossing a channel is immutable — so the data-race class of bugs is structurally
impossible.
The arc
Read across forty years and a shape emerges. Perl made text processing native and expressive, and paid in ambiguity. Raku made grammars native, and paid in size. Python made everything explicit and obvious, and paid by pushing regexes back into a library. guji is a deliberate synthesis: Perl's first-class regex, Raku's first-class grammars, Python's "one obvious way," wrapped in immutability and static types so that the parser the language hands you is also the parser the compiler can check.
Larry Wall once described Perl as a postmodern language — one that embraces every context and dialect at once. guji is something closer to the opposite instinct: it believes that for reading text, there should be exactly one good way, and the language should be the one to provide it.