Text Processing Through the Ages: Perl, Raku, Python, and guji

Four decades of teaching computers to read: from Perl's sigils and regexes to Raku's grammars, Python's batteries, and guji's first-class parsing.


Every programming language eventually has to answer the same humble question: how do you pull meaning out of a pile of text? Log lines, CSV files, email headers, HTML, half-broken data dumps — the world arrives as strings, and someone has to turn it into structure. The tools each language reaches for tell you a great deal about its values. Here is that story across four languages, ending with one you have probably never heard of.

Perl: the text-processing language

When Larry Wall released Perl 1.0 to the comp.sources.misc Usenet newsgroup on December 18, 1987, he was building the missing rung between the shell and C. awk and sed were too small; C was too big. Perl glued them together and made regular expressions a first-class part of the syntax, not a library you imported.

That single decision shaped a generation of system administration. A whole program could be a perl -e one-liner. Here is a word-frequency counter, the kind of thing Perl was born to do:

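use feature 'say';   # say is not enabled by default

my %count;
$count{$_}++ for split /\s+/, "the cat the dog the bird";
say "$_=$count{$_}" for sort keys %count;
# prints one pair per line, in sorted key order:
# bird=1, cat=1, dog=1, the=3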

Three sigils carry the type system on their sleeve — $ scalar, @ array, % hash — and /\s+/ is just there, no import re required. Wall, a trained linguist, designed Perl to feel like a natural language: redundant, expressive, forgiving, with idioms for every dialect of programmer. The famous motto is TMTOWTDI — "There's More Than One Way To Do It." But even Wall knew the danger of that freedom. As he put it:

"Although the Perl Slogan is There's More Than One Way to Do It, I hesitate to make 10 ways to do something." — Larry Wall

That tension — expressive power versus too many ways — is the thread that runs through everything below.

Raku: grammars as a language feature

Perl 6 was announced in 2000 as a clean-slate redesign. It took so long, and diverged so far, that in October 2019 Larry Wall blessed renaming it Raku, formally splitting it from Perl 5 as a sister language. (The first stable spec, 6.c "Christmas", had shipped on December 25, 2015 via the Rakudo implementation.)

Raku kept the sigils but made them invariant: @array[0] stays @, with no more shimmering between $ and @ on access. Then it did something genuinely new: it promoted regular expressions into full grammars. A Raku grammar is a class whose methods are named rules, composable and recursive, capable of parsing real languages:

grammar Email {
    token TOP    { <user> '@' <domain> }
    token user   { \w+ }
    token domain { \w+ '.' \w+ }
}

say Email.parse('ada@example.com')<user>;   # ｢ada｣

This was a quiet revolution. Most languages treat parsing as a heavyweight ceremony — bring in a parser generator, write a .y file, run a build step. Raku said: parsing is text processing, and text processing is the language's job. The cost was complexity; Raku is enormous, and that ambition is exactly why it took fifteen years to stabilize.
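For readers more at home in Python, the idea of named, composable rules can be loosely approximated by assembling one regular expression from named fragments. This is only a sketch: the RULES table and the compile_rule helper are hypothetical names invented for illustration, and a plain regex cannot express recursive rules the way a Raku grammar can.

```python
import re

# Hypothetical rule table mirroring the Email grammar above.
# <name> inside a rule body refers to another rule, loosely like Raku's <user>.
RULES = {
    "TOP": r"<user>@<domain>",
    "user": r"\w+",
    "domain": r"\w+\.\w+",
}


def compile_rule(name: str) -> str:
    """Expand <ref> placeholders into named groups (no recursive rules)."""
    body = RULES[name]
    return re.sub(
        r"<(\w+)>",
        lambda m: f"(?P<{m.group(1)}>{compile_rule(m.group(1))})",
        body,
    )


EMAIL = re.compile(compile_rule("TOP"))

match = EMAIL.fullmatch("ada@example.com")
if match is not None:
    print(match.group("user"))    # ada
    print(match.group("domain"))  # example.com
```

The sketch breaks down quickly: referencing the same rule twice would produce duplicate group names, and nested structure is flattened into one match object. Those limits are roughly what a built-in grammar type removes.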

Python: text processing with batteries included

Guido van Rossum first released Python in February 1991, and it grew up with the opposite philosophy from Perl. Where Perl said TMTOWTDI, the Zen of Python (printed by import this) says "There should be one-- and preferably only one --obvious way to do it." Regexes are deliberately not in the syntax; they live in the re module, visible and explicit.

import re
from collections import Counter

text = "the cat the dog the bird"
counts = Counter(re.findall(r"\w+", text))
print(counts.most_common(1))   # [('the', 3)]

Python's bet was readability and a deep standard library — "batteries included." You reach for str.split, re, csv, json, collections.Counter. It is not as terse as Perl and has no native grammars like Raku, but it is obvious, and a team can maintain it. For two decades that trade-off won the data-wrangling world, and today most text munging that used to be a Perl one-liner is a Python script.
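As a small illustration of the "batteries included" style, the sketch below chains three standard-library modules named above (csv, collections.Counter, json) to summarize a tiny table. The sample data and the status_counts helper are made up for this example.

```python
import csv
import io
import json
from collections import Counter

# Made-up sample data standing in for a log export.
RAW = """host,status
web1,200
web2,500
web1,200
web3,404
web2,500
"""


def status_counts(text: str) -> Counter:
    """Count how often each status value appears in a CSV string."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(row["status"] for row in reader)


counts = status_counts(RAW)
print(json.dumps(counts, sort_keys=True))  # {"200": 2, "404": 1, "500": 2}
```

Nothing here is terse, but each step names the module doing the work, which is the trade-off the paragraph above describes.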

guji: text and grammars as primitives

Which brings us to guji — a young, statically typed, functional-first language that treats text processing as its signature capability. guji studied this whole lineage and made a pointed set of choices. It keeps Perl's sigils but, like Raku, makes them invariant and part of the name: $x is always scalar, @xs always a list, %m always a map. It adopts Python's creed almost verbatim — "one obvious way" — and pairs it with immutable-by-default bindings.

The Perl word-counter above becomes a chain of immutable transforms. Values do not mutate; a "modified" map is a new map, and only a binding declared mut (here %counts) can be rebound to it:

sub main(): Int {
    $text = "the cat the dog the bird"
    @words = $text.split(" ")
    mut %counts = {}
    for $w in @words {
        $n = %counts.get($w).unwrap_or(0)
        %counts = %counts.set($w, $n + 1)
    }
    print(%counts.get("the").unwrap_or(0))   # 3
    0
}
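The same "a modified map is a new map" discipline can be imitated in Python, although the language does not enforce it. In this sketch the bump helper is a hypothetical name, and MappingProxyType only provides a read-only view; it is a stand-in for a real persistent map, not an equivalent.

```python
from functools import reduce
from types import MappingProxyType
from typing import Mapping


def bump(counts: Mapping[str, int], word: str) -> Mapping[str, int]:
    """Return a new read-only mapping with word's count increased by one."""
    return MappingProxyType({**counts, word: counts.get(word, 0) + 1})


words = "the cat the dog the bird".split(" ")
empty: Mapping[str, int] = MappingProxyType({})

# Each step produces a fresh mapping; no existing mapping is modified.
counts = reduce(bump, words, empty)

print(counts["the"])  # 3
```

Copying the whole dictionary on every step costs time proportional to its size, so this is a teaching sketch rather than something to run over a large corpus.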

Like Perl, guji makes regexes first-class literals with a dedicated match operator ~~ — but it returns a typed Option[Match], so absence is in the type system rather than a magic undef:

match 'ada@example.com' ~~ /(?<user>\w+)@(?<host>\w+)/ {
    Some($m) {
        $u = $m<user>.unwrap_or('?')
        print("user $u")     # user ada
    }
    None { print("no match") }
}
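Python offers a rough analogue: re.search returns either a match object or None, and the type stubs describe that as an Optional, so a static checker such as mypy can flag an unchecked use. The sketch below assumes nothing about guji; the user_of helper is a hypothetical name.

```python
import re
from typing import Optional

EMAIL = re.compile(r"(?P<user>\w+)@(?P<host>\w+)")


def user_of(address: str) -> Optional[str]:
    """Return the user part, or None when the pattern does not match."""
    m = EMAIL.search(address)
    if m is None:  # absence is an ordinary value, checked explicitly
        return None
    return m.group("user")


print(user_of("ada@example.com"))  # ada
print(user_of("no at sign here"))  # None
```

The difference is one of enforcement: Python runs the unchecked version happily and fails at runtime, whereas the text above describes guji rejecting it through the type system.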

And like Raku — this is guji's headline feature — PEG grammars are a built-in type, not a library, with the same syntax growing out of regex:

grammar Email {
    rule  TOP    { <user> '@' <domain> }
    token user   { \w+ }
    token domain { \w+ '.' \w+ }
}

# Email.parse('ada@example.com') yields Some(Bush);
# $b<user>.text is "ada"

A guji grammar parses to a Bush — a typed parse tree whose captures are sub-nodes, not flat strings — and because production names are known at compile time, $b<user> is checked before the program ever runs. Where Raku's grammars are powerful but dynamic, guji's are powerful and statically verified.
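To make "captures are sub-nodes, not flat strings" concrete, here is a hand-written Python sketch of a parse tree for the same Email grammar. The Node class, the _token helper, and parse_email are all hypothetical names; this is not how guji implements a Bush, only an illustration of the shape of the result.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Node:
    """A parse-tree node: the matched text plus named child nodes."""
    name: str
    text: str
    end: int  # index just past the matched text
    children: Dict[str, "Node"] = field(default_factory=dict)

    def __getitem__(self, key: str) -> "Node":
        return self.children[key]


_USER = re.compile(r"\w+")
_DOMAIN = re.compile(r"\w+\.\w+")


def _token(name: str, pattern: "re.Pattern[str]", s: str, pos: int) -> Optional[Node]:
    """Match pattern at pos and wrap the result in a leaf node."""
    m = pattern.match(s, pos)
    return Node(name, m.group(), m.end()) if m else None


def parse_email(s: str) -> Optional[Node]:
    """Parse user '@' domain; return None unless the whole string matches."""
    user = _token("user", _USER, s, 0)
    if user is None or not s.startswith("@", user.end):
        return None
    domain = _token("domain", _DOMAIN, s, user.end + 1)
    if domain is None or domain.end != len(s):
        return None
    return Node("TOP", s, len(s), {"user": user, "domain": domain})


tree = parse_email("ada@example.com")
if tree is not None:
    print(tree["user"].text)    # ada
    print(tree["domain"].text)  # example.com
```

In this Python version a misspelled key such as tree["usr"] only fails at runtime with a KeyError. That is the gap the paragraph above says guji closes by checking production names at compile time.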

The functional-first style means transformations compose left-to-right through a single mechanism — the dot — with no separate pipeline operator:

@nums.filter({ $_ % 2 == 0 }).map({ $_ * 2 }).sum()   # 24, for [1..6]

guji even borrows from outside the Perl family: its concurrency is Go-style (hatch for tasks, channels for communication), with the twist that every value crossing a channel is immutable — so the data-race class of bugs is structurally impossible.
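The channel-plus-immutable-messages pattern can be approximated in Python with threads, a queue.Queue, and frozen dataclasses. This is a sketch under assumptions: LineCount and worker are hypothetical names, hatch has no direct Python counterpart, and Python's immutability here is shallow and opt-in rather than guaranteed by the language.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class LineCount:
    """Immutable message: assigning to a field raises FrozenInstanceError."""
    source: str
    words: int


def worker(source: str, text: str, out: "queue.Queue[LineCount]") -> None:
    """Count words and send the result over the queue as a frozen value."""
    out.put(LineCount(source, len(text.split())))


results: "queue.Queue[LineCount]" = queue.Queue()
jobs = {"a": "the cat the dog", "b": "the bird"}

threads = [
    threading.Thread(target=worker, args=(name, text, results))
    for name, text in jobs.items()
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One message per worker; the receiver can read each one but not modify it.
totals = {}
for _ in threads:
    msg = results.get()
    totals[msg.source] = msg.words

print(sorted(totals.items()))  # [('a', 4), ('b', 2)]
```

Because the sender cannot alter a LineCount after putting it on the queue, the receiver never observes a half-updated message. Python relies on the programmer to keep every message type frozen, while the text above describes guji making that a property of every channel.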

The arc

Read across forty years and a shape emerges. Perl made text processing native and expressive, and paid in ambiguity. Raku made grammars native, and paid in size. Python made everything explicit and obvious, and paid by pushing regexes back into a library. guji is a deliberate synthesis: Perl's first-class regex, Raku's first-class grammars, Python's "one obvious way," wrapped in immutability and static types so that the parser the language hands you is also the parser the compiler can check.

Larry Wall once described Perl as a postmodern language — one that embraces every context and dialect at once. guji is something closer to the opposite instinct: it believes that for reading text, there should be exactly one good way, and the language should be the one to provide it.