The Perverse Beauty of Regex: A Love Letter to the World’s Most Efficient Torture Device

It starts innocently. You just want to search for something. A word, a phone number, perhaps the occasional email address belonging to your former classmate who still owes you £60 for books. You could do this the normal way, with a search box like a civilised human, or you could instead plunge headfirst into the surreal wonderland of regular expressions, a language inspired by neuroscientists trying to understand the human brain, and which has since evolved into one of computing’s favourite instruments of both elegance and suffering.

Regex. A word that strikes fear, reverence and mild nausea into the hearts of developers everywhere. It is the linguistic equivalent of looking at a Jackson Pollock painting and pretending you understand. At first glance, it appears like line noise on a modem. After an hour, it looks like line noise on a modem. Come back six months later and it still looks like line noise on a modem, only now you resent yourself for ever thinking you could tame it.

And yet, for many of us, regex is irresistible. It is powerful, precise, cryptic, and capable of both astonishing feats of pattern recognition and absolute carnage if written incorrectly. It is the string-matching version of summoning a demon: you can call forth unimaginable power, but slip up even slightly and you’ll spend days cleaning blood from the ceiling.

To appreciate regex properly, we must first visit its unexpected roots. Like many unhinged things in computer science, regular expressions were not invented with the intention of ruining people’s weekend debugging sessions. They emerged from research into neurophysiology. In the 1950s, mathematician Stephen Kleene set out to describe what McCulloch and Pitts’s model nerve nets could compute, and in formalising their behaviour with finite automata and “regular events” he accidentally birthed the theoretical foundation for a pattern-matching mechanism that programmers now use to validate emails, extract data, and occasionally trigger existential crises. Imagine going to a neuroscience conference and returning home with the basis of every developer’s worst nightmare. That is regex in a nutshell.

Oddly enough, modern regex isn’t even regular. A true regular language is computationally simple enough to be recognised by a finite automaton. That means predictable. Efficient. Nice. Modern regex engines, however, are the linguistic equivalent of giving a bread knife to a toddler and saying, “Technically this is safe if they behave.” With the introduction of features like backreferences, lookbehinds, lookaheads and recursive patterns, many regex engines now recognise languages far beyond the regular ones, and matching a pattern with backreferences is NP-hard in the general case. In other words, a regex can theoretically take longer to finish than you have left to live. A more helpful phrasing might be: regex can theoretically ruin your life.

Before diving into the madness, let us look at a harmless example. Something gentle, warm, the computational equivalent of a chamomile tea. Suppose we want to match the word “cat”.

cat

Yes, that works. No special symbols. No tears. No internal screaming. You could even do this without regex at all, but here we are. Now suppose we want to match “cat” and “cats”.

cats?

That question mark means “optional”, which is delightful in a polite way, as though the regex itself is saying, “One may bring the plural if one wishes, but one would not be frowned upon for arriving singular.” After twenty minutes of this, you will find yourself craving complexity. You crave pain.

This is where things get interesting. Logic sneaks in. Choices appear. You are creating branching paths like some deranged Choose Your Own Adventure where every ending involves a stack overflow. Before long you have built the entire taxonomy of mammalian life in one regex, and you haven’t seen daylight in three hours.
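For the record, the branching in question is alternation, written with a pipe. A minimal sketch in Python follows; the pattern and the sample sentence are invented here for illustration and are not from any real taxonomy.

```python
import re

# Hypothetical pattern: three animals, each with an optional plural "s".
# (?:...) groups the alternatives without capturing them.
ANIMALS = re.compile(r"\b(?:cat|dog|ferret)s?\b")

text = "Two cats, one dog and a caterpillar walked past three ferrets."
found = ANIMALS.findall(text)
print(found)  # ['cats', 'dog', 'ferrets']
```

The word boundaries are what keep “caterpillar” out of the zoo. Remove them and the taxonomy gets considerably less rigorous.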

Regex, like real wildlife, must be treated carefully. Some patterns are perfectly tame. A few characters, a dash here or there, maybe a quantifier or two, and everything behaves. But introduce catastrophic backtracking, and what was once a harmless kitten becomes a tiger with a taste for CPU cycles. Catastrophic backtracking occurs when a poorly constructed regex meets a deceptively simple input, and the engine decides to try every possible path, every permutation, every branch of its logical tree, until your programme collapses like an Edwardian aunt at the sight of an exposed ankle.

Consider this charming tragedy of a regex:

(a+)+$

Now feed it the string aaaaaaaaaaaaaaaaab. The regex tries to carve up all those a’s in so many combinations it could rewrite the Mahabharata while it’s at it. Your machine begins wheezing. Fans spin. Your room grows warm. You hear faint screams as the CPU melts. This is known as ReDoS, Regular Expression Denial of Service, proving that regex can be used not only to parse data but also as a blunt instrument of cybercrime. It is the linguistic equivalent of sending someone a glitter bomb.
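If you would like to hear the fans for yourself, here is a small sketch using Python’s built-in re module, which is a backtracking engine. The flat pattern a+$ is an assumption added here for comparison: it accepts the same inputs without the nested quantifier. The input lengths are kept deliberately small, and the timings will differ from machine to machine.

```python
import re
import time

EVIL = re.compile(r"(a+)+$")  # nested quantifiers: many ways to split the a's
TAME = re.compile(r"a+$")     # same accepted inputs, only one way to split


def time_search(pattern, text):
    """Return (match_or_None, seconds_taken) for one search."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start


for n in (10, 14, 18):
    text = "a" * n + "b"  # the trailing b guarantees failure
    _, evil_seconds = time_search(EVIL, text)
    _, tame_seconds = time_search(TAME, text)
    print(f"n={n:2d}  nested: {evil_seconds:.4f}s  flat: {tame_seconds:.6f}s")
```

On a backtracking engine the nested version should slow down sharply with every few extra characters, while the flat one barely notices. Raising n much past 25 is not recommended unless you need to heat the room.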

Not that regex is all doom, ruin and hair loss. Used properly it can be a scalpel, slicing data with elegant precision. Need to extract phone numbers from a million documents? Regex. Want to validate an IPv4 address? Regex. Desire to verify whether a string is a floating-point IEEE 754 representation? Regex, though if this is what you're doing with your spare time, one suspects something has gone catastrophically wrong in your life.

Some people, driven by an intoxicating cocktail of ego and madness, have even created regex patterns that test for prime numbers, provided the number is written in unary as a row of 1s. Using nothing but characters like ^, $, +, ?, parentheses and a single backreference. At this point regex has become less of a tool and more of an elaborate cry for help.
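The best-known version of this trick works on numbers written in unary, and it matches the numbers that are not prime. The sketch below is one way to wrap it in Python; the helper name is_prime is made up for this example.

```python
import re

# Matches a string of 1s whose length is NOT prime:
#   1?          length 0 or 1
#   (11+?)\1+   a block of two or more 1s repeated at least twice,
#               i.e. the length has a non-trivial divisor
NOT_PRIME = re.compile(r"1?|(11+?)\1+")


def is_prime(n: int) -> bool:
    """Hypothetical helper: primality test by regex, for small n only."""
    return NOT_PRIME.fullmatch("1" * n) is None


print([n for n in range(30) if is_prime(n)])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The backreference does the trial division. It is also spectacularly slow for large numbers, which is part of its charm.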

It is worth noting that regex as we know it was not always fashionable. It was Perl that turned regex into an icon, wrapping it into the core language like an affectionate parasite. Perl popularised the chaotic syntax we recognise today, and many engines still advertise themselves as “PCRE”, Perl Compatible Regular Expressions, as if compatibility with one of the most cryptic programming languages known to humanity is a selling point.

Yet Perl’s contribution cannot be overstated. Without it, regex might still look like something from an academic paper, reserved for mathematicians and socially suspect graduate students. Instead, it became mainstream. JavaScript, Python, Ruby, C#, Java, they all absorbed regex, much in the same way one absorbs a parasite after swimming in an unchlorinated pond. And like parasites, regex occasionally causes hallucinations.

For example, let us attempt the completely reasonable task of matching an email address. You might write something innocent, like:

.+@.+\..+

And then you will discover that this matches delicious@apple.pie, @@puppy.com, not an email@all.really and quite likely your emotional baggage if encoded as ASCII. You refine, tweak, iterate, and before long you have birthed something unspeakable:

^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}$

This looks impressive until someone shows you that it waves through john..doe@sub.domain.co.uk, which the standard forbids, and turns away a@b.c, which is cursed but syntactically fine, and suddenly one of your eyes begins to twitch. Fully compliant email regexes are famously monstrous. There exists an RFC 5322-compliant regex on the internet, created by someone who has almost certainly spilled paint thinner on their frontal lobe. It is technically correct, which is the worst kind of correct.
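To see where each of the two email patterns above falls over, it helps to run them side by side. This is a rough sketch in Python; the sample addresses are chosen here purely for illustration.

```python
import re

NAIVE = re.compile(r".+@.+\..+")
STRICTER = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,63}$")

samples = [
    "delicious@apple.pie",         # plausible shape, accepted by both
    "@@puppy.com",                 # nonsense, but the naive pattern loves it
    "john..doe@sub.domain.co.uk",  # double dot slips past the stricter one
    "a@b.c",                       # one-letter TLD trips the {2,63} rule
    "veryold@",                    # no domain at all, rejected by both
]

for address in samples:
    naive_ok = NAIVE.fullmatch(address) is not None
    strict_ok = STRICTER.match(address) is not None
    print(f"{address!r:32} naive={naive_ok!s:5} stricter={strict_ok}")
```

Neither pattern knows whether the mailbox exists, of course. For that you still have to send an email and wait.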

If email validation is climbing a hill, Unicode emoji matching is climbing Everest naked with a broken ankle. Emojis are not simple characters. They are compositions of symbols, skin tones, modifiers, zero-width joiners, flags constructed using pairs of regional indicator characters, and family sequences representing every conceivable arrangement of adults and children. The official regex for “all emojis” is not a pattern so much as a Lovecraftian scripture. You don’t read it. You survive it.

Despite all this, regex carries an almost addictive pleasure. There is nothing quite like crafting a pattern that plucks exactly what you want from a text. It feels like witchcraft. You utter incantations like (?<=foo)(bar)(?!baz) and suddenly the forbidden word appears, hovering splendidly at the edge of logic. A lookbehind asserts what comes before, a lookahead asserts what follows, and you feel as though you have achieved enlightenment or possibly schizophrenia. Eventually the two become indistinguishable.
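Here is that incantation at work, as a small Python sketch. The sample string is invented for the occasion.

```python
import re

# "bar" only when preceded by "foo" and not followed by "baz".
GUARDED_BAR = re.compile(r"(?<=foo)(bar)(?!baz)")

text = "foobar foobarbaz bazbar foobarn"
hits = [(m.start(), m.group(1)) for m in GUARDED_BAR.finditer(text)]
print(hits)  # [(3, 'bar'), (27, 'bar')]
```

Neither lookaround consumes any text: the match itself is only ever the three letters of “bar”, and the surrounding words are merely inspected.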

Some programming communities have even turned regex into a kind of sport. Puzzles, contests, time trials. People compete to solve pattern-matching challenges as if this were not a warning sign. One might watch a grown adult celebrate because they matched nested parentheses without recursion, and realise there is truly no hope left for humanity.

Let us step back and reflect on why regex inspires such devotion. It is because regex is concise. Obscenely concise. It compresses concepts like “five to eight hexadecimal characters not beginning with zero unless followed by exactly three letters or a digit” into a single unreadable cluster of punctuation. It is information density achieved through obfuscation.

Example. Suppose you want to find every word beginning with “th” and ending in “ing”, but not containing the letter “r”. With regex you might write:

\bth(?![^ ]*r)[a-zA-Z]*ing\b

To read this pattern is to lose your sense of self. But run it and suddenly you have extracted “thinking”, “throbbing” (wait, no, that one has an r), “thriving” (has r again), “thanking” (allowed), “thumping” (allowed?), depending entirely on how cruel you want to be with character classes. Regex gives you godlike power over text, assuming you remember what you were trying to do.
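For anyone keeping score, this is roughly what the pattern does to that list of words in Python. Note the assumption baked into the lookahead: [^ ]* stops only at a space, so this sketch feeds it words separated by single spaces.

```python
import re

# "th", then no "r" anywhere before the next space, then letters, then "ing".
NO_R_THING = re.compile(r"\bth(?![^ ]*r)[a-zA-Z]*ing\b")

text = "thinking throbbing thriving thanking thumping"
survivors = NO_R_THING.findall(text)
print(survivors)  # ['thinking', 'thanking', 'thumping']
```

Separate the words with newlines or tabs instead and the lookahead will peer into the next word in search of an “r”, which is exactly the sort of cruelty the character class was warning you about.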

Now consider recursion. Some regex engines support it. With recursion, you can match arbitrarily nested constructs like parentheses, XML tags, or human despair. Given a string such as (a(b(c)d)e), the regex might look like:

\((?:[^()]+|(?R))*\)

This pattern is a mirror staring back at itself. It matches balanced parentheses. It also matches the dark recesses of your soul. Recursion lifts regex out of the regular languages and into context-free territory, which means you can technically write a regex that parses JSON or checks arithmetic expressions. If you ever meet someone doing this, please take away their keyboard for the good of society.
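Python’s built-in re module does not support (?R), so the sketch below swaps in a different technique entirely: a plain depth counter that pulls out the outermost balanced groups. It is not a regex and it is not equivalent in every case (on unbalanced input it gives up where a recursive pattern might still find an inner pair). The function name balanced_groups is made up for this example.

```python
def balanced_groups(text: str) -> list[str]:
    """Return each outermost balanced (...) group, scanning left to right."""
    groups = []
    depth = 0
    start = 0
    for index, char in enumerate(text):
        if char == "(":
            if depth == 0:
                start = index  # remember where the outermost group opens
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:index + 1])
    return groups


print(balanced_groups("x (a(b(c)d)e) y (f)"))  # ['(a(b(c)d)e)', '(f)']
```

It lacks the mystique of a pattern that calls itself, but it does have the advantage that someone else can read it.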

After spending this much time in regex theory, it is healthy to ground ourselves in practicality. Let us imagine a text containing thousands of addresses, written with the chaotic inconsistency of a landlord documenting repairs after losing half their memory to fumes. We want to extract postcodes.

A typical pattern for UK postcodes looks like this:

[A-Z]{1,2}[0-9R][0-9A-Z]?\s*[0-9][A-Z]{2}

This single expression respects the idiosyncrasies of British geography, which says everything you need to know about both regex and Britain. It matches SW1A 1AA, EC1V 9LB, and a depressingly accurate representation of Britain’s inner chaos. The space may or may not exist because of course it may or may not. Regex handles this ambiguity with quiet, judgemental grace.
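A minimal sketch of the extraction in Python follows. The landlord’s notes are invented, and the word boundaries on either end of the pattern are an addition made here so that it does not latch onto the middle of longer strings.

```python
import re

# The postcode pattern from above, with \b added at both ends (an assumption).
POSTCODE = re.compile(r"\b[A-Z]{1,2}[0-9R][0-9A-Z]?\s*[0-9][A-Z]{2}\b")

notes = "Boiler fixed at SW1A 1AA. Damp at EC1V9LB (again). Keys lost near M1  1AE."
postcodes = POSTCODE.findall(notes)
print(postcodes)  # ['SW1A 1AA', 'EC1V9LB', 'M1  1AE']
```

Note that the matches come back with whatever spacing the landlord used. Normalising them to a single space is a job for a second, smaller regret.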

There is something sublime about using symbols like {3}, [^,], and \s+ to control information. Regex allows you to reach into a haystack and pluck out exactly the right needle, provided you remember the difference between greedy and lazy quantifiers. If you forget, regex will take everything including the barn, the horse and half the neighbouring village.

Greedy quantifiers match as much as possible. Lazy quantifiers match as little as possible. Observe the difference:

greedy:   <.*>
lazy:     <.*?>

Feed them the input <cat><dog>. The greedy version consumes everything like a spiteful vacuum cleaner, matching <cat><dog> in one go. The lazy version neatly captures <cat>, then <dog>, showing restraint. One feels like a polite dinner guest, the other like a raccoon with a shopping trolley.
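The same experiment, sketched in Python for those who prefer evidence to metaphor:

```python
import re

text = "<cat><dog>"

greedy = re.findall(r"<.*>", text)   # .* grabs as much as it can
lazy = re.findall(r"<.*?>", text)    # .*? stops at the first possible >

print(greedy)  # ['<cat><dog>']
print(lazy)    # ['<cat>', '<dog>']
```

A third option, [^>]* in place of .*?, gets the same result here without relying on laziness at all, and is often the safer habit.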

Regex does not merely parse text. It exposes patterns we did not know existed. It teaches us that humans are inconsistent. That data is messy. That no matter how carefully we plan, there will always be that one edge case who registers with an email address written in ancient Sumerian.

It is tempting, after such contemplation, to view regex as malicious. But that is unfair. Regex is merely unforgiving. It is like a piano. Beautiful music may emerge, or you may slam your hands down and unleash atonal horror. The instrument doesn’t care. It waits patiently for the next victim.

Imagine a developer attempting to write a regex to validate dates. A noble goal. A doomed endeavour. The pattern grows like ivy, twisting across lines, accumulating parentheses, reaching sentience. They insist on accounting for leap years. This requires knowing that a year is a leap year if it is divisible by four, except centuries, unless the century is divisible by four hundred, except if you’re Pope Gregory and feel like rewriting the entire calendar. Their regex becomes a scripture of madness:

^((19|20)\d\d)-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$

This checks the shape of a date but not the calendar: it waves through 30 February and 31 April, and it cannot tell whether 29 February falls in a leap year. To handle that, the developer contemplates backreferences, conditional subpatterns, and by the end is rocking gently under their desk whispering, “Time is an illusion.” Sometimes the best regex is the one you don’t write.
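One way of not writing it, sketched in Python: let the regex check the shape and hand the calendar to the standard library. The function name is_real_date is hypothetical, and the pattern is the one above with the anchors dropped in favour of fullmatch.

```python
import re
from datetime import date

# Shape only: a year in 1900-2099, a month 01-12, a day 01-31.
DATE_SHAPE = re.compile(r"((?:19|20)\d\d)-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])")


def is_real_date(text: str) -> bool:
    """Hypothetical helper: regex for the shape, datetime for the calendar."""
    match = DATE_SHAPE.fullmatch(text)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


for candidate in ("2024-02-29", "2023-02-29", "1900-02-29", "2023-04-31"):
    print(candidate, is_real_date(candidate))
```

Pope Gregory’s rules are already inside the date constructor, where they can be somebody else’s problem.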

Yet we keep writing them anyway. We call them “one-liners” even when they stretch to several screens. We embed them in code comments like cursed runes that warn future maintainers: venture here and despair. We paste them into Slack with a feeling of triumph, knowing full well nobody will dare ask how they work because nobody wants to be responsible for maintaining them.

Regex is beautiful because it demands respect. It rewards precision. It punishes carelessness with gleeful brutality. It stands as a reminder that language and logic are deeply intertwined, and that sometimes the shortest code is also the most incomprehensible.

If you’ve read this far, you are either curious, masochistic, or stuck on a train without Wi-Fi. Either way, you now know far more about regex than you probably intended. You may even feel the urge to open an editor and experiment. If so, here is your final test.

You must extract all lines from a file that contain a number between 13 and 666, inclusive, followed by the word "goat". You might write:

\b(1[3-9]|[2-9][0-9]|[1-5][0-9]{2}|6[0-5][0-9]|66[0-6])\s+goat\b

If you stared at that expression and thought, “Yes, that seems fine,” congratulations. You are now too deep to turn back.
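For those who would rather check than stare, a brute-force sketch in Python: try every number from 0 to 999 and see which ones the pattern lets through. The sample lines at the end are invented.

```python
import re

GOAT = re.compile(
    r"\b(1[3-9]|[2-9][0-9]|[1-5][0-9]{2}|6[0-5][0-9]|66[0-6])\s+goat\b"
)

# Which numbers in 0..999 does the pattern accept in front of "goat"?
accepted = [n for n in range(1000) if GOAT.fullmatch(f"{n} goat")]
print(accepted[0], accepted[-1], len(accepted))  # 13 666 654

# Filtering lines, as the task asks.
lines = ["13 goat", "666 goat", "667 goat", "12 goat", "1666 goat", "42 goats"]
kept = [line for line in lines if GOAT.search(line)]
print(kept)  # ['13 goat', '666 goat']
```

One caveat worth knowing: the leading \b treats a decimal point as a boundary, so a line containing 3.14 goat would match on the 14. Whether that counts as a bug depends on your goats.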

Regex is not good or evil. It is a mirror. Some see clean digital structure. Others see raw chaos and arcane symbols. All of us, eventually, see both. And as you continue your journey through this syntactical labyrinth, remember one thing.

It never gets easier. You just get weirder.

References:

Davis, J.C., Coghlan, Ch.A., Servant, F., and Lee, D. (2018). The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2018). Association for Computing Machinery, New York, NY, USA, 246–256. Available at: https://doi.org/10.1145/3236024.3236027 (Accessed: 4 December 2025).

Wikipedia contributors (2025). Perl Compatible Regular Expressions. In Wikipedia, The Free Encyclopedia. Available at: https://en.wikipedia.org/w/index.php?title=Perl_Compatible_Regular_Expressions&oldid=1318155013 (Accessed: 4 December 2025).

Kleene, S. C. (1956). Representation of Events in Nerve Nets and Finite Automata. RAND Corporation, 15 December. Available at: https://www.rand.org/content/dam/rand/pubs/research_memoranda/2008/RM704.pdf (Accessed: 4 December 2025).

Regular Expression 101 (2019). Regex101: build, test, and debug regex. Available at: https://regex101.com/ (Accessed: 4 December 2025).
