A story of one regexp

I find regexp writing very entertaining. When I need to perform some boring text manipulation - let’s say extracting URLs from raw logs - I’d rather spend 10 minutes tweaking a regexp than the same amount of time doing it manually.

Back in my PHP days, I was lucky to write hundreds of regular expressions for matching URL paths, tweaking HTML, or validating user input. I’ve heard horror stories about dramatic regexp incidents, but I’ve never experienced anything like that myself. Until this year.

Not a database problem

This year, I’ve been working a lot on a large Java monolith known for its severe performance issues.

One of these issues was reported by a user who couldn’t open an entity report screen in one of their projects. Normally, the GET endpoint responsible for the report responds in 1-3 seconds, which is acceptable1 for that use case. In the user’s case, it was taking 18Ā minutes, which obviously resulted in a timeout error.

The entity in question had hundreds of dependent entities and nested configurations. The team was pretty sure the slowdown was due to some ORM cruelties resulting in billions of database queries. Indeed, the telemetry trace confirmed 196 SELECT queries, but nothing particularly slow. The response was getting stuck right before the end of the processing, without any meaningful tracing information.

I followed the code for quite some time until I finally found it, the topic for this article!

The problem

The entity reports are generated in markdown using a custom heading ID syntax. These heading IDs are replaced by HTML anchors when rendering the report to HTML2. A quick replaceAll with a regexp is invoked to perform the replacement.

If you like regular expressions as much as I do, try to solve this problem using a regexp in JavaScript. Find the last occurring word in square brackets in each line. In the example below, ref_heading_1 and ref_another_heading should be matched.

The original code used something like this (click to reveal):

It suffers from the problem known as ā€œcatastrophic backtracking,ā€ the same issue that brought down Cloudflare six years ago3.

In the regexp above, the [^\n]* part greedily matches the entire line and then backtracks character by character in an attempt to find a \[(\w+)\] match. When executed as is, without any optimizations, this results in exponential time complexity.

18 min vs 7ms

Before fixing the problem, I want to take a quick detour and share how that poor regexp is executed by different languages.

I took ~80k lines of markdown with 14 matching square brackets and ran this regexp on my M2 mac. Try to guess my results:

This is not a proper benchmark by any means, but now you know why I didn’t have any problems with regular expressions early in my career.

Apparently, PHP has always been very tolerant to poor regexp design by using a highly optimized PCRE library under the hood. Since 7.3, it uses PCRE2 with JIT, which makes regexp execution blazing fast, as you could see in my mini-benchmark.

On the other hand, Java does not rely on any library by default. It provides its own regexp implementation, which is far less optimized than specialized libraries. I could literally follow all the backtracking steps in the debugger to see how bad it was. As soon as I installed a specialized library4, I’ve got acceptable performance, although it was still nowhere near PCRE2 written in C.

If you check the leaderboard of the Java One Billion Row Challenge, you’ll notice that beyond a certain threshold, all solutions are essentially C code disguised as Java. Such optimizations must be as entertaining as regexp, but not for a sane production codebase.

A better regexp

Could we rewrite this regexp to achieve acceptable performance?

I usually start from the part I really want to match and then build the rest of the regexp around it. In this case I need a word starting with ref_ in square brackets:

\[(ref_\w+)\]

This will match all words in square brackets, but we only need the last one. We can specify that none of the next characters until the new line can be a square bracket:

\[(ref_\w+)\][^\[]*\n  āœ…

Figured out a better one? Share in comments!

The fix

As I said, I find regexp writing very entertaining. Still, the production code is not the right place for that. The mix of escaped square brackets makes this regexp hard to read and review.

Instead, I wrote a simple loop that iterates over the lines of the input text, looks for the specific pattern [ref_ using String.contains, and only if a match is found, invokes the replaceFirst method.

This solution is much faster than any regexp I tried, and also much easier to maintain. Now, time to look into those 196 SELECT queries 😈

Footnotes

  1. Acceptable for that particular system in that particular phase of its maintenance cycle šŸ¤“ ↩

  2. A more common format would be {#heading-id} but migrating would require much more effort: https://www.markdownguide.org/extended-syntax/#heading-ids ↩

  3. Details of the Cloudflare outage on July 2, 2019 (Appendix: About Regular Expression Backtracking) ↩

  4. RE2/J is a port of C++ library RE2 to pure Java. ↩


Profile

Hi, I'm Kate

I’m a Full Stack Developer and Engineering Mentor, obsessed with regular expressions, books, and web technologies. In my work, I mix old with new, soft with hard, cats with dogs. When it’s not a disaster, it’s pure magic!

← Building web in 2025