A Better Grep

The standard implementations for searching with regular expressions (e.g. grep) have exponential behavior on some input. For example, using ``numerical constraints'' it is easy to make grep one-liners that consume over 8 GB of memory. There are polynomial-time algorithms for a subset of regular expressions with numerical constraints. Is it possible to combine the (usually extremely efficient) optimisations and extensions done in grep with algorithms that do not have exponential behavior?

The standard tools for searching with regular expressions include an extension called numerical constraints. It is also a part of XML Schema. Using normal finite automata to search for such patterns leads to an exponential blow-up in the number of states. These automata are used since they allow fast searching of large amounts of text. It is easy to observe the exponential memory consumption of e.g. GNU grep. However, in 2008 and 2009, three papers describe algorithms that are polynomial in the search string, while still linear or quadratic in the search text. But the algorithms are not in widespread use, and the major reason is that noone has combined them with the optimisations present in tools like grep. Without these, performance on normal expressions is assumed to be too bad for large-scale use.

It is possible to take this master in both practical and theoretical directions, although there will be at least some theory to read up on and understand. It can be an almost purely theoretical thesis, only describing and combining algorithms, or it can focus on implementation and details of tools like grep.


Emneord: regular expressions, text search, algorithms
Publisert 6. juli 2018 12:55 - Sist endret 6. juli 2018 12:55