A corpus of regular expressions

For a project of mine I need a lot of Python regular expressions for testing. I tried searching the web for "regular expression corpus" and similar terms, but because regular expressions are a popular tool for analyzing large corpora of text, the results were mostly about that and not much help.

I found RegExLib.com, which even provides a web service for accessing the almost 4000 regular expressions they have collected, but the service seems to be broken.

Then I wrote a small script that uses GitHub code search to find Python regular expressions. The script uses regular expressions to find regular expressions. As every CS student should know, this is theoretically impossible: regular expression syntax allows arbitrarily nested groups, so it is not itself a regular language. So I cheated by matching the arguments of functions that are, with high probability, regular expressions (re.match, re.search, ...). This can go wrong if someone (re)defines these functions in an unfortunate way, but that is a risk I am willing to take. Because GitHub limits the number of search results, I got only 1300 regexes out of it.
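To give an idea of the cheat, here is a minimal sketch of that kind of matching. It is not the actual script: the names CALL_RE and extract_regexes are made up for illustration, and it assumes the pattern is written as a single-line string literal in the first argument of the call.

```python
import re

# Hypothetical pattern (an assumption, not the original script): match the
# first argument of common re.* calls when it is a plain or raw string
# literal that starts and ends on the same line.
CALL_RE = re.compile(
    r"""\bre\.(?:compile|match|search|fullmatch|findall|finditer|split|sub|subn)\(\s*"""
    r"""[rR]?(?P<quote>["'])(?P<pattern>(?:\\.|(?!(?P=quote)).)*)(?P=quote)"""
)


def extract_regexes(source):
    """Return string literals that look like first arguments of re.* calls."""
    return [m.group("pattern") for m in CALL_RE.finditer(source)]


if __name__ == "__main__":
    sample = 'if re.match(r"\\d+", line): pass'
    print(extract_regexes(sample))
```

This misses triple-quoted strings, patterns built by concatenation, and patterns stored in a variable first, and it is fooled by anything else that happens to be called re. For a rough corpus that seemed acceptable.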

I was a bit surprised by how simple most of these regexes were. Many of them were just plain text, and few used more than the basic features. Scraping seems to be a popular use for regexes.
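One rough way to put a number on "just text" is to count the patterns that contain no metacharacter at all. This is only a sketch under that assumption; is_plain_text and plain_text_share are hypothetical helper names, and the check ignores subtleties such as verbose mode.

```python
# Characters with special meaning in Python's re syntax.
METACHARACTERS = set(".^$*+?{}[]\\|()")


def is_plain_text(pattern):
    """True if the pattern contains no regex metacharacters at all."""
    return not (set(pattern) & METACHARACTERS)


def plain_text_share(patterns):
    """Fraction of patterns that are plain text (0.0 for an empty list)."""
    if not patterns:
        return 0.0
    return sum(is_plain_text(p) for p in patterns) / len(patterns)
```

Calling plain_text_share on the collected list gives the share of regexes that could have been a plain string comparison.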

If you are interested, grab the code here
