User:QEDK/GSoC 2020/Making regex in Python "just" work
Quite honestly, my summer working with Wikimedia has taught me more about regex than all the time I had used it before, although until now that was mostly when someone hollered at me to write something. In any case, that involved going to regexr and actually verifying that my regex was doing what I thought it was doing; badly written regular expressions will probably be a bottleneck (or downfall, who knows!) in a lot of textual applications. Despite my general viewpoint of "can we not use regex!", it's actually more widely used and more helpful than it gets credit for (enough that I actually put effort into learning it). Every now and then, I still mix up my ?s and +s, causing some amount of snafu and me messaging "why is {1,} working here but not ?" into our Zulip stream, followed by a decent bit of embarrassment when I realize my mistake a minute later.
The real feat, I think, is trying to parse markup with regex, which is probably a level of hell no one should have to face (Dante wrote about this, it's true). More often than not, when you want a solution to "just" work in one specific case, regex is actually a really good choice. Not so much if you're writing a general-case parser; please don't do that. Chances are someone wrote one already and yes, someone probably wrote one with regex as well.
Learning regex
If you haven't thought of learning regex, you probably don't need to learn it yet, but it's a good skill to pick up regardless: it lets you do the really weird greps and text matches that you would otherwise need multiple, complicated text matches for. Unfortunately, there's no catch-all way to learn regexes, in my opinion at least. I would say the easiest way is to pick it up gradually: open up regexr or regex101 and try writing your own solution for your use-cases. Both of these sites have cheatsheets and references built in, as well as explanations of the pattern you write and the text that gets matched. I recommend you keep the cheatsheet open and try to write it yourself.
And in the case of Python, the documentation is your best friend. In fact, a lot of standard library modules use regex for a lot of things (configparser, for example). A complex regex is often easier to write than the equivalent code, which could span multiple functions; there is a loss of readability, but writing properly documented regex goes some way to alleviate that issue.
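As a rough sketch of what "properly documented regex" can look like in Python, the re.VERBOSE flag lets a pattern span several lines and carry comments. The date pattern and the parse_date helper below are made up for illustration; they are not from any particular project.

```python
import re

# Hypothetical pattern for dates written as YYYY-MM-DD. With re.VERBOSE,
# whitespace inside the pattern is ignored and "#" starts a comment.
DATE_PATTERN = re.compile(
    r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
    """,
    re.VERBOSE,
)


def parse_date(text):
    """Return (year, month, day) as ints, or None if no date is found."""
    found = DATE_PATTERN.search(text)
    if found is None:
        return None
    return int(found.group("year")), int(found.group("month")), int(found.group("day"))
```

The pattern does the same job it would on one line, but each piece now explains itself to whoever reads it next.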
You never know when regex will save the day!
"match" and "search"
When I started off with Python's re, this was probably my biggest mistake, because the two functions behaved differently and I didn't figure out why. The documentation makes it "amply" clear but that didn't stop me from using match where there was no reason to use it. In fact, the documentation literally has an example: https://docs.python.org/3/library/re.html#search-vs-match but I missed it all the same.
The crux is that match only matches from the beginning of the string, while search actually searches the whole string for a match. The name match implies it just looks for a match, but that's a bit off. In any case, I was misled because the PCRE flavour of regex basically implements the behaviour of search by default, so I had no reason to believe that match would be unfit for the job (an act of pure folly). Soon enough, a friendly soul pointed me to the documentation after I complained about the regex function not working and told me what was going wrong (and thus, the regex show goes on).
>>> re.match("def", "abcdef") # No match
>>> re.search("def", "abcdef") # Match
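To make the difference a little more concrete, here is a small sketch that runs the same string through match, search and fullmatch. The variable names are only for illustration.

```python
import re

text = "abcdef"

# match() only tries at position 0, so "def" is not found.
at_start = re.match("def", text)

# search() scans forward through the string and finds "def" at index 3.
anywhere = re.search("def", text)

# An explicit ^ anchor makes search() behave like match() for this input.
anchored = re.search("^def", text)

# fullmatch() requires the entire string to match the pattern.
partial = re.fullmatch("abc", text)
whole = re.fullmatch("abcdef", text)
```

Here at_start, anchored and partial are all None, while anywhere and whole are match objects.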
Named groups
Regex by default allows groups, and newer flavours typically allow named groups as well. Using index-based references, you can use a capturing group like (abc) and then use a back-reference like \1 to refer to the first matching group. Similarly, a non-capturing group would look like (?:abc). Handy, right?
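A minimal sketch of both ideas, using made-up sample strings: a back-reference to catch a doubled word, and a non-capturing group that groups without taking up a group number.

```python
import re

# (\w+) captures a word; \1 then requires exactly the same text again.
doubled = re.search(r"\b(\w+) \1\b", "this is is a typo")
repeated_word = doubled.group(1)  # "is"

# The same back-reference works in a replacement string.
fixed = re.sub(r"\b(\w+) \1\b", r"\1", "this is is a typo")

# A capturing group shows up in groups(); a non-capturing one does not.
capturing = re.search(r"(ab)+(c)", "ababc").groups()        # ("ab", "c")
non_capturing = re.search(r"(?:ab)+(c)", "ababc").groups()  # ("c",)
```

Raw strings (the r prefix) keep \1 and \b from being treated as Python string escapes.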
It's even easier, and more Pythonic, to use named groups. Let's say you're trying to get someone's username from their email address (not a good idea but hey, good enough for an example); you would do something like:
matchobj = re.match("^(?P<id>[^@]+(?=@.+))", email)
That's a fairly complicated regex (for all the wrong reasons), so let's just clear it up:
- ^ matches the beginning of a string and if in MULTILINE mode (signified by the re.M flag in Python), it matches the beginning of every line.
- ?P<id> is to name the capture group (all things inside parentheses) for later.
- [^@] is a character class for all characters except @ (the "except" part of it being the ^); while not necessarily accurate for a validator, it works fine for our purpose.
- + signifies the above token matches one or more times.
- ?= makes a positive lookahead, to ensure our match is actually followed by an at sign and some text after that (signified by @.+) but doesn't actually match that part itself.
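The pieces above can be poked at directly. This is only a sketch with invented addresses, showing that the lookahead is checked but not consumed and what re.M changes for ^.

```python
import re

PATTERN = "^(?P<id>[^@]+(?=@.+))"

# The lookahead checks for "@..." but leaves it out of the match.
found = re.match(PATTERN, "alice@example.org")
matched_text = found.group(0)  # "alice"
match_end = found.end()        # 5, the index of the "@"

# Without an "@" the lookahead fails, so there is no match at all.
no_at_sign = re.match(PATTERN, "alice")

# With re.M, ^ also matches at the start of each line.
two_lines = "alice@example.org\nbob@example.org"
first_only = re.findall(PATTERN, two_lines)
every_line = re.findall(PATTERN, two_lines, flags=re.M)
```

Because the pattern has a single group, findall returns the text of that group for each match.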
Now, we can easily extract the ID from the match object like:
user_id = matchobj.group("id")  # avoid shadowing the built-in id()
Note that while this is fine, matchobj itself will be None (of NoneType) if the regex has no match, in which case calling .group() on it raises an AttributeError, so it's important to keep code safety in mind.
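One way to keep that safe is to check for None before touching the match object. The helper name extract_username is hypothetical, and this is a sketch rather than a proper email parser.

```python
import re


def extract_username(email):
    """Return the text before the @, or None when the pattern does not match."""
    matchobj = re.match("^(?P<id>[^@]+(?=@.+))", email)
    if matchobj is None:
        # re.match() returns None on failure, so there is nothing to extract.
        return None
    return matchobj.group("id")
```

Callers can then handle None explicitly instead of running into an AttributeError somewhere further down.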
While we didn't strictly need a named group for this purpose, it's more forthcoming about what your regex is trying to get at; the more transparent your code, the better it will be supported in the long run.
Compilation
If you're using a lot of regex in a single program, you should ideally compile it. While there isn't a hard limit, if you are using more than a "few" patterns, you should compile them so that your performance doesn't take a hit. The loss itself should be negligible but it adds up in the long run, especially since compiling itself is so simple.
compiled = re.compile(r"\w*compilethis\w*", flags=re.M)  # raw string so \w is not treated as a string escape
And then use the compiled regex to get matches like:
matchobj = compiled.search(string)
The documentation says that the compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry. In most use-cases, you probably won't need it, but if you're using complex regular expressions multiple times, you should definitely take advantage of compilation.
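A small sketch of the usual shape this takes: compile once at module level, then reuse the pattern object in a loop. The sample lines and the name WORD_WITH_TOKEN are invented for the example.

```python
import re

# Compiled once, reused for every line below.
WORD_WITH_TOKEN = re.compile(r"\w*compilethis\w*", flags=re.M)

lines = [
    "nothing here",
    "please precompilethisnow thanks",
    "compilethis",
]

hits = []
for line in lines:
    matchobj = WORD_WITH_TOKEN.search(line)
    if matchobj is not None:
        hits.append(matchobj.group(0))
```

The compiled object has the same search, match, findall and sub methods as the module-level functions, just without the pattern argument.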
That's about it from me, and if you read the docs, you are now officially better than me at regex (admittedly not a high bar to meet). Keep in mind that the re module is a treasure trove of helpful functions that will do your job for you, including very simple but important things like substitution and escaping. So, go on and spread the regex movement and remember to tell people to not write regex parsers for XHTML.
This is what happened to the last person who wrote regex to do that.
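For the substitution and escaping helpers mentioned earlier, a quick sketch with re.sub and re.escape follows; the sample strings are made up.

```python
import re

# re.sub() replaces every match; here, runs of whitespace become one space.
cleaned = re.sub(r"\s+", " ", "too    many \t spaces")

# Named groups can be referenced in the replacement with \g<name>.
swapped = re.sub(
    r"(?P<user>[^@\s]+)@(?P<host>\S+)",
    r"\g<host>: \g<user>",
    "qedk@example.org",
)

# re.escape() neutralises characters that are special in a pattern, which
# matters when the pattern comes from user input.
user_input = "1+1=2 (really?)"
unescaped_hit = re.search(user_input, user_input)           # "+" and "?" are operators here
escaped_hit = re.search(re.escape(user_input), user_input)  # matched literally
```

In this run unescaped_hit is None even though the text is searched for itself, while escaped_hit matches the whole string.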
- Want to be a new developer? See New Developers
- Want to interact with people of the Wikimedia Outreach community? Come visit us at Zulipchat.
- Want to begin learning Rust? Read The Book.
Do let me know in the comments if you have any suggestions! Next time, a progress report.