Regular expressions are notoriously difficult. But it's just like programming: any errors or bugs end up being the fault of the programmer. Good intentions give way to some small overlooked issue.
Parsing HTML
There are a lot of cool utilities for parsing HTML in C#. One of the most popular is the HTML Agility Pack, which lets you do all kinds of things with nice syntax.
Or, if you’re dumb like me, you can do everything with regular expressions.
A pretty simple example: we want to take the example HTML and parse out the first anchor tag’s URL and text. We could then use match.NextMatch() to continue matching and get the URL and text out of the next anchor tag.
So our Regex is set up: we search for anything .* before the <a href=, and then capture the value between the quotation marks. Continuing on, we match the end of the anchor tag, capture the text between the opening and closing tags, and then we’re done!
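Something along these lines, as a minimal sketch. The sample HTML and the class name are made up for illustration, and the capture groups here use negated character classes ([^"]* and [^<]*) so that the leading .* is the only part that decides which anchor gets matched.

```csharp
using System;
using System.Text.RegularExpressions;

class GreedyDemo
{
    // Hypothetical sample input (an assumption): two anchors on a single line.
    public const string Html =
        "<p><a href=\"https://www.google.com\">Google</a> " +
        "<a href=\"https://www.bing.com\">Bing</a></p>";

    // Leading .* (greedy), then group 1 = href value, group 2 = link text.
    public const string Pattern = ".*<a href=\"([^\"]*)\">([^<]*)</a>";

    static void Main()
    {
        Match match = Regex.Match(Html, Pattern);

        if (match.Success)
        {
            Console.WriteLine(match.Groups[1].Value); // captured URL
            Console.WriteLine(match.Groups[2].Value); // captured link text
        }
    }
}
```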
Let’s run it and…
… Oh. It grabbed the last anchor tag.
So I do the reasonable thing and check things out in my favorite Regex playground, Regex101.com.
Well… that looks… like it should work… The first match is Google, like we expect. But C# matches Bing?
Regex Engines
There are some minor differences in how regular expressions are evaluated across different languages. Regex101 can use different engines (or “flavors”) to evaluate the expressions. By default, it uses the PHP (PCRE) flavor, but it also supports JavaScript, Python, and Go. Note that, at the time of writing, C# is not an option.
So, C# is doing something different.
Lazy vs. Greedy
The issue, it turns out, lies in the .* – this matches any character (.), except a newline by default, zero or more times (*). Simple enough, but this is also a greedy match. That means the regex engine will take as many characters as possible while still allowing the rest of the pattern to match. In our case, the .* swallows as much of the input as it can and only backs off until the rest of the pattern fits, so it lands on the last <a href= instead of the first <a href=.
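To see the difference in isolation, here is a tiny sketch on a toy string. The input "aXbYb" and the class name are just illustrative choices, nothing from the HTML example.

```csharp
using System;
using System.Text.RegularExpressions;

class QuantifierDemo
{
    // Returns the text of the first match of pattern in input.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    static void Main()
    {
        string input = "aXbYb";

        // Greedy: .* runs to the end, then backs off to the LAST 'b'.
        Console.WriteLine(FirstMatch(input, "a.*b"));  // aXbYb

        // Lazy: .*? grows one character at a time and stops at the FIRST 'b'.
        Console.WriteLine(FirstMatch(input, "a.*?b")); // aXb
    }
}
```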
Luckily, we can switch to a lazy match by changing .* to .*?, which takes as few characters as possible and stops at the first (closest) place where the rest of the pattern matches. This is what we want: with that change, the first match is the Google anchor, as expected.
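As a rough sketch of the fixed version, using the same kind of made-up sample HTML (an assumption, along with the class name) and walking through every anchor with match.NextMatch():

```csharp
using System;
using System.Text.RegularExpressions;

class LazyDemo
{
    // Hypothetical sample input (an assumption): two anchors on a single line.
    public const string Html =
        "<p><a href=\"https://www.google.com\">Google</a> " +
        "<a href=\"https://www.bing.com\">Bing</a></p>";

    // Same idea as before, but the leading quantifier is lazy: .*?
    public const string Pattern = ".*?<a href=\"([^\"]*)\">([^<]*)</a>";

    static void Main()
    {
        Match match = Regex.Match(Html, Pattern);

        // NextMatch() resumes searching where the previous match ended.
        while (match.Success)
        {
            Console.WriteLine($"{match.Groups[2].Value} -> {match.Groups[1].Value}");
            match = match.NextMatch();
        }
        // Prints:
        // Google -> https://www.google.com
        // Bing -> https://www.bing.com
    }
}
```

If the input can span several lines, keep in mind that . does not cross newlines unless RegexOptions.Singleline is set, which also changes how far a greedy .* can reach.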
So I guess now you could come in on Saturday and parse HTML with regular expressions… that would be great.