C# Parsing HTML with regular expressions

Regular expressions are notoriously difficult. But it's just like programming: any errors or bugs end up being the fault of the programmer. Good intentions give way to some small overlooked issue.

Parsing HTML

There are a lot of cool utilities for parsing HTML in C#. One of the most popular is the HTML Agility Pack, which lets you do all kinds of things with nice syntax.

Or, if you’re dumb like me, you can do everything with regular expressions.

Pretty simple example – we want to just take the example HTML and parse out the first anchor tag’s url and text. We could then use match.NextMatch() to continue matching and get the text out of the next anchor tag.

So our Regex is set up: we search for anything .* before the <a href=, and then capture the value between the quotation marks. Continuing on, we match the end of the anchor tag, capture the text between the opening and closing tags, and then we’re done!
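As a rough sketch of the setup described above, something like the following would do it. The sample HTML, the variable names, and the exact pattern are assumptions made for illustration, not the original code. The HTML is assumed to sit on a single line, since . does not cross line breaks by default.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: two anchor tags on a single line.
string html = "<p><a href=\"https://www.google.com\">Google</a> <a href=\"https://www.bing.com\">Bing</a></p>";

// Greedy leading .* and then two capture groups:
// group 1 is the href value, group 2 is the text between the tags.
var greedy = new Regex(@".*<a href=""([^""]*)"">([^<]*)</a>");

Match match = greedy.Match(html);
string greedyUrl = match.Groups[1].Value;
string greedyText = match.Groups[2].Value;

Console.WriteLine($"{greedyText}: {greedyUrl}");
```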

Let’s run it and…

… Oh. It grabbed the last anchor tag.

So I do the reasonable thing and check things out in my favorite Regex playground, Regex101.com.

Well… that looks… like it should work… The first match is Google, like we expect. But C# matches Bing?

Regex Engines

There are some minor differences in how regular expressions are evaluated across different languages. Regex101 can use different engines (or “flavors”) to evaluate the expressions. By default, it uses PHP, but it also supports JavaScript, Python, and GoLang. Note that C# is not an option.

So, C# is doing something different.

Lazy vs. Greedy

The issue, it turns out, lies in the .* – this matches any character except a line break (.) zero or more times (*). Simple enough, but this is also a greedy match. That means the regex engine will take as many characters as it can while still letting the rest of the pattern match. In our case, that means it will skip everything until it finds the last <a href= instead of the first <a href=.

Luckily, we can switch to a lazy match by changing .* to .*?, which takes as few characters as possible and so stops at the first <a href= it finds. This is what we want: the first match is now the first anchor tag.
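Here is a minimal sketch of the lazy version, again with made-up sample HTML and names rather than the original code. It also uses match.NextMatch() to move on to the second anchor tag.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample input: two anchor tags on a single line.
string html = "<p><a href=\"https://www.google.com\">Google</a> <a href=\"https://www.bing.com\">Bing</a></p>";

// Lazy leading .*? consumes as little as possible,
// so it stops at the first <a href= that lets the rest match.
var lazy = new Regex(@".*?<a href=""([^""]*)"">([^<]*)</a>");

Match first = lazy.Match(html);
string firstText = first.Groups[2].Value;
string firstUrl = first.Groups[1].Value;
Console.WriteLine($"{firstText}: {firstUrl}");   // Google: https://www.google.com

// NextMatch() resumes searching where the previous match ended.
Match second = first.NextMatch();
string secondText = second.Groups[2].Value;
string secondUrl = second.Groups[1].Value;
Console.WriteLine($"{secondText}: {secondUrl}"); // Bing: https://www.bing.com
```

Since Regex.Match already scans forward for the first position where the pattern can match, the leading .*? could probably be dropped from this sketch altogether and the results would be the same.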

So I guess now you could come in on Saturday and parse HTML with regular expressions… that would be great.
