You should probably not be using regular expressions to parse HTML

  • HTML is not regular
  • Regexes may match today, but what about tomorrow?

Say you've got a file of HTML where you're trying to extract URLs from <img> tags.

<img src="http://example.com/whatever.jpg">

So you write a regex like this (in Perl):

if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}

In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:

<img src='http://example.com/whatever.jpg'>

or

<img src=http://example.com/whatever.jpg>

or

<img border=0 src="http://example.com/whatever.jpg">

or

<img
    src="http://example.com/whatever.jpg">

or you start getting false positives from

<!-- <img src="http://example.com/outdated.png"> -->
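The cases above can be checked directly. What follows is a minimal sketch in Python rather than Perl, using only the standard library; the pattern is a straight port of the regex shown earlier, and the names NAIVE_IMG, naive_extract, and samples are made up for this illustration.

```python
import re

# Python port of the Perl pattern from the text: /<img src="(.+)"/
NAIVE_IMG = re.compile(r'<img src="(.+)"')

def naive_extract(html):
    """Return the captured URL, or None when the pattern does not match."""
    match = NAIVE_IMG.search(html)
    return match.group(1) if match else None

# The HTML variants listed in the text, keyed by a short label.
samples = {
    "double-quoted": '<img src="http://example.com/whatever.jpg">',
    "single-quoted": "<img src='http://example.com/whatever.jpg'>",
    "unquoted": '<img src=http://example.com/whatever.jpg>',
    "extra attribute": '<img border=0 src="http://example.com/whatever.jpg">',
    "line break": '<img\n    src="http://example.com/whatever.jpg">',
    "commented out": '<!-- <img src="http://example.com/outdated.png"> -->',
}

for label, html in samples.items():
    print(f"{label}: {naive_extract(html)}")
```

Only the double-quoted form yields the URL. The next four variants return None, and the commented-out tag is reported as if it were live markup.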

Don't reinvent the wheel

Parsers are existing code that already works and has already been tested.

Your regex probably doesn't account for every case. Parsers have solutions for edge cases built in.
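As one possible illustration, here is a sketch that pulls src values out of <img> tags with Python's standard-library html.parser module. It is not the only parser you could pick, and the names ImgSrcCollector and img_urls are hypothetical, chosen for this example.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names and strips any
        # quotes from values, so quoting style does not matter here.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.urls.append(value)

def img_urls(html):
    """Return the list of <img> src values found in an HTML string."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.urls

print(img_urls("<img src='http://example.com/whatever.jpg'>"))
print(img_urls('<img border=0 src="http://example.com/whatever.jpg">'))
print(img_urls('<!-- <img src="http://example.com/outdated.png"> -->'))
```

The same few lines cope with single-quoted, unquoted, reordered, and multi-line attributes. A commented-out tag is passed to the parser's comment handler instead of being treated as an element, so it produces no URL.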

Why not parse with regexes?

You can't reliably parse HTML with regexes. HTML is not a regular language: elements nest to arbitrary depth, and a regular expression cannot keep track of arbitrary nesting. A regex matches patterns of characters. It has no notion of elements, attributes, or comments, so it is not equipped to break HTML down into its meaningful parts.
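One way to see the problem is with elements nested inside elements of the same name. The sketch below is in Python, standard library only, and the names NESTED, regex_div_contents, DepthTracker, and max_div_depth are hypothetical. It compares a non-greedy regex with a parser-based depth counter on the same input.

```python
import re
from html.parser import HTMLParser

# A <div> nested inside another <div>.
NESTED = "<div>outer <div>inner</div> tail</div>"

def regex_div_contents(html):
    """Try to capture the contents of the first <div> with a regex."""
    match = re.search(r"<div>(.*?)</div>", html, re.DOTALL)
    return match.group(1) if match else None

class DepthTracker(HTMLParser):
    """Track how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

def max_div_depth(html):
    """Return the deepest <div> nesting level seen in an HTML string."""
    tracker = DepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth

print(repr(regex_div_contents(NESTED)))  # 'outer <div>inner'
print(max_div_depth(NESTED))             # 2
```

The regex stops at the first closing tag and returns a fragment that cuts the outer element short. Making the pattern greedy only moves the failure elsewhere, because it then runs on through any sibling <div> that follows. The parser sees each start and end tag as an event, so counting depth is straightforward.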
