Perl uses HTML::Parser

The CPAN module HTML::Parser is the basis for most HTML parsing in Perl. There are other CPAN modules that do parsing, but the vast majority of them are just wrappers around HTML::Parser.
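As a rough illustration of what those wrappers build on, here is a minimal sketch of HTML::Parser's event-driven (version 3) API. It collects the text of every <h1> element. The sample document and the variable names ($html, @headings, $in_h1) are made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser;

# Sample input, invented for this sketch
my $html = '<html><body><h1>First</h1><p>Body text</p>'
         . '<h1>Second &amp; last</h1></body></html>';

my @headings;     # one entry per <h1> element
my $in_h1 = 0;    # true while the parser is inside an <h1>

my $parser = HTML::Parser->new(
    api_version => 3,
    # Called for each start tag; 'tagname' passes the lowercased tag name
    start_h => [ sub {
        my ($tag) = @_;
        if ( $tag eq 'h1' ) { $in_h1 = 1; push @headings, '' }
    }, 'tagname' ],
    # Called for each end tag
    end_h => [ sub {
        my ($tag) = @_;
        $in_h1 = 0 if $tag eq 'h1';
    }, 'tagname' ],
    # Called for text; 'dtext' passes the text with entities decoded.
    # Text can arrive in several chunks, so append rather than assign.
    text_h => [ sub {
        my ($text) = @_;
        $headings[-1] .= $text if $in_h1;
    }, 'dtext' ],
);

$parser->parse($html);
$parser->eof;    # signal end of document so buffered text is flushed

print "$_\n" for @headings;
```

The parser never builds a tree; it only fires callbacks as it scans the input, which is why higher-level modules layer their own data structures on top of it.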

Marpa::HTML

Marpa::HTML does "high-level" parsing of HTML. It allows handlers to be specified for elements, terminals and other components in the hierarchical structure of an HTML document. It is a completely liberal HTML parser: it never rejects a document, no matter how poorly that document fits the HTML standards.

The parsing method Marpa::HTML uses is totally new, as described in "How to Parse HTML", Parts one, two and three. Its Marpa::XS parse engine is in optimized C.

WWW::Mechanize

WWW::Mechanize is a handy module because it handles two common tasks associated with parsing HTML: fetching a remote document and extracting basic information from a document.

use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
# Fetch the document located at $url
$mech->get( $url );

Calling the get() method handles all the lower-level work of using LWP to fetch a page and then HTML::Parser to build up a useful object. This $mech object has numerous methods for accessing the page's data, either all at once or piece by piece.

# Get the text from the current object
my $text = $mech->text();

# Return all links (an array reference in scalar context)
my $links = $mech->links();

# Return all images (an array reference in scalar context)
my $images = $mech->images();

# Fetch the page title
my $title = $mech->title();

WWW::Mechanize also provides find_all_links() and find_all_images() for finding all the links and images that match certain criteria, such as:

# Find all links with link text of "Download"
my @links = $mech->find_all_links( text => 'Download' );

# Find all links whose URL looks like it might be a download
my @download_links = $mech->find_all_links( url_regex => qr/download/i );
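Each item returned by find_all_links() is a WWW::Mechanize::Link object, so the results still need to be unpacked. The following is a minimal sketch of doing that. To avoid depending on a live site it writes a small, invented HTML page to a temporary file and fetches it through a file:// URL; the page content and the variable names are assumptions for this example only.

```perl
use strict;
use warnings;
use File::Temp ();
use URI::file;
use WWW::Mechanize;

# Write an invented sample page to a temporary .html file.
# The .html suffix lets LWP report the content type as text/html.
my $tmp = File::Temp->new( SUFFIX => '.html' );
print {$tmp} <<'HTML';
<html><head><title>Releases</title></head><body>
<a href="/files/app-1.0.tar.gz">Download</a>
<a href="/about.html">About</a>
<a href="/download/nightly.zip">Nightly build</a>
</body></html>
HTML
close $tmp;

my $mech = WWW::Mechanize->new;
$mech->get( URI::file->new_abs( $tmp->filename )->as_string );

# Match on the exact link text
my @by_text = $mech->find_all_links( text => 'Download' );

# Match on the URL as it was written in the href attribute
my @by_url = $mech->find_all_links( url_regex => qr/download/i );

# Each result is a WWW::Mechanize::Link object
for my $link ( @by_text, @by_url ) {
    printf "%s -> %s\n", $link->text, $link->url;
}
```

Note that the two criteria select different links here: text matches what the user sees, while url_regex matches the href value.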

WWW::Mechanize::TreeBuilder

WWW::Mechanize::TreeBuilder is a combination of WWW::Mechanize and HTML::TreeBuilder, which brings the functionality of HTML::Element with it. Now it is possible to search by tag name or by attribute.

use v5.10;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new;
WWW::Mechanize::TreeBuilder->meta->apply($mech);

$mech->get( 'http://htmlparsing.com/' );

# Find all <h1> tags
my @list = $mech->find('h1');

# or, equivalently, with look_down()
my @same_list = $mech->look_down('_tag', 'h1');

# Now just iterate and process
foreach (@list) {
    say $_->as_text();
}

find() searches by tag name, whereas look_down() starts at $mech and looks through its element descendants (in pre-order) for elements matching the criteria you specify. In the above example we are using the internal attribute _tag to search for <h1> tags only. look_down() can also match on HTML attribute names and values, or be passed a coderef.
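The three kinds of look_down() criteria can be sketched as follows. This example calls look_down() on an HTML::TreeBuilder tree rather than on $mech, on the assumption that both expose the same HTML::Element method; the sample markup and variable names are invented for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample markup
my $html = <<'HTML';
<html><body>
<a class="external" href="http://example.com/">Example site</a>
<a href="/docs/annual-report.pdf">Annual report</a>
<a href="/docs/notes.txt">Meeting notes</a>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Attribute name with an exact string value
my @external = $tree->look_down( _tag => 'a', class => 'external' );

# Attribute name with a regex applied to its value
my @pdfs = $tree->look_down( _tag => 'a', href => qr/\.pdf$/ );

# Coderef: receives each candidate element, returns true to keep it
my @reports = $tree->look_down(
    _tag => 'a',
    sub { $_[0]->as_text =~ /report/i },
);

print 'external: ', $_->attr('href'), "\n" for @external;
print 'pdf:      ', $_->attr('href'), "\n" for @pdfs;
print 'report:   ', $_->as_text,      "\n" for @reports;

# HTML::Element trees should be deleted explicitly when done
$tree->delete;
```

All criteria in a single call must match for an element to be returned, so mixing _tag with an attribute test or a coderef narrows the search.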

xml_grep

The XML::Twig module includes the xml_grep utility, which can often be good enough. It doesn't parse, but finds local matches.

To do

  • Code examples
  • Other modules than HTML::Parser