Using the HTML::Parser module
Tassilo v. Parseval
Newcomers to Perl often want to know how to parse HTML, for instance to
extract the text between <p> and </p> tags, or to collect content
by assembling and following hyperlinks.
HTML is treacherous in that it looks as though it could be handled with just a
few regular expressions. Even when you slurp the whole file and work on
large strings, sooner or later regular expressions won’t be enough.
The HTML::Parser module provides powerful mechanisms for extracting content,
tags and tag attributes from any HTML stream.
Subclassing
The subclassing approach that HTML::Parser offers
is worth knowing as it is a general technique (used by other Perl modules
as well). The idea behind it requires only a bit of understanding of OOP
concepts.
HTML::Parser is a class that provides a few methods that you will be
using verbatim, such as parse() and parse_file() (plus eof() to signal
the end of the input). The parsing methods walk through the HTML, and
once they have identified a certain HTML construct (a start or end tag,
plain text etc.) they trigger methods (a bit like callbacks) and pass
them what they have identified. Those callback methods are the ones you
have to provide.
In order to make this whole thing work, you create a subclass of
HTML::Parser. This subclass will inherit all the methods from
HTML::Parser (most notably the various parse methods). Some methods,
however, you will have to override (that is: replace them so that they
suit your needs). Quite naturally, it makes sense to override the
callbacks because those are the parts you want to customize.
So take this subclass:
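A minimal sketch of such a subclass, assuming the class name MyParser that the text uses further down. The inheritance is set up here with use parent, which is one common way to do it; use base works as well.

```perl
package MyParser;
use strict;
use warnings;

# Inherit everything from HTML::Parser, including new(), parse()
# and parse_file(). Nothing is overridden yet.
use parent qw(HTML::Parser);

1;
```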
That’s a fully functional subclass of HTML::Parser. Now you create an
object of this class and see what happens when it parses a file:
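A sketch of the calling code, assuming the MyParser package lives in the same file and that a file named file.html exists in the current directory. The warning for a missing file is an addition for illustration.

```perl
package MyParser;
use strict;
use warnings;
use parent qw(HTML::Parser);    # the empty subclass from above

package main;

# new() is inherited from HTML::Parser. Called without arguments it
# sets the parser up to call the start(), end() and text() methods.
my $parser = MyParser->new;

# parse_file() is inherited as well; it returns undef if the file
# cannot be opened.
$parser->parse_file("file.html")
    or warn "Could not parse file.html: $!\n";
```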
When you put the two code fragments above in a file and run it, you’ll notice
that nothing appears to be happening.
But you’ll also notice that you don’t get any errors about calling
non-existent methods. That’s because you call two methods on $parser
that were inherited from HTML::Parser, namely new() and parse_file().
Further above I said that parse_file() would trigger those callbacks,
but seemingly it doesn’t do that (because nothing is happening). But
actually, MyParser::parse_file() does call them. As you did not override
them, it calls the default methods HTML::Parser::start(), end(), text()
etc. (after all, those methods were inherited by ‘MyParser’). Those
methods are empty (which can be confirmed when you have a look at the
source code of HTML/Parser.pm).
Providing methods
In order to make your parser do something useful, you provide those
methods yourself:
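A sketch of a counting parser along these lines. The counter names and the sample document in $html are assumptions for illustration; the number of text events it reports depends on the whitespace in the sample.

```perl
package MyParser;
use strict;
use warnings;
use parent qw(HTML::Parser);

# Counters for the three kinds of events we care about.
our ($text_elements, $start_tags, $end_tags) = (0, 0, 0);

# Override the empty default callbacks inherited from HTML::Parser.
sub text  { $text_elements++ }
sub start { $start_tags++ }
sub end   { $end_tags++ }

package main;

# Sample document (an assumption for illustration).
my $html = <<'EOHTML';
<html>
  <head>
    <title>Bla</title>
  </head>
  <body>
    Here is the body.
  </body>
</html>
EOHTML

my $parser = MyParser->new;
$parser->parse($html);    # parse() is inherited, just like parse_file()
# Calling $parser->eof afterwards would flush any text still buffered.

print "text elements: $MyParser::text_elements\n";
print "start tags:    $MyParser::start_tags\n";
print "end tags:      $MyParser::end_tags\n";
```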
In the previous example we used the parse_file() method to parse the
HTML in the file “file.html”, but here for clarity we use the parse()
method to parse the HTML contained in the $html variable.
So it appears MyParser::text() has been called 7 times (7 apparently
because HTML::Parser also considers white-space), start() and end()
four times each (which makes sense: you have <html>, <head>,
<title> and <body> plus their corresponding closing tags).
The above parser only does counting. But the callback methods (i.e.
text(), start() and end()) are called with arguments, which we chose to
ignore above. The first argument is always the ‘MyParser’ object (as
always with Perl methods). The additional arguments are those you are
really interested in: they are the broken-down elements of the HTML.
Next Parser
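A sketch of a parser that uses those arguments to pull links out of a document. The sample HTML and the @links array (kept so the result can be inspected afterwards) are assumptions for illustration.

```perl
package MyParser;
use strict;
use warnings;
use parent qw(HTML::Parser);

our @links;    # every href seen so far

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;

    return unless $tagname eq 'a';          # only <a> tags are of interest
    return unless defined $attr->{href};    # skip anchors without a target

    push @links, $attr->{href};
    print "URL found: $attr->{href}\n";
    print "  Attributes: @$attrseq\n";      # keys in their original order
}

package main;

# Sample document with four start tags, two of them links.
my $html = <<'EOHTML';
<html>
<body>
<a href="http://www.perl.com/" target="_blank">perl.com</a>
<a href="http://www.perl.org/">perl.org</a>
</body>
</html>
EOHTML

my $parser = MyParser->new;
$parser->parse($html);
$parser->eof;    # signal the end of the document
```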
The above is essentially a cheap link extractor. The interesting part is
the start-callback. It is called with five arguments: $self is the
object itself, $tagname is the name of the start tag, $attr is a
hash-reference containing the attributes as key/value pairs, $attrseq is
an array-reference which lists the attribute keys in the order in which
they appeared in the tag, and $origtext is, finally, the original text
as it appeared in the HTML snippet.
The start-callback will be called four times for the given HTML string.
It will only do something when it encounters an <a> tag. In this case it
looks up the value of the ‘href’ attribute in $attr and prints it.
Additionally, it prints all the attribute names from $attrseq in the
order in which they appeared. For the first <a> tag, this is
“href target”. For the second one, only “href”.
You simply ignore all the stuff you are not
interested in. The above parser doesn’t care about end-tags or plain
text. It only looks at the start-tags to find links in the HTML
document.
It’s quite easy to integrate more complicated logic into a parser, for
instance if you need to parse other documents when they are referenced
in an attribute. This parser can, for example, be made to work
recursively: whenever it encounters a link to another document, it
retrieves this document, parses it for more links and follows them as
well (until it has walked through the whole web ;-).
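A sketch of such a recursive parser, assuming LWP::Simple is installed for fetching documents and that every href is an absolute URL (relative links would first have to be resolved). The crawl only starts when a URL is passed on the command line.

```perl
package MyParser;
use strict;
use warnings;
use parent qw(HTML::Parser);

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    return unless $tagname eq 'a' && defined $attr->{href};
    print "URL found: $attr->{href}\n";

    # Fetch the linked document; get() returns undef on failure.
    require LWP::Simple;
    my $html = LWP::Simple::get($attr->{href});
    return unless defined $html;

    # Parse it with a fresh parser object, which follows its links in turn.
    my $parser = MyParser->new;
    $parser->parse($html);
    $parser->eof;
}

package main;

# Start crawling only if a URL was passed as the first argument.
if (defined(my $url = shift @ARGV)) {
    require LWP::Simple;
    my $html = LWP::Simple::get($url);
    if (defined $html) {
        my $parser = MyParser->new;
        $parser->parse($html);
        $parser->eof;
    }
}
```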
This parser will probably never stop because it doesn’t keep track of
the websites it has already parsed. However, it’s not very hard to prevent
infinite recursion:
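One way to do this is to remember every URL in a hash and to skip those already followed. The sketch below makes a few assumptions for illustration: fetch() is a hypothetical helper wrapping LWP::Simple::get(), and OfflineParser with its %pages table is a made-up stand-in for the web, so the example runs without network access.

```perl
package MyParser;
use strict;
use warnings;
use parent qw(HTML::Parser);

our %seen;    # URLs already followed, shared by all parser objects

# Hypothetical helper: fetch a document, undef on failure.
sub fetch {
    my ($self, $url) = @_;
    require LWP::Simple;
    return LWP::Simple::get($url);
}

sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    return unless $tagname eq 'a' && defined $attr->{href};

    my $url = $attr->{href};
    return if $seen{$url}++;    # already followed: stop here
    print "URL found: $url\n";

    my $html = $self->fetch($url);
    return unless defined $html;

    my $parser = (ref $self)->new;    # same class as the current parser
    $parser->parse($html);
    $parser->eof;
}

# Offline stand-in for the web: two pages that link to each other.
package OfflineParser;
use parent -norequire, 'MyParser';

my %pages = (
    'a.html' => '<a href="b.html">to b</a>',
    'b.html' => '<a href="a.html">back to a</a>',
);

sub fetch {
    my ($self, $url) = @_;
    return $pages{$url};
}

package main;

my $parser = OfflineParser->new;
$parser->parse($pages{'a.html'});
$parser->eof;
```

Although the two pages link to each other, each URL is followed only once and the run terminates. Using MyParser directly instead of OfflineParser would fetch real documents through LWP::Simple.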
Conclusion
Hopefully the above is already all you need to write your first
HTML::Parser-based program. It takes a little time to get used to event-based
approaches so you might want to experiment a bit with it. Once you have
grokked it, you’ll realize how convenient and powerful HTML::Parser is.
Tassilo