Retrieving web pages (LWP)
In this tutorial you will learn how to retrieve the source of web pages. The first example covers simply retrieving a page and storing it in either a variable or a file. The second example shows the more complex possibilities available.
Solution 1: LWP::Simple
This first example uses the very friendly LWP::Simple module. This module allows you to request a url and either store the HTML in a variable, print it, or write it to a file.
In this example we are retrieving the HTML into a variable:
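A minimal sketch of that first case might look like the following. The URL is only a placeholder, and the check for undef is an addition for illustration (get returns undef when the request fails).

```perl
use strict;
use warnings;
use LWP::Simple;

# Placeholder URL, chosen only for illustration
my $url = 'http://www.example.com/';

# get() returns the page content, or undef if the request failed
my $html = get($url);

if (defined $html) {
    print "Fetched ", length($html), " characters\n";
}
else {
    warn "Could not retrieve $url\n";
}
```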
The LWP::Simple module provides only a functional interface – that is, there is no object oriented interface to use.
You can also use LWP::Simple to print the web page source directly to STDOUT. It is exactly the same as the previous example except we use getprint instead of get.
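As a rough sketch, again with a placeholder URL, the getprint version could be written like this. Capturing the return value is optional; getprint returns the HTTP status code of the response.

```perl
use strict;
use warnings;
use LWP::Simple;

# Placeholder URL, chosen only for illustration
my $url = 'http://www.example.com/';

# getprint() writes the page source to STDOUT and returns the HTTP
# status code; on failure it reports the status line on STDERR
my $rc = getprint($url);
```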
The third example shows how to get the web page source and write it directly to a file, using LWP::Simple. It uses the getstore function, which writes the web page source directly to the given filename:
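One way this could look is sketched below. Both the URL and the filename page.html are placeholders, and the is_success check (exported by LWP::Simple) is an addition for illustration.

```perl
use strict;
use warnings;
use LWP::Simple;

# Placeholder URL and filename, chosen only for illustration
my $url  = 'http://www.example.com/';
my $file = 'page.html';

# getstore() saves the response body to $file and returns the
# HTTP status code of the response
my $rc = getstore($url, $file);

if (is_success($rc)) {
    print "Saved $url to $file\n";
}
else {
    warn "Request failed with status $rc\n";
}
```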
Solution 2: LWP
If you want to do more with the web page source than store it, you may want to consider using the full object-oriented LWP::UserAgent interface. The package Bundle::LWP contains the standard LWP modules that you will need.
Firstly, to start your script:
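A typical opening for such a script might be the following. The strict and warnings pragmas are an assumption here, added as general good practice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Load the object-oriented user agent class
use LWP::UserAgent;
```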
For the Lazy (this is a good thing), you most likely also want to use:
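Assuming the module meant here is HTTP::Request::Common (it is the one used later in this tutorial), the import could be written as:

```perl
use strict;
use warnings;

# Import only the POST helper for building request objects
use HTTP::Request::Common qw(POST);
```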
You can import the GET function instead if you do not need POST.
Define your User Agent
You then need to define your user agent:
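In its simplest form, with no constructor options, this might be:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Create a user agent with default settings
my $ua = LWP::UserAgent->new;
```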
This is the object that acts as a browser: it makes requests and receives responses.
Define the request
Next you need to create the request object that will be used to request the URL. Since we are using the HTTP::Request::Common module, we can use the exported POST function. It accepts a URL as its first parameter, followed by a list of arguments to be passed to the URL (e.g. form arguments).
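A bare POST request, with a placeholder URL and no form data, could be built like this. Nothing is sent at this point; the function only constructs an HTTP::Request object.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Build (but do not send) a POST request for a placeholder URL
my $req = POST 'http://www.example.com/';

# Show the request line and headers that would be sent
print $req->as_string;
```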
Or passing in form arguments:
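A sketch with form arguments follows. The URL and the field names q and lang are made up for illustration; the array reference is encoded as a standard URL-encoded form body.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical URL and form fields, chosen only for illustration
my $req = POST 'http://www.example.com/search',
    [ q => 'perl', lang => 'en' ];

# The form fields end up URL-encoded in the request body
print $req->content, "\n";
```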
The GET function is used in a similar way to the first example:
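For instance, assuming GET has been imported rather than POST, and using a placeholder URL:

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Build (but do not send) a GET request for a placeholder URL
my $req = GET 'http://www.example.com/';
```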
You can also pass header data to the GET and POST functions.
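As an illustration, headers are given as name and value pairs after the URL. The header values, URL and form field below are made up; for POST, the form data moves into a Content pair when headers are present.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET POST);

# GET with an extra header (illustrative value)
my $get = GET 'http://www.example.com/',
    'Accept' => 'text/html';

# POST with a header plus form data passed via Content
my $post = POST 'http://www.example.com/search',
    'Accept-Language' => 'en',
    'Content'         => [ q => 'perl' ];
```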
Making the request
Once you have defined your request object, use the UserAgent to make the request:
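Putting the previous pieces together, a minimal sketch with a placeholder URL might be:

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $ua  = LWP::UserAgent->new;
my $req = GET 'http://www.example.com/';   # placeholder URL

# Send the request; the return value is the response object
my $res = $ua->request($req);
```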
The request method returns an HTTP::Response object. This object contains the status code of the response and, if the request was successful, the content of the page.
The response
You can check whether the request was successful by using the is_success method:
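A sketch of that check is shown below, with a placeholder URL. Printing status_line in the failure branch is an addition for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $ua  = LWP::UserAgent->new;
my $res = $ua->request(GET 'http://www.example.com/');   # placeholder URL

if ($res->is_success) {
    # The page source is available through content()
    print $res->content;
}
else {
    # status_line gives the code and message, e.g. "404 Not Found"
    print STDERR $res->status_line, "\n";
}
```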
User Agents
If you want your program to identify itself as a particular agent, for example Mozilla 8.0, you can set this using the agent method:
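For example, taking Mozilla/8.0 as an illustrative agent string:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Sent as the User-Agent header on every request made with $ua
$ua->agent('Mozilla/8.0');
```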
Or, to identify as Internet Explorer:
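The exact string depends on the browser version you want to imitate. The one below is an assumed example in the style of Internet Explorer 6 on Windows XP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Assumed example string in the style of IE 6 on Windows XP
$ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
```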
Proxies
For whatever reason, you may want your requests to be made through a proxy. You can set different proxies for different protocols. Here is an example of setting a proxy for the ftp protocol:
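A sketch using the proxy method follows. The proxy address proxy.example.com:8080 is a made-up placeholder.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Route all ftp:// requests through a (hypothetical) proxy server
$ua->proxy('ftp', 'http://proxy.example.com:8080/');
```

The first argument can also be an array reference such as ['http', 'ftp'] to use the same proxy for several protocols.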
Cookies
Sometimes you will want your program to store the cookies created by retrieved web pages. The LWP bundle provides an HTTP::Cookies module that will handle cookies for you. You need to use this module:
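That is, alongside the user agent module:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Cookie storage class used by LWP::UserAgent
use HTTP::Cookies;
```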
And then set up a cookie_jar:
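A minimal sketch might look like this. The filename cookies.txt is a placeholder, and the autosave option is an assumption added so that the cookies are written back to the file automatically.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $ua = LWP::UserAgent->new;

# Attach a cookie jar backed by a (placeholder) file; autosave
# writes the cookies to that file when the jar is destroyed
$ua->cookie_jar(
    HTTP::Cookies->new(
        file     => 'cookies.txt',
        autosave => 1,
    )
);
```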
LWP::UserAgent will now automatically store the cookies in the specified file, and the cookies will be available to future requests.
SSL
If you are requesting any URLs using the SSL protocol (for example, an https page), you will first need to install an appropriate SSL module. The two modules supported by LWP are Crypt::SSLeay and IO::Socket::SSL. Older LWP releases prefer Crypt::SSLeay; from LWP 6 onward, HTTPS support is installed separately as LWP::Protocol::https, which relies on IO::Socket::SSL. Once the SSL support is installed, you can request SSL-encrypted URLs just like other URLs.
Working example
Below is a working script that requests a URL and, if successful, prints the contents to standard output.
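The following is a sketch assembled from the steps above rather than a definitive implementation. The URL and the agent string are placeholders, and printing the status line on failure is an addition for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

# Create the user agent and give it an illustrative agent string
my $ua = LWP::UserAgent->new;
$ua->agent('Mozilla/8.0');

# Build the request for a placeholder URL
my $req = GET 'http://www.example.com/';

# Make the request
my $res = $ua->request($req);

# Print the page source on success, or the status line on failure
if ($res->is_success) {
    print $res->content;
}
else {
    print STDERR $res->status_line, "\n";
}
```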