20111021 using perl to retrieve web pages - plembo/onemoretech GitHub Wiki

title: Using perl to retrieve web pages
link: https://onemoretech.wordpress.com/2011/10/21/using-perl-to-retrieve-web-pages/
author: lembobro
description:
post_id: 1373
created: 2011/10/21 17:02:37
created_gmt: 2011/10/21 21:02:37
comment_status: closed
post_name: using-perl-to-retrieve-web-pages
status: publish
post_type: post

Using perl to retrieve web pages

Here are a couple of examples. Both retrieve web pages from a site that is secured by HTTPS and some kind of authentication. The first assumes the page is protected by a login form, the second by Basic Authentication. Enjoy.

The basic code for this was stolen from IBM's Bret Swedeen's very excellent article, "Write a Perl script to automate Web-based logins", on the developerWorks site. It uses WWW::Mechanize and HTTP::Cookies (and all their dependencies) to get the job done. I had to reinstall most of those modules from source on my RHEL 5 desktop at work because the packaged versions were too old; I'm going to try this on RHEL 6 tonight to see if that can be avoided on a newer distro. Cookie handling is essential for form-based logins, because the site normally tracks the authenticated session with a cookie.

This is the script to handle form-based authentication:

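#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

my $HOME = $ENV{'HOME'};
my $outfile = "$HOME/retrieved_page.html";
my $hostname = "www.example.com";
my $urlpath = "mytools";

my $url = "https://$hostname/$urlpath/";
my $formpage = "login.html";
my $username = "[myname]";
my $password = "[mypass]";

# Mechanize plays the browser; the cookie jar keeps the login session.
my $mech = WWW::Mechanize->new();
$mech->cookie_jar(HTTP::Cookies->new());

# Fetch the login page and select the form that contains both fields.
# (form_name() matches a form's 'name' attribute, not the page name.)
$mech->get("$url$formpage");
$mech->form_with_fields('uname', 'upass');
$mech->field(uname => $username);
$mech->field(upass => $password);
$mech->click();

# Whatever the site returns after the login is what gets saved.
my $output_page = $mech->content();

open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";
print $out $output_page;
close($out) or die "Cannot close $outfile: $!";

__END__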
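
The form-based script depends on knowing the names of the form fields. As a minimal sketch of one way to list them, the snippet below parses a page with HTML::Form (a dependency of WWW::Mechanize) and prints each input's form name, type and field name. The HTML here is a made-up stand-in for a real login page, and `list_fields` is a hypothetical helper, not part of any module. Against a live site the same helper could presumably be fed the result of `$mech->forms` after a `get`.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Form;

# Hypothetical login page markup, standing in for what
# $mech->content() would hold after fetching the login page.
my $html = <<'HTML';
<html><body>
<form name="login" method="post" action="/mytools/login.html">
  <input type="text" name="uname">
  <input type="password" name="upass">
  <input type="submit" name="go" value="Log in">
</form>
</body></html>
HTML

# Return one line per named input: form name, input type, input name.
sub list_fields {
    my @forms = @_;
    my @lines;
    for my $form (@forms) {
        my $form_name = $form->attr('name');
        $form_name = '(unnamed)' unless defined $form_name;
        for my $input ($form->inputs) {
            my $name = $input->name;
            next unless defined $name;    # skip inputs without a name
            push @lines, sprintf "%s: %s %s", $form_name, $input->type, $name;
        }
    }
    return @lines;
}

# The base URI is required so relative form actions can be resolved.
my @forms = HTML::Form->parse($html, 'https://www.example.com/mytools/');
print "$_\n" for list_fields(@forms);
```

The names printed in the last column are the ones to pass to `$mech->field()`.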

A couple of things I want to point out about the form-based example. First, the fake URI would be "https://www.example.com/mytools/" and the URI for the login form, "https://www.example.com/mytools/login.html". Second, [myname] and [mypass] would clearly be replaced by real values like "myself" and "av0cado" (damn! another perfectly clever password lost to the Internets!). Third, and probably most important for this type of authentication, the field names you plug in (in the example 'uname' and 'upass') need to match the 'name' attribute of each field in the actual HTML form. You can discover these by looking at the HTML source in your browser.

Here is the Basic Authentication script. This probably could have been done with LWP::UserAgent alone, but I didn't have the time to figure that out, so I just modified the original script to make it work:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;

my $HOME = $ENV{'HOME'};
my $outfile = "$HOME/retrieved_page.html";
my $hostname = "www.example.com";
my $urlpath = "mydocs";

my $url = "https://$hostname/$urlpath/";
my $username = "[myname]";
my $password = "[mypass]";

my $mech = WWW::Mechanize->new();
$mech->cookie_jar(HTTP::Cookies->new());

# Register the credentials before the request; Mechanize supplies
# them for Basic Authentication, so no form handling is needed.
$mech->credentials($username => $password);
$mech->get($url);

my $output_page = $mech->content();

open(my $out, '>', $outfile) or die "Cannot write $outfile: $!";
print $out $output_page;
close($out) or die "Cannot close $outfile: $!";

__END__

Notice that here you don't need to know anything about the Basic Authentication details (user name and password attributes, realm, etc.); the module deals with that for you. The "$mech->click()" call is also no longer necessary with this kind of authentication.
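
The LWP::UserAgent-only route was left unexplored above. As a rough sketch of what it might look like, the snippet below builds the request by hand and sets the Basic Authentication header with `authorization_basic`. The helper names `build_request` and `fetch_page` are hypothetical, and HTTPS support is assumed to be installed (LWP::Protocol::https). Unlike the Mechanize version, this sends the credentials up front instead of waiting for the server to ask for them.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Build a GET request carrying a Basic Authentication header.
# authorization_basic() encodes "user:password" for the header.
sub build_request {
    my ($url, $username, $password) = @_;
    my $req = HTTP::Request->new(GET => $url);
    $req->authorization_basic($username, $password);
    return $req;
}

# Fetch the page and return its body, or die with the HTTP status line.
sub fetch_page {
    my ($url, $username, $password) = @_;
    my $ua  = LWP::UserAgent->new();
    my $res = $ua->request(build_request($url, $username, $password));
    die "Request failed: " . $res->status_line . "\n" unless $res->is_success;
    return $res->decoded_content;
}

# Only touch the network when all three arguments are supplied, e.g.
#   perl fetch_basic.pl https://www.example.com/mydocs/ myname mypass
# (A password on the command line is for illustration only.)
if (@ARGV == 3) {
    print fetch_page(@ARGV);
}
```

Splitting the request building from the fetch keeps the header logic easy to inspect without contacting a server.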

Copyright 2004-2019 Phil Lembo