.\" Automatically generated by Pod::Man 2.27 (Pod::Simple 3.28) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . 
ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "lwptut 3" .TH lwptut 3 "2023-02-27" "perl v5.16.3" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. 
.if n .ad l .nh .SH "NAME" lwptut \-\- An LWP Tutorial .SH "DESCRIPTION" .IX Header "DESCRIPTION" \&\s-1LWP \s0(short for \*(L"Library for \s-1WWW\s0 in Perl\*(R") is a very popular group of Perl modules for accessing data on the Web. Like most Perl module-distributions, each of \s-1LWP\s0's component modules comes with documentation that is a complete reference to its interface. However, there are so many modules in \s-1LWP\s0 that it's hard to know where to start looking for information on how to do even the simplest most common things. .PP Really introducing you to using \s-1LWP\s0 would require a whole book \*(-- a book that just happens to exist, called \fIPerl & \s-1LWP\s0\fR. But this article should give you a taste of how you can go about some common tasks with \&\s-1LWP.\s0 .SS "Getting documents with LWP::Simple" .IX Subsection "Getting documents with LWP::Simple" If you just want to get what's at a particular \s-1URL,\s0 the simplest way to do it is LWP::Simple's functions. .PP In a Perl program, you can call its \f(CW\*(C`get($url)\*(C'\fR function. It will try getting that \s-1URL\s0's content. If it works, then it'll return the content; but if there's some error, it'll return undef. .PP .Vb 2 \& my $url = \*(Aqhttp://www.npr.org/programs/fa/?todayDate=current\*(Aq; \& # Just an example: the URL for the most recent /Fresh Air/ show \& \& use LWP::Simple; \& my $content = get $url; \& die "Couldn\*(Aqt get $url" unless defined $content; \& \& # Then go do things with $content, like this: \& \& if($content =~ m/jazz/i) { \& print "They\*(Aqre talking about jazz today on Fresh Air!\en"; \& } \& else { \& print "Fresh Air is apparently jazzless today.\en"; \& } .Ve .PP The handiest variant on \f(CW\*(C`get\*(C'\fR is \f(CW\*(C`getprint\*(C'\fR, which is useful in Perl one-liners. 
If it can get the page whose \s-1URL\s0 you provide, it sends it to \s-1STDOUT\s0; otherwise it complains to \s-1STDERR.\s0 .PP .Vb 1 \& % perl \-MLWP::Simple \-e "getprint \*(Aqhttp://www.cpan.org/RECENT\*(Aq" .Ve .PP That is the \s-1URL\s0 of a plain text file that lists new files in \s-1CPAN\s0 in the past two weeks. You can easily make it part of a tidy little shell command, like this one that mails you the list of new \&\f(CW\*(C`Acme::\*(C'\fR modules: .PP .Vb 2 \& % perl \-MLWP::Simple \-e "getprint \*(Aqhttp://www.cpan.org/RECENT\*(Aq" \e \& | grep "/by\-module/Acme" | mail \-s "New Acme modules! Joy!" $USER .Ve .PP There are other useful functions in LWP::Simple, including \f(CW\*(C`head\*(C'\fR for running a \s-1HEAD\s0 request on a \s-1URL \s0(useful for checking links, or getting the last-revised time of a \s-1URL\s0), and \f(CW\*(C`getstore\*(C'\fR and \f(CW\*(C`mirror\*(C'\fR for saving/mirroring a \s-1URL\s0 to a local file. See the LWP::Simple documentation for the full details, or chapter 2 of \fIPerl & \s-1LWP\s0\fR for more examples. .SS "The Basics of the \s-1LWP\s0 Class Model" .IX Subsection "The Basics of the LWP Class Model" LWP::Simple's functions are handy for simple cases, but they don't support cookies or authorization, don't support setting header lines in the \s-1HTTP\s0 request, and generally don't support reading header lines in the \s-1HTTP\s0 response (notably the full \s-1HTTP\s0 error message, in case of an error). To get at all those features, you'll have to use the full \s-1LWP\s0 class model. .PP While \s-1LWP\s0 consists of dozens of classes, the main two that you have to understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class for \*(L"virtual browsers\*(R" which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests. 
.PP The basic idiom is \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR, or more fully illustrated: .PP .Vb 1 \& # Early in your program: \& \& use LWP 5.64; # Loads all important LWP classes, and makes \& # sure your version is reasonably recent. \& \& my $browser = LWP::UserAgent\->new; \& \& ... \& \& # Then later, whenever you need to make a get request: \& my $url = \*(Aqhttp://www.npr.org/programs/fa/?todayDate=current\*(Aq; \& \& my $response = $browser\->get( $url ); \& die "Can\*(Aqt get $url \-\- ", $response\->status_line \& unless $response\->is_success; \& \& die "Hey, I was expecting HTML, not ", $response\->content_type \& unless $response\->content_type eq \*(Aqtext/html\*(Aq; \& # or whatever content\-type you\*(Aqre equipped to deal with \& \& # Otherwise, process the content somehow: \& \& if($response\->decoded_content =~ m/jazz/i) { \& print "They\*(Aqre talking about jazz today on Fresh Air!\en"; \& } \& else { \& print "Fresh Air is apparently jazzless today.\en"; \& } .Ve .PP There are two objects involved: \f(CW$browser\fR, which holds an object of class LWP::UserAgent, and then the \f(CW$response\fR object, which is of class HTTP::Response. You really need only one browser object per program; but every time you make a request, you get back a new HTTP::Response object, which will have some interesting attributes: .IP "\(bu" 4 A status code indicating success or failure (which you can test with \f(CW\*(C`$response\->is_success\*(C'\fR). .IP "\(bu" 4 An \s-1HTTP\s0 status line that is hopefully informative if there's failure (which you can see with \f(CW\*(C`$response\->status_line\*(C'\fR, returning something like \*(L"404 Not Found\*(R"). .IP "\(bu" 4 A \s-1MIME\s0 content-type like \*(L"text/html\*(R", \*(L"image/gif\*(R", \&\*(L"application/xml\*(R", etc., which you can see with \&\f(CW\*(C`$response\->content_type\*(C'\fR .IP "\(bu" 4 The actual content of the response, in \f(CW\*(C`$response\->decoded_content\*(C'\fR. 
If the response is \s-1HTML,\s0 that's where the \s-1HTML\s0 source will be; if it's a \s-1GIF,\s0 then \f(CW\*(C`$response\->decoded_content\*(C'\fR will be the binary \&\s-1GIF\s0 data. .IP "\(bu" 4 And dozens of other convenient and more specific methods that are documented in the docs for HTTP::Response, and its superclasses HTTP::Message and HTTP::Headers. .SS "Adding Other \s-1HTTP\s0 Request Headers" .IX Subsection "Adding Other HTTP Request Headers" The most commonly used syntax for requests is \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR, but in truth, you can add extra \s-1HTTP\s0 header lines to the request by adding a list of key-value pairs after the \s-1URL,\s0 like so: .PP .Vb 1 \& $response = $browser\->get( $url, $key1, $value1, $key2, $value2, ... ); .Ve .PP For example, here's how to send some commonly used headers, in case you're dealing with a site that would otherwise reject your request: .PP .Vb 6 \& my @ns_headers = ( \& \*(AqUser\-Agent\*(Aq => \*(AqMozilla/4.76 [en] (Win98; U)\*(Aq, \& \*(AqAccept\*(Aq => \*(Aqimage/gif, image/x\-xbitmap, image/jpeg, image/pjpeg, image/png, */*\*(Aq, \& \*(AqAccept\-Charset\*(Aq => \*(Aqiso\-8859\-1,*,utf\-8\*(Aq, \& \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq, \& ); \& \& ... 
\& \& $response = $browser\->get($url, @ns_headers); .Ve .PP If you weren't reusing that array, you could just go ahead and do this: .PP .Vb 6 \& $response = $browser\->get($url, \& \*(AqUser\-Agent\*(Aq => \*(AqMozilla/4.76 [en] (Win98; U)\*(Aq, \& \*(AqAccept\*(Aq => \*(Aqimage/gif, image/x\-xbitmap, image/jpeg, image/pjpeg, image/png, */*\*(Aq, \& \*(AqAccept\-Charset\*(Aq => \*(Aqiso\-8859\-1,*,utf\-8\*(Aq, \& \*(AqAccept\-Language\*(Aq => \*(Aqen\-US\*(Aq, \& ); .Ve .PP If you were only ever changing the 'User\-Agent' line, you could just change the \f(CW$browser\fR object's default line from \*(L"libwww\-perl/5.65\*(R" (or the like) to whatever you like, using the LWP::UserAgent \f(CW\*(C`agent\*(C'\fR method: .PP .Vb 1 \& $browser\->agent(\*(AqMozilla/4.76 [en] (Win98; U)\*(Aq); .Ve .SS "Enabling Cookies" .IX Subsection "Enabling Cookies" A default LWP::UserAgent object acts like a browser with its cookies support turned off. There are various ways of turning it on, by setting its \f(CW\*(C`cookie_jar\*(C'\fR attribute. A \*(L"cookie jar\*(R" is an object representing a little database of all the \s-1HTTP\s0 cookies that a browser knows about. It can correspond to a file on disk or an in-memory object that starts out empty, and whose collection of cookies will disappear once the program is finished running. .PP To give a browser an in-memory empty cookie jar, you set its \f(CW\*(C`cookie_jar\*(C'\fR attribute like so: .PP .Vb 2 \& use HTTP::CookieJar::LWP; \& $browser\->cookie_jar( HTTP::CookieJar::LWP\->new ); .Ve .PP To save a cookie jar to disk, see \*(L"dump_cookies\*(R" in HTTP::CookieJar. To load cookies from disk into a jar, see \*(L"load_cookies\*(R" in HTTP::CookieJar. .SS "Posting Form Data" .IX Subsection "Posting Form Data" Many \s-1HTML\s0 forms send data to their server using an \s-1HTTP POST\s0 request, which you can send with this syntax: .PP .Vb 7 \& $response = $browser\->post( $url, \& [ \& formkey1 => value1, \& formkey2 => value2, \& ... 
\& ], \& ); .Ve .PP Or if you need to send \s-1HTTP\s0 headers: .PP .Vb 9 \& $response = $browser\->post( $url, \& [ \& formkey1 => value1, \& formkey2 => value2, \& ... \& ], \& headerkey1 => value1, \& headerkey2 => value2, \& ); .Ve .PP For example, the following program makes a search request to AltaVista (by sending some form data via an \s-1HTTP POST\s0 request), and extracts from the \s-1HTML\s0 the report of the number of matches: .PP .Vb 4 \& use strict; \& use warnings; \& use LWP 5.64; \& my $browser = LWP::UserAgent\->new; \& \& my $word = \*(Aqtarragon\*(Aq; \& \& my $url = \*(Aqhttp://search.yahoo.com/yhs/search\*(Aq; \& my $response = $browser\->post( $url, \& [ \*(Aqq\*(Aq => $word, # the Altavista query string \& \*(Aqfr\*(Aq => \*(Aqaltavista\*(Aq, \*(Aqpg\*(Aq => \*(Aqq\*(Aq, \*(Aqavkw\*(Aq => \*(Aqtgz\*(Aq, \*(Aqkl\*(Aq => \*(AqXX\*(Aq, \& ] \& ); \& die "$url error: ", $response\->status_line \& unless $response\->is_success; \& die "Weird content type at $url \-\- ", $response\->content_type \& unless $response\->content_is_html; \& \& if( $response\->decoded_content =~ m{([0\-9,]+)(?:<.*?>)? results for} ) { \& # The substring will be like "996,000</strong> results for" \& print "$word: $1\en"; \& } \& else { \& print "Couldn\*(Aqt find the match\-string in the response\en"; \& } .Ve .SS "Sending \s-1GET\s0 Form Data" .IX Subsection "Sending GET Form Data" Some \s-1HTML\s0 forms convey their form data not by sending the data in an \s-1HTTP POST\s0 request, but by making a normal \s-1GET\s0 request with the data stuck on the end of the \s-1URL. 
\s0 For example, if you went to \&\f(CW\*(C`www.imdb.com\*(C'\fR and ran a search on \*(L"Blade Runner\*(R", the \s-1URL\s0 you'd see in your browser window would be: .PP .Vb 1 \& http://www.imdb.com/find?s=all&q=Blade+Runner .Ve .PP To run the same search with \s-1LWP,\s0 you'd use this idiom, which involves the \s-1URI\s0 class: .PP .Vb 3 \& use URI; \& my $url = URI\->new( \*(Aqhttp://www.imdb.com/find\*(Aq ); \& # makes an object representing the URL \& \& $url\->query_form( # And here the form data pairs: \& \*(Aqq\*(Aq => \*(AqBlade Runner\*(Aq, \& \*(Aqs\*(Aq => \*(Aqall\*(Aq, \& ); \& \& my $response = $browser\->get($url); .Ve .PP See chapter 5 of \fIPerl & \s-1LWP\s0\fR for a longer discussion of \s-1HTML\s0 forms and of form data, and chapters 6 through 9 for a longer discussion of extracting data from \s-1HTML.\s0 .SS "Absolutizing URLs" .IX Subsection "Absolutizing URLs" The \s-1URI\s0 class that we just mentioned above provides all sorts of methods for accessing and modifying parts of URLs (such as asking what sort of \s-1URL\s0 it is with \f(CW\*(C`$url\->scheme\*(C'\fR, asking what host it refers to with \f(CW\*(C`$url\->host\*(C'\fR, and so on), as described in the docs for the \s-1URI\s0 class. 
However, the methods of most immediate interest are the \f(CW\*(C`query_form\*(C'\fR method seen above, and now the \f(CW\*(C`new_abs\*(C'\fR method for taking a probably-relative \s-1URL\s0 string (like \*(L"../foo.html\*(R") and getting back an absolute \s-1URL \s0(like \*(L"http://www.perl.com/stuff/foo.html\*(R"), as shown here: .PP .Vb 2 \& use URI; \& $abs = URI\->new_abs($maybe_relative, $base); .Ve .PP For example, consider this program that matches URLs in the \s-1HTML\s0 list of new modules in \s-1CPAN:\s0 .PP .Vb 4 \& use strict; \& use warnings; \& use LWP; \& my $browser = LWP::UserAgent\->new; \& \& my $url = \*(Aqhttp://www.cpan.org/RECENT.html\*(Aq; \& my $response = $browser\->get($url); \& die "Can\*(Aqt get $url \-\- ", $response\->status_line \& unless $response\->is_success; \& \& my $html = $response\->decoded_content; \& while( $html =~ m/<A HREF=\e"(.*?)\e"/g ) { \& print "$1\en"; \& } .Ve .PP When run, it emits output that starts out something like this: .PP .Vb 7 \& MIRRORING.FROM \& RECENT \& RECENT.html \& authors/00whois.html \& authors/01mailrc.txt.gz \& authors/id/A/AA/AASSAD/CHECKSUMS \& ... .Ve .PP However, if you actually want to have those be absolute URLs, you can use the \s-1URI\s0 module's \f(CW\*(C`new_abs\*(C'\fR method, by changing the \f(CW\*(C`while\*(C'\fR loop to this: .PP .Vb 3 \& while( $html =~ m/<A HREF=\e"(.*?)\e"/g ) { \& print URI\->new_abs( $1, $response\->base ) ,"\en"; \& } .Ve .PP (The \f(CW\*(C`$response\->base\*(C'\fR method from HTTP::Message is for returning what \s-1URL\s0 should be used for resolving relative URLs \*(-- it's usually just the same as the \s-1URL\s0 that you requested.) .PP That program then emits nicely absolute URLs: .PP .Vb 7 \& http://www.cpan.org/MIRRORING.FROM \& http://www.cpan.org/RECENT \& http://www.cpan.org/RECENT.html \& http://www.cpan.org/authors/00whois.html \& http://www.cpan.org/authors/01mailrc.txt.gz \& http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS \& ... 
.Ve .PP See chapter 4 of \fIPerl & \s-1LWP\s0\fR for a longer discussion of \s-1URI\s0 objects. .PP Of course, using a regexp to match hrefs is a bit simplistic, and for more robust programs, you'll probably want to use an HTML-parsing module like HTML::LinkExtor or HTML::TokeParser or even maybe HTML::TreeBuilder. .SS "Other Browser Attributes" .IX Subsection "Other Browser Attributes" LWP::UserAgent objects have many attributes for controlling how they work. Here are a few notable ones: .IP "\(bu" 4 \&\f(CW\*(C`$browser\->timeout(15);\*(C'\fR .Sp This sets this browser object to give up on requests that don't answer within 15 seconds. .IP "\(bu" 4 \&\f(CW\*(C`$browser\->protocols_allowed( [ \*(Aqhttp\*(Aq, \*(Aqgopher\*(Aq] );\*(C'\fR .Sp This sets this browser object to not speak any protocols other than \s-1HTTP\s0 and gopher. If it tries accessing any other kind of \s-1URL \s0(like an \*(L"ftp:\*(R" or \*(L"mailto:\*(R" or \*(L"news:\*(R" \s-1URL\s0), then it won't actually try connecting, but instead will immediately return an error code 500, with a message like \&\*(L"Access to 'ftp' URIs has been disabled\*(R". .IP "\(bu" 4 \&\f(CW\*(C`use LWP::ConnCache; $browser\->conn_cache(LWP::ConnCache\->new());\*(C'\fR .Sp This tells the browser object to try using the \s-1HTTP/1.1 \s0\*(L"Keep-Alive\*(R" feature, which speeds up requests by reusing the same socket connection for multiple requests to the same server. .IP "\(bu" 4 \&\f(CW\*(C`$browser\->agent( \*(AqSomeName/1.23 (more info here maybe)\*(Aq )\*(C'\fR .Sp This changes how the browser object will identify itself in the default \*(L"User-Agent\*(R" line in its \s-1HTTP\s0 requests. By default, it'll send \*(L"libwww\-perl/\fIversionnumber\fR\*(R", like \&\*(L"libwww\-perl/5.65\*(R". 
You can change that to something more descriptive like this: .Sp .Vb 1 \& $browser\->agent( \*(AqSomeName/3.14 (contact@robotplexus.int)\*(Aq ); .Ve .Sp Or if need be, you can go in disguise, like this: .Sp .Vb 1 \& $browser\->agent( \*(AqMozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)\*(Aq ); .Ve .IP "\(bu" 4 \&\f(CW\*(C`push @{ $browser\->requests_redirectable }, \*(AqPOST\*(Aq;\*(C'\fR .Sp This tells this browser to obey redirection responses to \s-1POST\s0 requests (like most modern interactive browsers), even though the \s-1HTTP RFC\s0 says that should not normally be done. .PP For more options and information, see the full documentation for LWP::UserAgent. .SS "Writing Polite Robots" .IX Subsection "Writing Polite Robots" If you want to make sure that your LWP-based program respects \fIrobots.txt\fR files and doesn't make too many requests too fast, you can use the LWP::RobotUA class instead of the LWP::UserAgent class. .PP The LWP::RobotUA class is used just like LWP::UserAgent, and you can use it like so: .PP .Vb 3 \& use LWP::RobotUA; \& my $browser = LWP::RobotUA\->new(\*(AqYourSuperBot/1.34\*(Aq, \*(Aqyou@yoursite.com\*(Aq); \& # Your bot\*(Aqs name and your email address \& \& my $response = $browser\->get($url); .Ve .PP But LWP::RobotUA adds these features: .IP "\(bu" 4 If the \fIrobots.txt\fR on \f(CW$url\fR's server forbids you from accessing \&\f(CW$url\fR, then the \f(CW$browser\fR object (assuming it's of class LWP::RobotUA) won't actually request it, but instead will give you back (in \f(CW$response\fR) a 403 error with a message \*(L"Forbidden by robots.txt\*(R". 
That is, if you have this line: .Sp .Vb 2 \& die "$url \-\- ", $response\->status_line, "\enAborted" \& unless $response\->is_success; .Ve .Sp then the program would die with an error message like this: .Sp .Vb 2 \& http://whatever.site.int/pith/x.html \-\- 403 Forbidden by robots.txt \& Aborted at whateverprogram.pl line 1234 .Ve .IP "\(bu" 4 If this \f(CW$browser\fR object sees that the last time it talked to \&\f(CW$url\fR's server was too recent, then it will pause (via \f(CW\*(C`sleep\*(C'\fR) to avoid making too many requests too often. The default pause is one minute, but you can control it with the \f(CW\*(C`$browser\->delay( \f(CIminutes\f(CW )\*(C'\fR attribute. .Sp For example, this code: .Sp .Vb 1 \& $browser\->delay( 7/60 ); .Ve .Sp \&...means that this browser will pause when it needs to avoid talking to any given server more than once every 7 seconds. .PP For more options and information, see the full documentation for LWP::RobotUA. .SS "Using Proxies" .IX Subsection "Using Proxies" In some cases, you will want to (or will have to) use proxies for accessing certain sites and/or using certain protocols. This is most commonly the case when your \s-1LWP\s0 program is running (or could be running) on a machine that is behind a firewall. .PP To make a browser object use proxies that are defined in the usual environment variables (\f(CW\*(C`HTTP_PROXY\*(C'\fR, etc.), just call the \f(CW\*(C`env_proxy\*(C'\fR method on a user-agent object before you go making any requests on it. Specifically: .PP .Vb 2 \& use LWP::UserAgent; \& my $browser = LWP::UserAgent\->new; \& \& # And before you go making any requests: \& $browser\->env_proxy; .Ve .PP For more information on proxy parameters, see the LWP::UserAgent documentation, specifically the \f(CW\*(C`proxy\*(C'\fR, \f(CW\*(C`env_proxy\*(C'\fR, and \f(CW\*(C`no_proxy\*(C'\fR methods. 
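.PP
If the proxy environment variables are not set on your machine, the
\&\f(CW\*(C`proxy\*(C'\fR and \f(CW\*(C`no_proxy\*(C'\fR methods of LWP::UserAgent can
probably do the same job by hand. The following is only a minimal sketch:
the proxy address \f(CW\*(C`http://proxy.example.int:8080/\*(C'\fR and the
excluded host names are made-up placeholders, so substitute whatever your
own network actually uses.
.PP
.Vb 2
\&  use LWP::UserAgent;
\&  my $browser = LWP::UserAgent\->new;
\&
\&  # Send http and ftp requests through a (hypothetical) proxy:
\&  $browser\->proxy( [\*(Aqhttp\*(Aq, \*(Aqftp\*(Aq], \*(Aqhttp://proxy.example.int:8080/\*(Aq );
\&
\&  # But talk directly to these (hypothetical) hosts and domains:
\&  $browser\->no_proxy( \*(Aqlocalhost\*(Aq, \*(Aqintranet.example.int\*(Aq );
\&
\&  # Calling proxy with only a scheme returns the current setting:
\&  print "http proxy: ", $browser\->proxy(\*(Aqhttp\*(Aq), "\en";
.Ve
.PP
As with \f(CW\*(C`env_proxy\*(C'\fR, make these calls before you go making any
requests on the browser object.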
.SS "\s-1HTTP\s0 Authentication" .IX Subsection "HTTP Authentication" Many web sites restrict access to documents by using \*(L"\s-1HTTP\s0 Authentication\*(R". This isn't just any form of \*(L"enter your password\*(R" restriction, but is a specific mechanism where the \s-1HTTP\s0 server sends the browser an \s-1HTTP\s0 code that says \*(L"That document is part of a protected \&'realm', and you can access it only if you re-request it and add some special authorization headers to your request\*(R". .PP For example, the Unicode.org admins stop email-harvesting bots from harvesting the contents of their mailing list archives, by protecting them with \s-1HTTP\s0 Authentication, and then publicly stating the username and password (at \f(CW\*(C`http://www.unicode.org/mail\-arch/\*(C'\fR) \*(-- namely username \*(L"unicode-ml\*(R" and password \*(L"unicode\*(R". .PP For example, consider this \s-1URL,\s0 which is part of the protected area of the web site: .PP .Vb 1 \& http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html .Ve .PP If you access that with a browser, you'll get a prompt like \&\*(L"Enter username and password for 'Unicode\-MailList\-Archives' at server \&'www.unicode.org'\*(R". .PP In \s-1LWP,\s0 if you just request that \s-1URL,\s0 like this: .PP .Vb 2 \& use LWP; \& my $browser = LWP::UserAgent\->new; \& \& my $url = \& \*(Aqhttp://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html\*(Aq; \& my $response = $browser\->get($url); \& \& die "Error: ", $response\->header(\*(AqWWW\-Authenticate\*(Aq) || \*(AqError accessing\*(Aq, \& # (\*(AqWWW\-Authenticate\*(Aq is the realm\-name) \& "\en ", $response\->status_line, "\en at $url\en Aborting" \& unless $response\->is_success; .Ve .PP Then you'll get this error: .PP .Vb 4 \& Error: Basic realm="Unicode\-MailList\-Archives" \& 401 Authorization Required \& at http://www.unicode.org/mail\-arch/unicode\-ml/y2002\-m08/0067.html \& Aborting at auth1.pl line 9. 
[or wherever] .Ve .PP \&...because the \f(CW$browser\fR doesn't know the username and password for that realm (\*(L"Unicode-MailList-Archives\*(R") at that host (\*(L"www.unicode.org\*(R"). The simplest fix is to use the \f(CW\*(C`credentials\*(C'\fR method to give it a username and password that it can try using for that realm at that host. The syntax is: .PP .Vb 5 \& $browser\->credentials( \& \*(Aqservername:portnumber\*(Aq, \& \*(Aqrealm\-name\*(Aq, \& \*(Aqusername\*(Aq => \*(Aqpassword\*(Aq \& ); .Ve .PP In most cases, the port number is 80, the default \s-1TCP/IP\s0 port for \s-1HTTP\s0; and you usually call the \f(CW\*(C`credentials\*(C'\fR method before you make any requests. For example: .PP .Vb 5 \& $browser\->credentials( \& \*(Aqreports.mybazouki.com:80\*(Aq, \& \*(Aqweb_server_usage_reports\*(Aq, \& \*(Aqplinky\*(Aq => \*(Aqbanjo123\*(Aq \& ); .Ve .PP So if we add the following to the program above, right after the \f(CW\*(C`$browser = LWP::UserAgent\->new;\*(C'\fR line... .PP .Vb 5 \& $browser\->credentials( # add this to our $browser\*(Aqs "key ring" \& \*(Aqwww.unicode.org:80\*(Aq, \& \*(AqUnicode\-MailList\-Archives\*(Aq, \& \*(Aqunicode\-ml\*(Aq => \*(Aqunicode\*(Aq \& ); .Ve .PP \&...then when we run it, the request succeeds, instead of causing the \&\f(CW\*(C`die\*(C'\fR to be called. .SS "Accessing \s-1HTTPS\s0 URLs" .IX Subsection "Accessing HTTPS URLs" When you access an \s-1HTTPS URL,\s0 it'll work for you just like an \s-1HTTP URL\s0 would \*(-- if your \s-1LWP\s0 installation has \s-1HTTPS\s0 support (via an appropriate Secure Sockets Layer library). For example: .PP .Vb 8 \& use LWP; \& my $url = \*(Aqhttps://www.paypal.com/\*(Aq; # Yes, HTTPS! \& my $browser = LWP::UserAgent\->new; \& my $response = $browser\->get($url); \& die "Error at $url\en ", $response\->status_line, "\en Aborting" \& unless $response\->is_success; \& print "Whee, it worked! 
I got that ", \& $response\->content_type, " document!\en"; .Ve .PP If your \s-1LWP\s0 installation doesn't have \s-1HTTPS\s0 support set up, then the response will be unsuccessful, and you'll get this error message: .PP .Vb 3 \& Error at https://www.paypal.com/ \& 501 Protocol scheme \*(Aqhttps\*(Aq is not supported \& Aborting at paypal.pl line 7. [or whatever program and line] .Ve .PP If your \s-1LWP\s0 installation \fIdoes\fR have \s-1HTTPS\s0 support installed, then the response should be successful, and you should be able to consult \&\f(CW$response\fR just like with any normal \s-1HTTP\s0 response. .PP For information about installing \s-1HTTPS\s0 support for your \s-1LWP\s0 installation, see the helpful \fI\s-1README.SSL\s0\fR file that comes in the libwww-perl distribution. .SS "Getting Large Documents" .IX Subsection "Getting Large Documents" When you're requesting a large (or at least potentially large) document, a problem with the normal way of using the request methods (like \f(CW\*(C`$response = $browser\->get($url)\*(C'\fR) is that the response object in memory will have to hold the whole document \*(-- \fIin memory\fR. If the response is a thirty megabyte file, this is likely to be quite an imposition on this process's memory usage. .PP A notable alternative is to have \s-1LWP\s0 save the content to a file on disk, instead of saving it up in memory. This is the syntax to use: .PP .Vb 3 \& $response = $ua\->get($url, \& \*(Aq:content_file\*(Aq => $filespec, \& ); .Ve .PP For example, .PP .Vb 3 \& $response = $ua\->get(\*(Aqhttp://search.cpan.org/\*(Aq, \& \*(Aq:content_file\*(Aq => \*(Aq/tmp/sco.html\*(Aq \& ); .Ve .PP When you use this \f(CW\*(C`:content_file\*(C'\fR option, the \f(CW$response\fR will have all the normal header lines, but \f(CW\*(C`$response\->content\*(C'\fR will be empty. 
Errors writing to the content file (for example due to permission denied or the filesystem being full) will be reported via the \f(CW\*(C`Client\-Aborted\*(C'\fR or \f(CW\*(C`X\-Died\*(C'\fR response headers, and not the \&\f(CW\*(C`is_success\*(C'\fR method: .PP .Vb 3 \& if ( ($response\->header(\*(AqClient\-Aborted\*(Aq) || \*(Aq\*(Aq) eq \*(Aqdie\*(Aq ) { \& # handle error ... \& } .Ve .PP Note that this \*(L":content_file\*(R" option isn't supported under older versions of \s-1LWP,\s0 so you should consider adding \f(CW\*(C`use LWP 5.66;\*(C'\fR to check the \s-1LWP\s0 version, if you think your program might run on systems with older versions. .PP If you need to be compatible with older \s-1LWP\s0 versions, then use this syntax, which does the same thing: .PP .Vb 2 \& use HTTP::Request::Common; \& $response = $ua\->request( GET($url), $filespec ); .Ve .SH "SEE ALSO" .IX Header "SEE ALSO" Remember, this article is just the most rudimentary introduction to \&\s-1LWP\s0 \*(-- to learn more about \s-1LWP\s0 and LWP-related tasks, you really must read from the following: .IP "\(bu" 4 LWP::Simple \*(-- simple functions for getting/heading/mirroring URLs .IP "\(bu" 4 \&\s-1LWP\s0 \*(-- overview of the libwww-perl modules .IP "\(bu" 4 LWP::UserAgent \*(-- the class for objects that represent \*(L"virtual browsers\*(R" .IP "\(bu" 4 HTTP::Response \*(-- the class for objects that represent the response to an \s-1LWP\s0 request, as in \f(CW\*(C`$response = $browser\->get(...)\*(C'\fR .IP "\(bu" 4 HTTP::Message and HTTP::Headers \*(-- classes that provide more methods to HTTP::Response. .IP "\(bu" 4 \&\s-1URI\s0 \*(-- class for objects that represent absolute or relative URLs .IP "\(bu" 4 URI::Escape \*(-- functions for URL-escaping and URL-unescaping strings (like turning \*(L"this & that\*(R" to and from \*(L"this%20%26%20that\*(R"). .IP "\(bu" 4 HTML::Entities \*(-- functions for HTML-escaping and HTML-unescaping strings (like turning \*(L"C. & E. Bronte\*:\*(R" to and from \*(L"C. &amp; E. 
Bront&euml;\*(R") .IP "\(bu" 4 HTML::TokeParser and HTML::TreeBuilder \*(-- classes for parsing \s-1HTML\s0 .IP "\(bu" 4 HTML::LinkExtor \*(-- class for finding links in \s-1HTML\s0 documents .IP "\(bu" 4 The book \fIPerl & \s-1LWP\s0\fR by Sean M. Burke. O'Reilly & Associates, 2002. \s-1ISBN: 0\-596\-00178\-9, \s0<http://oreilly.com/catalog/perllwp/>. The whole book is also available free online: <http://lwp.interglacial.com>. .SH "COPYRIGHT" .IX Header "COPYRIGHT" Copyright 2002, Sean M. Burke. You can redistribute this document and/or modify it, but only under the same terms as Perl itself. .SH "AUTHOR" .IX Header "AUTHOR" Sean M. Burke \f(CW\*(C`sburke@cpan.org\*(C'\fR