IceBearSoft Perl Package version 1.0

This document describes a few Perl modules. Perl programs are usually documented via POD, but Perl programs are portable across platforms while man pages are only available on UNIX. This is why I decided to document my modules in HTML.

Licence

The software can be used and redistributed according to GPL or GPL for the Czech Republic. See http://icebearsoft.euweb.cz/czgpl/ for detailed information.

Installation

The modules and scripts require Perl5 and some of them are object-oriented. I have seen a Windows implementation of Perl5 which did not recognize the new method. Be sure that you have the correct version of Perl.

Before you install the files on UNIX, you have to convert the line endings: since development is done on OS/2, all files have DOS line endings. Afterwards switch to the main directory of this distribution and run install.pl. It recognizes the following options:

The default library directory is taken first from $ENV{'PERL5LIB'}, then from $ENV{'PERLLIB'}, provided it does not contain a path separator.

The default binary directory is the first segment of $ENV{'PATH'} which starts with the contents of $ENV{'ZWPERL'} (if it is defined).

The default CGI directory is taken from $ENV{'CGIDIR'}.

The default HTML directory is taken from $ENV{'HTMLDOC'}.

Question: Is it safe to call flip -u **/* or dos2unix **/*, possibly repeatedly? If so, the next version will check for the existence of these programs and convert the line endings automatically.

Comment: Since it is not necessary to put the files into system directories, and the paths can be defined through environment variables in the user's profile, root permissions are not required and any user can install the package himself or herself. Anyway, if the package seems to be usable for all users of the system, it is better to ask root for a system-wide installation.

Tip: If you do not wish to install one of the directories in this distribution, just specify /dev/null as the installation path.

Platform (in)dependence

The modules and scripts were developed and tested on OS/2 Warp 3.0 and 4.0. Unlike UNIX, OS/2 distinguishes between ASCII and binary files. The scripts therefore contain binmode calls. This should be harmless on UNIX systems. If it causes problems, let me know and I will add a switch.

Moreover, all supplied files have DOS line endings. Most probably you must convert them before the modules can be used on UNIX systems.

Description of modules

This document describes the following modules and scripts:

  1. ZWdebug.pm
  2. PrintList.pm
  3. Zwebfun.pm
  4. ZWsgml.pm
  5. ZWurl.pm
  6. isA.pm
  7. http.pl
  8. http-retrieve.pl
  9. perl.cmd
  10. LinkChecker

ZWdebug.pm

This is a simple module which is useful mainly for debugging. It can list scalar variables, arrays and hashes to standard output. It is the very first module I wrote (because my first script did not work and I did not know why), therefore it is really very simple. The module is superseded by PrintList.pm.

Requirements: Exporter

See also: ZWurl.pm

Usage

The module uses EXPORT_OK. It is therefore necessary to list all functions which you want to use. See the example at the end of the description.

sub list_array($;@);

This function lists the contents of an array. The first argument is the title which will appear on the printout. If the first argument is identical to the name of the array, the second argument may be omitted. Thus the following two statements have the same meaning:

list_array 'my_array', @my_array;
list_array 'my_array';

sub list_hash($;%);

This function lists the contents of a hash. It displays keys and associated values but it does not recurse if a value is not a scalar. The second argument may be omitted if the text is identical to the name of the hash variable. The logic is similar to that of list_array above.

sub list_scalar($);

This function lists a scalar variable. You specify only the name of the variable, as plain text. The following code

$test = 'This is a test';
list_scalar 'test';

will display:

test = This is a test

sub list_scalars(@);

This function is very similar to the function above. It allows you to specify a list of names of scalar variables, e.g.:

$a = 1;  $b = 'word';  $c = 'This is some text.';
list_scalars 'a', 'b', 'c';

Example

This is a script which was used for testing the module. The words are in Czech without accents...

#!perl5

use IceBearSoft::Zwebfun;
use IceBearSoft::ZWdebug qw(list_hash list_array list_scalar list_scalars);

$wt = 3;

$request{'method'}='OPTIONS';
$request{'url'}='http://localhost/';

list_hash 'request', \%request;   sleep($wt);
%other = %request;
list_hash 'other', \%other;   sleep($wt);
list_hash 'pokus', {'a'=>'jedna', 'b'=>'dva'};   sleep($wt);
%other = &makeHash;  list_hash 'Funkce?', \%other;   sleep($wt);
list_hash 'Funkce???', {&makeHash};   sleep($wt);
list_hash 'Return hash', &retHash;   sleep($wt);
list_hash 'Return hash', retHash();   sleep($wt);

$a = 1;  $b = 'slovo';  $c = 'Toto je delsi text.';
list_scalar 'a';   sleep($wt);
list_scalar 'b';   sleep($wt);
list_scalar 'c';   sleep($wt);
list_scalars 'a', 'b', 'c';   sleep($wt);
list_array 'array', ['a', 'b', 'c'];   sleep($wt);
list_hash 'request';   sleep($wt);
list_hash 'request 2', \%request;   sleep($wt);
list_hash 'Hash', {'a'=>$a, 'b'=>$b, 'c'=>$c};   sleep($wt);
list_array 'Array', [$a, $b, $c];   sleep($wt);

exit;

sub makeHash {
  my %hash=('titul'=>'wizard','jmeno'=>'Gandalf');
  return %hash;
}

sub retHash { return {'titul'=>'wizard','jmeno'=>'Gandalf'};  }

PrintList.pm

This module provides an object which can list the contents of hashes and arrays as well as the values of scalars. It can even list the internal variables of objects, which are in fact special types of hashes. In most cases only one global instance is needed, but you can have as many instances as you like. The object keeps a list of references to arrays and hashes which have already been listed. For instance, the hash representation of a DOM tree may consist of a hash for the document node containing a list of references to the hashes representing the nested nodes, while the hash of each node contains a reference back to the hash of the document node and/or its parent node. Listing such a tree would cause an infinite loop. In such cases the object therefore writes "already listed" instead of descending again.

Requirements: Exporter

sub new($*)

This function instantiates the object. It has no parameters.

sub print_list

This is the only function intended for the user. It requires one or two arguments. The first argument must be a scalar or a reference; its value or contents will be printed, with nested elements indented. The second argument is optional; it must be a string which will be displayed as the title. The function writes its result to STDOUT. You can use select to redirect it elsewhere.
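
A minimal usage sketch follows. The module name under the IceBearSoft namespace is an assumption, by analogy with the other modules in this package; the argument order follows the description above.

#!perl5

use IceBearSoft::PrintList;   # assumed namespace

$pl = new IceBearSoft::PrintList;   # no parameters

# Build a small tree with a cycle: the child points back to its parent.
%child  = ('name' => 'child');
%parent = ('name' => 'parent', 'child' => \%child);
$child{'parent'} = \%parent;

# First argument is the value (scalar or reference), second the optional title.
$pl->print_list(\%parent, 'Tree with a cycle');
# The back reference is reported as already listed instead of looping forever.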

Zwebfun.pm

This module provides an object which enables communication with a WWW server via the HTTP protocol. It was originally inspired by getclient.pl, developed by Mark Gaither on 12 Feb 1994 (about tea time) and found at www.webtechs.com. Later it was rewritten and some features of HTTP/1.1 were added. Afterwards it was enhanced again, because some HTTP/1.0 servers are badly programmed and stop communicating when they see an HTTP/1.1 request, even if the request itself is effectively HTTP/1.0 and merely signals the ability to speak HTTP/1.1. Therefore, if a server sends an HTTP/1.0 response, the module notes it and all further requests to that server will only be HTTP/1.0. The module also tries to use KeepAlive when communicating with HTTP/1.1 servers, but this feature is useful only if the object is used several times from the same program.

Requirements: Carp, Socket, Exporter

sub new($*)

This function instantiates the object. It has no parameters and is called simply:

$http = new IceBearSoft::Zwebfun;

The function initializes the object and calls the system program hostname in order to obtain the network name of the machine, which is then used to get the IP address. Be sure that hostname is properly installed and can be found without specifying its full path. Alternatively you may modify the new function.

sub http

This is the only function intended for you. It accepts parameters supplied in a hash, or preferably as a pointer to a hash, performs the HTTP communication and returns the result as a pointer to another hash.

The parameter containing the request is a hash in which the following keys are recognized:

You will mostly supply the header Connection: close in order to inhibit KeepAlive on HTTP/1.1 connections (it may sometimes cause problems) and Byte-Range: ... for partial downloading.

The response contains copies of the fields file, proxy and url unless it finishes too early due to another error. In case of a serious error where the connection cannot be established, the field Error-Message is filled with an explanatory text. In case of a successful connection all response lines are stored in the hash. The first response line does not have a name and is stored as Status-Line. For easier use it is also parsed into Status, Sub-Status, Protocol-Type (always HTTP, but this may change in future versions), Protocol-Version and Reason-Phrase. Sub-Status is not usually used; only some servers respond with a status such as 404.1 (the number after the period becomes the Sub-Status). The number of bytes received in this connection is stored in Bytes-Received. It need not be the same as Content-Length: remember that Content-Length is the size of the object as reported by the server, while Bytes-Received is the number of bytes actually received by the module.
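
A hedged sketch of a single request follows. The 'headers' key is an assumption based on the paragraph above; the 'method' and 'url' keys appear in the ZWdebug example earlier in this document. Study http.pl for authoritative usage.

#!perl5

use IceBearSoft::Zwebfun;

$http = new IceBearSoft::Zwebfun;

%request = (
  'method'  => 'GET',
  'url'     => 'http://localhost/',
  'headers' => "Connection: close\015\012",   # inhibit KeepAlive (assumed key name)
);

$response = $http->http(\%request);   # a pointer to a hash is preferred

if (defined $response->{'Error-Message'}) {   # connection could not be established
  print 'Error: ', $response->{'Error-Message'}, "\n";
}
else {
  print $response->{'Status-Line'}, "\n";
  print 'Bytes received: ', $response->{'Bytes-Received'}, "\n";
}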

$http->{'read-timeout'}

This is a timeout value in seconds. If the module is reading data from the server and no data arrive within this time of the last read, the connection is closed. The default value is 30. You can change it to any value, e.g. to 60, by the following statement:

$http->{'read-timeout'} = 60;

Remember that too low a value will cause almost all connections to be closed prematurely, while too high a value means that a lost connection is diagnosed only after a long wait.

Examples

No example is available here. However, you can study the scripts http.pl and http-retrieve.pl which are also distributed in this package.

ZWsgml.pm

This module provides a simple SGML parser. It does not use a DTD, so it can be used when no DTD is available. The parser can also parse SGML files which contain errors. It is useful mainly in cases where it is not necessary to parse the whole document and only the contents of some tags are important. The parser returns only the tags with attributes, or the plain text. It does not provide any structure information (unlike e.g. SGMLS). If you want to know whether the plain text is inside a tag, you have to maintain your own stack. However, since ZWsgml does not use a DTD, the parser does not know that e.g. <BR> in HTML has no end tag (XML would use <BR/>). Some end tags may be omitted, but ZWsgml will not be able to recognize this. Therefore, if you really need such information, you should preferably use another tool which parses documents according to a DTD. The module is superseded by IceBearSoft::Xsgml, which is distributed separately. See http://icebearsoft.euweb.cz/sw.php.

The module defines one object and one plain function.

Requirements: Exporter

sub new($*)

This function instantiates the object. It requires one argument which is the file handle of a document to be parsed. See the example at the end of this chapter.

sub nextline

This function reads a line into an internal buffer. You may use it if you wish to redesign the parser. Do not mix it with calls to nextTag: the parser stores its state in internal variables, and if you mix calls to both functions, unpredictable things will happen. In normal situations you will never call this function.

sub nextTag

This function returns an array of two strings. The first string is the name of the tag (always in lowercase); the second string contains all attributes with their values. If no attributes are present, the second array element is an empty string. Plain text is returned in the second array element with the first element empty. If both elements are empty, the end of file was reached. Notice in the example below that it is necessary to check the length of the elements: the contents of an element may be a single digit 0, which would incorrectly be considered false (it really happens with one HTML file from the Apache documentation). If the elements are just tested for true and false, this condition may be incorrectly taken as the end of file.

sub attributes($)

This is a plain function. It requires the second element of the array returned by nextTag and splits it into a hash. Attribute names serve as keys in the hash and are always converted to lowercase. Remember that some attributes do not have a value, as e.g. in <ol compact> in HTML. In such a case the value is an empty string.

The function checks the first character after the equal sign. If it is a quote, the function assumes that the attribute value is delimited by quotes and will look for everything up to the next quote. The quotes will be removed from the attribute value. If the equal sign is followed by any other non-blank character, the function assumes that the value is terminated by the first white space.
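
For illustration, a small sketch of what these rules imply (the expected results in the comments are inferred from the description above, not taken from the module's test suite):

use IceBearSoft::ZWsgml;   # attributes() is exported as in the example below

%a = attributes('HREF="x.html" COMPACT width=50%');
# According to the rules above %a should contain:
#   compact => ''         (attribute without a value)
#   href    => 'x.html'   (quotes stripped, name lowercased)
#   width   => '50%'      (value terminated by white space)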

Example

The following script parses an SGML document and displays its contents. First the $parser instance is created. At the beginning of the loop we call $parser->nextTag. If a tag is found, it is printed together with the string of attributes. The attributes are then split and the hash is displayed. The else branch of the condition displays the plain text. Notice that we check the length of the strings returned from $parser->nextTag. If we used while ($k || $v), the loop would stop incorrectly, e.g. when the plain text contains only a single digit 0.

#!perl5

# Test of ZWsgml.pm, 31 Dec 1999

use IceBearSoft::ZWsgml;

($fn, $rest) = @ARGV;
if ($rest) { die "Superfluous arguments: $rest\n"; }

open (SGML, $fn) or die "Can't open $fn\n";
$parser = new IceBearSoft::ZWsgml (\*SGML); # create the parser for the open file
do {
  ($k, $v) = $parser->nextTag; # get next tag
  if (length($k) > 0) { # tag found
    print "\n<", $k, "> ", $v if $k || $v;
    if ($v && substr($k,0,1) ne '!') { # comments ignored
      my %a = attributes($v); # get attributes and store them in a hash
      foreach $key (sort(keys %a)) {
        print "\n==> ", $key, ' = ', $a{$key};
      }
    }
  }
  else { # print plain text (outside tags)
    print "\n->", $v if length($v) > 0;
  }
} while (length($k) + length($v) > 0);
close SGML;

ZWurl.pm

This module provides a set of functions for operations with URLs.

Requirements: Exporter

Usage

The module uses EXPORT_OK. It is therefore necessary to list all functions which you want to use.

sub abs_url($$)

This function accepts two strings, a referrer and a relative path, and returns an absolute path. All occurrences of '..' are removed and replaced with the correct directory names. It is not an error to supply an absolute path as the second argument; the function will just normalize it by removing '..'. The function returns an undefined value if the referrer is invalid. It works only with HTTP and FTP URLs.

sub split_url($)

This function splits a URL into its elements, which are returned in a hash. HTTP and FTP URLs are split into 'scheme', 'host', optionally 'port', 'object', and 'label' or 'search' if present. The values of 'scheme' and 'host' are always lowercase. The colons are not considered part of the scheme and port specifications, and the double slash is not part of the host name; they are always stripped off. The leading question mark of a search and the hash mark of a label are also deleted. If the object ends with '.' or '..', a terminating slash is added.

The function can also split other URLs, such as mailto or news, as defined in RFC 1738. These URLs contain only a scheme and an object.

The hash will also contain an element named 'url' which holds the original URL before splitting.

sub merge_url(%)

This function accepts a hash in the form returned by split_url and merges the parts into a URL. It ignores the 'url' key if present. RFC 1738 does not allow use of both a label and a search within a URL, but RFC 2396 says that the search parameters should appear before the label. This function builds the URL according to RFC 2396.
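
A short round-trip sketch using the keys produced by split_url above:

use IceBearSoft::ZWurl qw(split_url merge_url);

%u = split_url 'http://www.cz:8080/dir/file.html?a=b';
# %u now contains 'scheme', 'host', 'port', 'object', 'search' and 'url'.
$url = merge_url %u;   # the 'url' key is ignored when rebuilding
print $url, "\n";      # any search is placed before a label, per RFC 2396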

sub url_decode($)

This function URL-decodes the string supplied as an argument.

sub url_encode($)

This function URL-encodes the string supplied as an argument. It encodes only the main separators and spaces. RFC 1738 specifies other unsafe characters which are left unchanged by this function.

sub localize_url($$)

This function accepts a referrer and an absolute path and changes the path to a relative one if possible. It works with HTTP and FTP URLs only. It may give wrong results if the second argument is a relative path. The best way to use it is:

$local_path = localize_url $referrer, (abs_url $referrer, $other_path);

sub query_string()

This function returns the query string for both the GET and POST methods. It is not ready for form-based file upload. The function should be called only once; a second call may cause errors.

sub cgi_parse(;$)

This function parses the supplied query string. If no string is supplied, the function obtains the query string by a call to query_string. The result is returned in a hash. If a name has more than one value, as with <select multiple> or <input type=checkbox>, the value is changed into a pointer to an array of strings. Remember that the function does not know in advance that a name may be associated with two or more values; the conversion of a string into a pointer to an array occurs only at the moment the second value is found.

All keys and values are automatically URL decoded. Do not call url_decode yourself; it would spoil your data.
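
A minimal CGI sketch (the Content-type header is standard CGI output; the array check follows from the description above):

#!perl5

use IceBearSoft::ZWurl qw(cgi_parse);

print "Content-type: text/plain\015\012\015\012";

%form = cgi_parse;   # obtains the query string itself via query_string

foreach $name (sort keys %form) {
  $value = $form{$name};
  if (ref($value) eq 'ARRAY') {   # a second value converted it to an array
    print $name, ': ', join(', ', @$value), "\n";
  }
  else {
    print $name, ': ', $value, "\n";
  }
}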

sub parse_cookies(;$)

This function accepts a string with cookies and parses them into a hash. The cookie names as well as their values are automatically URL decoded. If no argument is specified, the function takes the string from the environment variable set by the WWW server.
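
A short sketch; reading $ENV{'HTTP_COOKIE'} explicitly is shown only for illustration, since the function finds the string itself when called without an argument:

use IceBearSoft::ZWurl qw(parse_cookies);

%cookies = parse_cookies;   # or: parse_cookies $ENV{'HTTP_COOKIE'};
foreach $name (sort keys %cookies) {
  print $name, ' = ', $cookies{$name}, "\n";
}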

sub http_url($)

This function is made for OS/2 but may be useful elsewhere. OS/2 does not allow to use // within command line arguments. Therefore this function prepends optionally http://. It will do nothing if the URL supplied as an argument already contains a scheme. The function should only be used for preparation of URL for Zwebfun or similar functions or objects. The function may fail with lots of valid URLs of different kinds.

sub WeekDay($)

The function returns an English name of the day. The argument must have the form as returned by gmtime.

sub Month($)

The function returns an English name of the month. The argument must have the form as returned by gmtime.

sub DateTime($)

This function returns a short version of date and time. It requires an argument in the same form as a result from the time function.

sub LongDateTime($)

This function returns a long version of date and time which is used in cookies. It requires an argument in the same form as a result from the time function.
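
A hedged sketch of all four helpers. The prototypes ($) suggest a single scalar argument, so the assumption below is that WeekDay and Month take the corresponding element of the list returned by gmtime:

use IceBearSoft::ZWurl qw(WeekDay Month DateTime LongDateTime);

@t = gmtime(time);                # $t[4] is the month, $t[6] the weekday
print WeekDay($t[6]), "\n";       # assumed: weekday index as delivered by gmtime
print Month($t[4]), "\n";         # assumed: month index as delivered by gmtime
print DateTime(time), "\n";       # short form, takes a value as returned by time
print LongDateTime(time), "\n";   # long, cookie-style form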

sub set_cookie($)

This function accepts a pointer to a hash containing the following variables:

The function returns the string in the form usable in Set-Cookie. The name and value are automatically URL encoded.

sub print_ref($)

This function accepts a single variable which may be a scalar or any type of pointer. The contents are then printed to the default output in HTML. Scalars are printed as they are, arrays as enumerations of elements, hashes as key/value pairs, and other types of pointers just as the name of the type. The function examines the types of the elements of arrays and hashes. If an element is a pointer, the function is called recursively.
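
For instance, a nested structure can be dumped in one call (a sketch; the exact HTML markup produced is not specified above):

use IceBearSoft::ZWurl qw(print_ref);

print_ref({
  'scalar' => 'text',
  'array'  => [1, 2, 3],    # printed as an enumeration of elements
  'nested' => {'a' => 1},   # a pointer, so print_ref recurses
});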

Example

The following code was used for testing. The URLs used are just groups of characters; it is not guaranteed that such objects really exist somewhere on the Internet. Notice that localization of local (not absolute) paths may give wrong results.

#!perl5
use IceBearSoft::ZWurl qw(url_encode url_decode split_url abs_url localize_url http_url);
use IceBearSoft::ZWdebug qw(list_hash);

# Test URL encoding/decoding

$a = "a + b?\015\012\011&=5%";
$b = url_encode $a;
$c = url_decode $b;
print $a, "\n";  print $b, "\n";  print $c, "\n";
print "OK\n\n" if $a eq $c;
print "Failed!\n\n" if $a ne $c;

# Test URL splitting

%u = split_url 'http://www.cz/dir/subdir/file.html#nic';
list_hash 'URL splitting', \%u;
%u = split_url 'http://www.cz/dir/subdir/file.html';
list_hash 'URL splitting', \%u;
%u = split_url 'http://www.cz/dir/subdir/file.html?a=b&c=d';
list_hash 'URL splitting', \%u;
%u = split_url 'http://www.cz:10954/dir/subdir/';
list_hash 'URL splitting', \%u;
%u = split_url 'http://www.cz:1095';
list_hash 'URL splitting', \%u;
%u = split_url 'http://www.cz';
list_hash 'URL splitting', \%u;

# Test abs path

$r = 'http://www.icpf.cas.cz/users/wagner/default.htm';
print "\nReferer = $r\n";
$a = 'index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = '../wagner/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = '../index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = '../kocka/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = '/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = 'kocka/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = '../../kocka/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = '//www.cz/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = 'http://www.cz/index.html';
print "$a => ", (abs_url $r, $a), "\n";
$a = './';
print "$a => ", (abs_url $r, $a), "\n";
$a = '.';
print "$a => ", (abs_url $r, $a), "\n";
$a = '..';
print "$a => ", (abs_url $r, $a), "\n";
$a = '../';
print "$a => ", (abs_url $r, $a), "\n";

# Test localization

sub loc($$);
print "\nReferer for loc test = $r\n";
loc $r, 'index.html';
loc $r, '../index.html';
loc $r, '../wagner/index.html';
loc $r, '../kocka/index.html';
loc $r, '/index.html';
loc $r, '//www.cz/index.html';
loc $r, 'http://www.icpf.cas.cz/';
loc $r, 'http://www.icpf.cas.cz';
loc $r, './';
loc $r, '.';
loc $r, '..';
loc $r, '../';

# Test http_url

sub ht($);
print "\nhttp_url test\n";
ht 'http://www.icpf.cas.cz';
ht 'ftp://www.icpf.cas.cz';
ht '//www.icpf.cas.cz/wagner/';
ht 'www.icpf.cas.cz/wagner/frame.html';

# Subroutines

sub loc($$) {
  my ($r, $a) = @_;
  print "$a => ", (localize_url $r, $a), "\n";
  my $b = abs_url $r, $a;
  print "$b => ", (localize_url $r, $b), "\n";
}

sub ht($) {
  my $x = shift;
  print "$x -> ", (http_url $x), "\n";
}

isA.pm

Object-oriented languages such as C++ or Java support polymorphism via abstract classes and virtual methods. Perl, on the contrary, does not offer any type checking. However, sometimes you may need to verify that a reference corresponds to an object derived from a particular class.

This package defines a single function which takes either a package name or an object reference and returns a top-down list of all package names, usually ending with Exporter. This can be used to find out whether an object is derived from a particular class. Suppose that you wish to check whether your object $obj is a MyPackage or derived from it by reblessing. You can use:

die 'Invalid object type' unless scalar(grep(/^MyPackage$/, isA $obj));

You can also supply a regular expression of valid types as the optional second parameter, e.g. isA $obj, '^My(First|Second)Package$'. This will check whether the object is a MyFirstPackage or a MySecondPackage. In list context the function returns all matches; in scalar context it returns the number of matches. You can thus write:

die 'Invalid object type' unless scalar(isA $obj, '^My(First|Second)Package$');

or just simply

die 'Invalid object type' unless isA $obj, '^My(First|Second)Package$';

http.pl

This script performs a single HTTP request. The result is displayed on screen but the body of the response may be stored in a file.

Requirements: Getopt::Long, Carp, IceBearSoft::ZWdebug(list_hash), IceBearSoft::ZWurl(http_url), IceBearSoft::Zwebfun

Usage

The script is invoked with the following command line options:

The URL must be specified either in --url or as a combination of --host and --object. The URL is then completed by a call to http_url. The --bytes option is used together with --file for partial download; this is, however, achieved more comfortably by http-retrieve.pl. Notice that the script sets the header Connection: close in order to prevent KeepAlive.

http-retrieve.pl

This script serves for partial downloading of files over HTTP/1.1. It may be used even if part of the file to be downloaded already exists locally.

Requirements: Getopt::Long, Carp, IceBearSoft::ZWdebug(list_hash), IceBearSoft::ZWurl(http_url), IceBearSoft::Zwebfun

Usage

The script accepts the following command line options:

The URL will be completed by a call to http_url. The file may already exist, in which case the script will immediately start in append mode. You may specify the maximum number of retries; the default value is 50. If the script fails after the specified number of retries, it is still possible to run it again with the same URL and file name.

Known bug: the script does not verify the Last-Modified response field. It is therefore possible that you will append a new object to an old file, which will result in a mess. However, the whole response is always displayed on screen, so you can verify it yourself.

perl.cmd

This REXX script was developed for OS/2. It is necessary there to specify the full path of the Perl script, because OS/2 strictly requires a backslash as a directory separator while Perl needs a forward slash. The script takes all arguments preceded by a minus sign as options for perl. The first argument which does not start with a minus is the name of the Perl script; this REXX script optionally adds the .pl extension. The Perl script is then looked for in the current directory and then across %PATH%, and its full path is used. The remaining arguments are passed as options to the Perl script.

LinkChecker

This script is used to verify links on web pages. It comes with its own documentation, which is installed on your web server. You can also read it locally from this distribution.


Z. Wagner - Ice Bear Soft, http://icebearsoft.euweb.cz