This document describes rr.pm, a Perl module to read ReDIF version 1 data. It is available in PDF format for US letter size paper and for A4 size paper.
rr.pm is a Perl module. It reads ReDIF data and validates the structure of the contents against a ReDIF specification contained in a separate specification file. It puts the valid templates into a hash structure where they can be easily accessed for further processing.
The software is available at ftp://openlib.org/acmes/root/soft/ReDIF-perl/. Unpack the software, then read any file called README for the latest changes that may not yet be documented on this page.
The package contains:
rr.pm: a Perl module, an application interface to the ReDIF parser
ReDIF::init.pm: a Perl module, a shared tool for ReDIF developers
rech: the ReDIF checker script
redif.spec: the ReDIF specification file
There are two ways to install the programme, called "standard" and "stand-alone", respectively. Before installing the programme, you should choose a ReDIF home directory.
You will want to use this package if you have or are going to have your own ReDIF metadata archive. Alternatively, you may find it useful if you produce or process ReDIF data or write software that uses ReDIF data.
If you have a directory structure as set out by the Guildford protocol, then the ReDIF home is your ReDIF archive directory. If you do not have such a directory structure, proceed as follows to create a stand-alone installation.
Choose or create a directory that will be the ReDIF home and copy the spec/, etc/ and doc/ directories of the package into it. If you do not have superuser rights to install the package with the make install command into a generally accessible Perl library tree, then you also need to copy the lib/ directory and the rech and rere scripts. Create a data/ directory as a subdirectory of the ReDIF home and store your ReDIF data files in it. This method is known as a "stand-alone" installation of ReDIF-perl. Don't forget to look at the $REDIFHOME/etc/rech.conf.eg file to configure the ReDIF checker rech.
Note that the package also contains a ReDIF specification file in the spec/ directory. However, please check with the community that uses ReDIF whether that specification is current. In the etc/ directory you will find a default configuration file for rech. The Perl modules themselves live in the lib/ directory.
Get the package file, decompress and unarchive it. To install the package use the following standard commands:
perl Makefile.PL
make
make test
make install
The perl Makefile.PL command will also execute the Configure utility, which will help you set your ReDIF home directory. Configure will ask you for the directory name and will offer to create the directory if it does not already exist. However, it will not create more than one level of new directories.
The last step, make install, installs rr.pm and other related libraries into a directory where perl will find them. For this step you will most likely wish to use superuser rights, if you have them. If you need to make a private installation of ReDIF-perl, use the PREFIX=DIR option of the perl Makefile.PL command or execute Makefile.PL with your private copy of perl. After this procedure the rech and rere scripts should be installed in a directory that is already included in your $PATH (e.g. /usr/bin). They can then be executed from anywhere on your system.
You do not need to determine the ReDIF home directory at the installation stage. In that case you can set the environment variable REDIFDIR, run the software from that directory, or give the command-line parameter --rdir (or --redif.dir). The choice is yours.
To call a function from rr.pm, you first need to make Perl load the module with a use or require statement:
use rr;
The rr.pm module provides a ReDIF reading interface through the functions rr::OpenRDF, rr::OpenDir, rr::OpenDirTree and rr::NextTemplate. The first three open data sources of different types and initiate a data stream to be read. The rr::NextTemplate function iterates through the stream of data opened by an rr::Open... function. It allows you to access the actual data, template by template.
rr::OpenRDF
Use:
rr::OpenRDF filename
rr::OpenRDF filename[, filepos]
rr::OpenRDF filename[, filepos][, showfilename]
This is the basic function, which initiates reading of a ReDIF file filename. It returns 1 if successful, 0 if not.
The optional filepos parameter (an integer) can be used to start reading at a specific position within the file. This may be useful if you need to get a specific template as quickly as possible. By default the file is read from the beginning.
The optional showfilename parameter sets how the file should be referred to, i.e. its logical file name. This will be the name stored in $::HashT{'FILENAME'}. Imagine that you read a file called /home/tim/RePEc/bob/bobseri.rdf. You may want it to be referred to as just bob/bobseri.rdf, because /home/tim/RePEc/ is not important and will be the same for all ReDIF files being read. In that case, use
rr::OpenRDF('/home/tim/RePEc/bob/bobseri.rdf', 0, 'bob/bobseri.rdf');
This feature may appear superfluous, but is quite useful in a range of applications. By default, the logical name is the same as the physical filename.
rr::OpenDir
Use:
rr::OpenDir directory
rr::OpenDir directory[, showprefixlength]
rr::OpenDir directory[, showprefixlength][, filtersub]
Whereas the rr::OpenRDF function initiates a one-file data stream, this function opens a whole directory of ReDIF files for subsequent reading, template by template. Thus it is a higher-level function. The exact name of the file each template has been read from is stored in $::HashT{'FILENAME'} for that template. Following the conventions of the Guildford protocol, only files with the extension .rdf (case insensitive) are read. The function returns the number of ReDIF files found if successful, 0 if not. You give the pathname of the directory to be read in directory.
The optional showprefixlength parameter serves the same purpose as the showfilename parameter of rr::OpenRDF. Here, however, it sets how many leading characters of the directory path to strip as meaningless in order to get a nice logical name. So if you read, say, the directory /home/tim/RePEc/bob/dylan0 and you want the logical names of the files to be bob/dylan0/..., then give the length of /home/tim/RePEc/ as showprefixlength, i.e.
rr::OpenDir('/home/tim/RePEc/bob/dylan0', length('/home/tim/RePEc/'));
By default, the logical name is the same as the full physical filename.
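The effect of a prefix length on the logical name can be pictured with plain string handling. The following is a minimal sketch, not the module's own code: the helper logical_name and the file name paper.rdf are hypothetical, and it assumes that the logical name is simply the physical name with the given number of leading characters removed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of rr.pm): drop the first
# $prefix_length characters of a physical path.
sub logical_name {
    my ($path, $prefix_length) = @_;
    return substr($path, $prefix_length);
}

my $prefix = '/home/tim/RePEc/';
my $path   = '/home/tim/RePEc/bob/dylan0/paper.rdf';   # invented file name
my $name   = logical_name($path, length($prefix));

print "$name\n";   # prints bob/dylan0/paper.rdf
```

Computing the length with length() rather than hard-coding a number keeps the call correct if the archive is moved to a different location.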
If the optional filtersub parameter is specified, it applies a user-defined filter function to choose templates that satisfy a user-defined set of criteria. This feature is described in Section 6.
rr::OpenDirTree
Use:
rr::OpenDirTree directory
rr::OpenDirTree directory[, showprefixlength]
rr::OpenDirTree directory[, showprefixlength][, filtersub]
This function is very much like the previous one, but it searches the whole sub-directory tree for ReDIF files. Similarly, it returns the number of ReDIF files found if successful, 0 if not. The showprefixlength and filtersub parameters have the same meaning as for the rr::OpenDir function.
rr::NextTemplate
This function advances the current (previously opened) data stream to the next template. It returns 1 if successful, 0 if not. For a one-file stream opened with rr::OpenRDF, a return value of 0 means that you have reached the end of the file. For multi-file streams opened with rr::OpenDir or rr::OpenDirTree, it means that you have reached the last correct template of the last ReDIF file that was found.
There are two ways of accessing the data in your scripts. You may choose to use one of them or both (or none) by setting the appropriate options in the hash %rr::Options. The selected modes influence how reading proceeds internally, and each way of accessing the data takes some run-time resources to prepare.
The two major forms of data presentation are the structured hash %::HashT and the full-text string $::HashT{'BUFFER'}.
%::HashT
When the structured presentation is enabled, then after a successful rr::NextTemplate call the variable %::HashT will be filled with the ReDIF template data according to the following rules.
4.1.1: Rule (a): non-cluster attributes
Simple (non-cluster) attributes of the template become keys of %::HashT, and these keys have the attribute values associated with them. The names of the keys are converted to lowercase. For example, Handle is a simple, non-cluster attribute. After reading a template, $::HashT{'handle'} (mind the lower case!) will give you the value of the handle.
Example
print "Paper: $::HashT{'title'} ($::HashT{'handle'})\n";
If $T is a reference to the hash, you can use it to access the data in %::HashT with a bit less code. The following piece of code uses such a reference instead of %::HashT itself. After reading a template, the script processes it depending on its type:
# Assign a reference to %::HashT to the $T variable
$T = \%::HashT;
#
# some rr::Open... function call is assumed here
#
return if !rr::NextTemplate;   # loading a new template
#
# checking the template type:
#
if ($T->{'template-type'} eq 'redif-paper 1.0') {
    print "Keywords: $T->{'keywords'}\n";
} elsif ($T->{'template-type'} eq 'redif-archive 1.0') {
    print "Archive description: $T->{'description'}\n";
}
4.1.2: Rule (b): clusters
Clusters within templates are repeatable. In the hash %::HashT they are represented as elements of an array such as @{$::HashT{'author'}}. Take the example of authors, who are described by the "Author-(PERSON*)" cluster. The "author" key of %::HashT will hold a reference to an array, which has as many entries as there are authors. Each entry will point to an independent hash. This hash contains the individual author's "(PERSON*)" cluster data. For example,
# assume a paper template has been successfully loaded
print "The paper has ", scalar @{$::HashT{'author'}}, " author(s):\n";
#
# now iterate through each author
foreach $au ( @{$::HashT{'author'}} ) {
    # now $au contains a reference to an author's data
    print "The author is $au->{'name'}\n";
}
or
# initializations
$T = \%::HashT;
$authors = $T->{'author'};
#
# an effort to be correct in English: checking the number of authors
#
if ($#$authors > 0) { $suffix = 's'; } else { $suffix = ''; }
#
print "Author$suffix:\n";
4.1.3: Rule (c): attributes in clusters
Each cluster's data will in turn be coded as a hash with the cluster attributes as keys, similar to rule (a). Cluster attributes will have the cluster prefix (e.g. author-) stripped off. For instance, $::HashT{'author'}->[0]->{'name'} will give you the first author's name (i.e. the value contained in the "Author-Name" field). Note that the -> arrows between subscripts are optional. If there is more than one author, then $::HashT{'author'}[1]{'postal'} will give the second author's address, i.e. the "author-postal" attribute in the second author cluster. This rule is valid for all clusters at all levels. If one cluster is nested within another (like the "workplace-(ORGANIZATION*)" cluster in a "(PERSON*)" cluster), then the outer cluster's hash will give access to the second-level cluster hash. A long expression like $::HashT{'editor'}[1]{'workplace'}[0]{'postal'} would specify the postal address of the first workplace of the second editor of a series.
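To make the nesting concrete, here is a small sketch that walks a two-level structure of the shape described by rules (a) to (c). It does not call rr.pm: the hash is filled by hand to stand in for what rr::NextTemplate would provide, and every value in it is invented for illustration.

```perl
use strict;
use warnings;

# Hand-built sample in the shape described above; all values are invented.
%::HashT = (
    'handle' => 'RePEc:aaa:series',
    'editor' => [
        { 'name'      => 'First Editor',
          'workplace' => [ { 'name' => 'Institute A', 'postal' => '1 Example Road' } ] },
        { 'name'      => 'Second Editor',
          'workplace' => [ { 'name' => 'Institute B', 'postal' => '2 Sample Street' } ] },
    ],
);

my $T = \%::HashT;

# A missing cluster is treated as an empty list, so the loops are safe
# for templates that have no editors or no workplaces.
my @editors = @{ $T->{'editor'} || [] };
print "The series has ", scalar(@editors), " editor(s)\n";

for my $ed (@editors) {
    print "Editor: $ed->{'name'}\n";
    for my $wp (@{ $ed->{'workplace'} || [] }) {
        print "  Workplace address: $wp->{'postal'}\n";
    }
}

# The long expression from the text: first workplace of the second editor.
my $postal = $::HashT{'editor'}[1]{'workplace'}[0]{'postal'};
print "$postal\n";
```

Iterating with foreach is usually safer than indexing directly, because an index such as [1] silently yields an undefined value when a template has only one editor.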
4.1.4: Other hash %::HashT elements
The hash will also contain some additional information that is local to your site. Uppercase letters are used for this purpose. At the moment the following information will be provided:
$::HashT{'FILENAME'} is the name of the file from which the template has been read
$::HashT{'STARTFPOS'} is the starting position of the template in the file
$::HashT{'BUFFER'} will be assigned a value only if the Buffer option is turned on. It will contain a multi-line string with the full text of the preprocessed ReDIF template. This is how the second form of the data is accessed.
At the sub-hash (cluster) level there is one more technical uppercase key: PREFIX, e.g. $T->{'file'}[0]{'PREFIX'} or $T->{'author'}[0]{'PREFIX'}. This key stores the prefix of the cluster's attributes: 'file-' and 'author-' respectively for these examples. The prefix and the cluster's hash keys may be used to reconstruct the original attribute names of the template by concatenating them into one string.
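The concatenation of prefix and key might look as follows. This is a sketch under stated assumptions: the helper original_attributes is hypothetical, the cluster hash is built by hand with invented values, and nested clusters are skipped for brevity. Since keys are stored in lowercase, the rebuilt names are lowercase too.

```perl
use strict;
use warnings;

# Hand-built author cluster; the values are invented for illustration.
my $author = {
    'PREFIX' => 'author-',
    'name'   => 'A. Author',
    'email'  => 'author@example.org',
};

# Hypothetical helper: rebuild "attribute: value" lines for one cluster.
sub original_attributes {
    my ($cluster) = @_;
    my $prefix = defined $cluster->{'PREFIX'} ? $cluster->{'PREFIX'} : '';
    my @lines;
    for my $key (sort keys %$cluster) {
        next if $key =~ /^[A-Z]+$/;      # technical uppercase keys such as PREFIX
        next if ref $cluster->{$key};    # nested clusters are left out of this sketch
        push @lines, $prefix . $key . ': ' . $cluster->{$key};
    }
    return @lines;
}

print "$_\n" for original_attributes($author);
```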
Other uppercase keys of %::HashT may be used for internal or other purposes in the future as software development goes on. User scripts can use the keys of %::HashT listed above, but should ignore any other uppercase keys.
With the second form of data presentation you get the whole template in one string variable. Each line contains one attribute, each attribute is separated from its value by a ': ' combination, and lines are separated from each other by a newline character. Extra whitespace, tab and newline characters are removed. All multi-line values are converted to single-line values. All attribute: value pairs come checked and pre-processed.
By default this data supply method is disabled. To enable it, you need to turn on the 'Buffer' option in %rr::Options (before opening a file), for example
$rr::Options{'Buffer'} = 1;
rr::OpenRDF( $file ) || die;
After a successful rr::NextTemplate, you get the template as a string in $::HashT{'BUFFER'}. For example,
print "\$::HashT\{'BUFFER'} = '$::HashT{'BUFFER'}' ;\n";
will produce output like the following:
$::HashT{'BUFFER'} = 'template-type: ReDIF-Series 1.0
name: CEP Discussion Papers
description: Discussion papers on Macroeconomics and Labour Economics
keywords: Macroeconomics, Labour
editor-name: Richard Layard
publisher-name: Centre for Economic Performance and ESRC
publisher-homepage: http://cep.lse.ac.uk/
maintainer-name: Anita Bardhan-Roy
maintainer-email: a.bardhan-roy@lse.ac.uk
handle: RePEc:cep:cepdps' ;
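Because the buffer has a regular one-attribute-per-line layout, it is easy to take apart again. The sketch below assumes exactly the layout described above (attribute, then ': ', then a single-line value); the helper parse_buffer is hypothetical and the sample string is an excerpt of the output shown above rather than data read through rr.pm.

```perl
use strict;
use warnings;

# Excerpt of the sample output above, assembled by hand.
my $buffer = join "\n",
    'template-type: ReDIF-Series 1.0',
    'name: CEP Discussion Papers',
    'publisher-homepage: http://cep.lse.ac.uk/',
    'handle: RePEc:cep:cepdps';

# Hypothetical helper: split a buffer into [attribute, value] pairs.
# A list of pairs is used instead of a hash because cluster attributes
# such as author-name may occur more than once in a template.
sub parse_buffer {
    my ($text) = @_;
    my @pairs;
    for my $line (split /\n/, $text) {
        # Limit of 2: only the first ': ' separates attribute from value.
        my ($attr, $value) = split /: /, $line, 2;
        next unless defined $value;
        push @pairs, [ $attr, $value ];
    }
    return @pairs;
}

for my $pair (parse_buffer($buffer)) {
    print "attribute=$pair->[0] value=$pair->[1]\n";
}
```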
The user of the rr.pm module can influence some aspects of the way it works. We have already mentioned several of the options. Here is a full description.
Option: 'HashT'
Default: enabled (1)
Meaning: This option sets whether to build the %::HashT variable from the template attributes and values. If it is enabled, the full template data will be put into %::HashT. If disabled, only the FILENAME, STARTFPOS and BUFFER keys will have a value in %::HashT.
Option: 'Buffer'
Default: disabled (0)
Meaning: This option sets whether you want to get the full text of a template in $::HashT{'BUFFER'}.
You may enable both options, but we recommend enabling only what you need, to avoid performance losses.
Option: 'BufferEmpty'
Default: disabled (0)
Meaning: If 'Buffer' is enabled, this option sets how to treat empty attributes (those with a null value). If enabled, empty-value attributes will be included in $::HashT{'BUFFER'} and as lower-case attribute keys $::HashT{...}; otherwise they will be ignored as meaningless.
Option: 'ReadX-Attr'
Default: disabled (0)
Meaning: This option determines whether to process and show the so-called X-attributes. X-attributes are the attributes that start with the 'X-' sequence. If disabled, X-attributes will be ignored. If enabled, they will be included in $::HashT{'BUFFER'} and as lower-case attribute keys $::HashT{...}.
There are some other options used by the rech and rere scripts. They should not be of interest to rr.pm users.
When you use the rr::OpenDir and rr::OpenDirTree functions to access a set of ReDIF files, you may set a filter for the templates. Such a filter guarantees that while going through an opened data stream with rr::NextTemplate, you will only get the templates which meet certain criteria. For example, you may want to choose templates by type: archive, series, paper, article, software and so on.
To execute such a search with a filter, you prepare a function that checks the criteria you need and returns the result. If the template is fine, it returns true (e.g. 1); if not, it returns zero or the undefined value.
When you call the rr::NextTemplate subroutine, it will find the next piece of data for you and then run the filter you have set. If the filter returns true, then rr::NextTemplate will allow your programme to process the template. If the filter returns a false value, rr::NextTemplate will look for the next template to offer.
Here is a small example of using the filter. It should make things clearer.
# $RepecRemo = mirrored RePEc archives data directory
#
# this is a filter function that checks a criterion
sub articlefilter {
    return 1 if $T->{'template-type'} eq 'redif-article 1.0';
    return 0;
}
sub checkOpenDirTree {
    my ($f, $c);
    my $l = length($RepecRemo) + 1;
    $T = \%::HashT;
    # executing a search with a filter here:
    $f = rr::OpenDirTree( "$RepecRemo/cre/", $l, \&articlefilter );
    print "\nOpenDirTree $RepecRemo/cre/ : found ", $f, " .RDF file entries\n";
    # processing ... (only article templates will get here)
    while ($c = rr::NextTemplate) {
        print $T->{'FILENAME'}, ' : ', $T->{'handle'}, "\n";
        $count++;
    }
}
This checkOpenDirTree subroutine will search for ReDIF files in the cre archive directory and its subdirectories and will report the filename and handle of each "article" template found.
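A filter does not have to be tied to a single type. The sketch below builds a filter for any list of template types; make_type_filter is a hypothetical helper, not part of rr.pm. Like the filter above, it reads the current template from %::HashT and takes no arguments, since this page does not say whether rr.pm passes any. The comparison is case-insensitive because it is an assumption here how the type value is cased. The loop at the end only stands in for the data stream so that the sketch runs without the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return a filter that accepts any of the given types.
sub make_type_filter {
    my %wanted = map { lc($_) => 1 } @_;
    return sub {
        my $type = $::HashT{'template-type'};
        $type = '' unless defined $type;
        return $wanted{ lc $type } ? 1 : 0;
    };
}

my $filter = make_type_filter('redif-paper 1.0', 'redif-article 1.0');

# Stand-in for the data stream: set the current template by hand and
# ask the filter about it, as rr::NextTemplate is described as doing.
my @results;
for my $type ('ReDIF-Paper 1.0', 'ReDIF-Series 1.0', 'ReDIF-Article 1.0') {
    %::HashT = ( 'template-type' => $type );
    push @results, $filter->();
}

print "@results\n";   # prints 1 0 1
```

Since the returned value is a code reference, it could presumably be passed as the filtersub parameter in place of \&articlefilter.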
If you are writing an application that needs to work in a ReDIF environment and to get the same installation and configuration information as the included software does, you may use the Perl module ReDIF::init.pm. The initialize() function from that package will look for configuration information saved in a Perl-reachable module and, if necessary, will analyze the @ARGV array for command-line arguments, the environment variables and the current working directory. It can help your program to identify the ReDIF home directory and its type, i.e. Guildford protocol compliant or stand-alone. It will find the configuration file for your program in the appropriate directory (if it needs one).
It will report the main configuration values it has worked out to the user (unless you make it silent, which is easy) and will save them for you in the %ReDIF::CONFIG hash variable. It will not import any symbols into your package's namespace (unless you ask it to).
For a detailed discussion of this module please see its manpage (man ReDIF::init) or the pod data in lib/ReDIF/init.pm.