Reading ReDIF files: the rr.pm module
rr.pm is a Perl module for reading ReDIF data. It originates from the rere script by Ivan Kurmanov, a script that read and checked a single ReDIF file. rere was pretty universal and easy to use, but it had some disadvantages.
Now you may use an alternative way to read RePEc data. The module rr.pm was designed for this purpose. You set the options that control how you would like to get the results and then go ahead with reading. Most options can be left at their default values in most cases. You start reading the data by calling a function of rr.pm. To call a function from rr.pm, you need to make Perl find the module with a use or require statement. You can read the data of a specific file, or you can ask rr.pm to search a single directory or a whole directory tree for ReDIF (.RDF) files and read all of them.
Then you can get the data, template by template, always only one template at a time. Only valid, checked and pre-processed ReDIF templates will be delivered to your programme. You call a special function each time you want to process the next template, including right at the start, when you process the first template.
You will not get the template data as the return value of that function. The template data will be prepared for you in a global hash variable in the main package namespace, called %::HashT. By checking this hash's keys and values you get the data. Apart from the template itself, the hash will contain some additional information: the name of the file the template has been read from, the file position at which the template starts, et cetera.
The module provides its ReDIF reading interface through the functions OpenRDF, OpenDir, OpenDirTree and NextTemplate.
The first three of these open the data source (of different types) and initiate reading. NextTemplate iterates through the stream of data opened by an Open... function, allowing you to access the actual data, template by template.
OpenRDF FILENAME
OpenRDF FILENAME, FILEPOS
OpenRDF FILENAME, FILEPOS, SHOWFILENAME
This is a basic function, which initiates reading of a ReDIF file FILENAME.
It returns 1 if successful, 0 if not.
The FILEPOS parameter (an integer) makes reading start from a specific position in the file. This may be useful if you need to get a specific template as quickly as possible and you know its starting position. By default the file is read from the beginning.
SHOWFILENAME is how the file should be referred to (logically). For example, this will be the name in $::HashT{'FILENAME'}.
Imagine that you read a file called /home/tim/RePEc/bob/bobseri.rdf. You may want it to be referred to as just bob/bobseri.rdf, because /home/tim/RePEc/ is not important and will be the same for all ReDIF files being read.
&rr::OpenRDF( '/home/tim/RePEc/bob/bobseri.rdf', 0, 'bob/bobseri.rdf');
One may say it is strange and nonsense. I would not argue -- just don't use it, if you don't find it useful.
By default, the logical name is the same as the physical FILENAME.
OpenDir DIRECTORY
OpenDir DIRECTORY, SHOWPREFIXLENGTH
OpenDir DIRECTORY, SHOWPREFIXLENGTH, FILTERSUB
Whereas the OpenRDF function starts a one-file data stream, this function opens a whole directory of RDF files for subsequent reading, template by template, without bothering the user with opening specific files. So it is a higher-level function. The exact name of the file a template has been read from can be taken from the $::HashT{'FILENAME'} variable at each specific template. Only files with the 'rdf' extension (case insensitive) are taken.
It returns the number of ReDIF files found if successful, 0 if not.
You give the pathname of the directory to read in DIRECTORY.
The SHOWPREFIXLENGTH parameter serves the same purpose as the SHOWFILENAME parameter of OpenRDF, but its value sets how many of the leading characters of DIRECTORY to strip as meaningless to get a nice logical name. So if you read, say, the directory /home/tim/RePEc/bob/dylan0 and you want the logical names of the files to be bob/dylan0/..., then give the length of /home/tim/RePEc/ as SHOWPREFIXLENGTH, e.g.:
rr::OpenDir ('/home/tim/RePEc/bob/dylan0', length ('/home/tim/RePEc/'));
By default, the logical name is the same as the physical full filename.
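The effect of SHOWPREFIXLENGTH can be pictured with plain string operations. The sketch below is not rr.pm code: it only assumes that the logical name is the physical name with the first N characters removed, as described above. The helper logical_name and the file name dylan1.rdf are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the first $prefix_length characters of a
# physical filename to obtain the logical name.
sub logical_name {
    my ($physical, $prefix_length) = @_;
    $prefix_length = 0 unless defined $prefix_length;   # default: keep the full name
    return substr($physical, $prefix_length);
}

my $prefix   = '/home/tim/RePEc/';
my $physical = '/home/tim/RePEc/bob/dylan0/dylan1.rdf';   # made-up file name

my $logical = logical_name($physical, length $prefix);
print "$logical\n";
```

Passing length('/home/tim/RePEc/') rather than a hard-coded number keeps the call correct if the base directory changes.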
If the FILTERSUB parameter is specified, it sets a user-defined filter function that selects the templates meeting specific criteria from the whole stream. This feature is described below in a separate section.
OpenDirTree DIRECTORY
OpenDirTree DIRECTORY, SHOWPREFIX
OpenDirTree DIRECTORY, SHOWPREFIX, FILTERSUB
This function is very much like the previous one, but it searches the whole sub-directory tree for RDF files. Similarly, it returns the number of ReDIF files found if successful, 0 if not.
The SHOWPREFIX and FILTERSUB parameters have the same meaning as the corresponding parameters of the OpenDir function.
NextTemplate
This function advances the current (previously opened) data stream to the next template. It returns 1 when successful, 0 if it fails. For a one-file stream (OpenRDF), 0 returned from NextTemplate means that you have reached the end of the file; for multi-file streams (OpenDir and OpenDirTree) it means that you have passed the last correct template of the last ReDIF file found.
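A typical reading loop calls NextTemplate until it returns 0. The sketch below tallies templates by type. To stay runnable without rr.pm installed, it takes the iterator and the data hash as arguments and is exercised here with a stub. The helper tally_types and the stub data are hypothetical; the 'template-type' key is the lower-case %::HashT key described in the hash rules of this document.

```perl
use strict;
use warnings;

# Hypothetical helper: call $next until it returns false and count
# the templates seen, grouped by the 'template-type' key of %$hash.
sub tally_types {
    my ($next, $hash) = @_;
    my %count;
    while ( $next->() ) {
        my $type = $hash->{'template-type'};
        $type = '(unknown)' unless defined $type;
        $count{ lc $type }++;
    }
    return \%count;
}

# Stand-in for an opened stream: three fake templates.
my @fake = ( 'ReDIF-Paper 1.0', 'ReDIF-Paper 1.0', 'ReDIF-Series 1.0' );
my %fake_hash;
my $fake_next = sub {
    return 0 unless @fake;
    %fake_hash = ( 'template-type' => shift @fake );
    return 1;
};

my $tally = tally_types( $fake_next, \%fake_hash );
print "$_: $tally->{$_}\n" for sort keys %$tally;
```

With rr.pm loaded and a stream opened by one of the Open... functions, the same helper would presumably be called as tally_types(\&rr::NextTemplate, \%::HashT).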
There are two ways of accessing the data in your scripts. You may choose to use one of them or both (or none) by setting the appropriate options in the hash %rr::Options. The selected modes influence the way reading goes internally, and each way of accessing the data takes some run-time resources to prepare.
The two major forms of data presentation are full text (rere-like) in one multi-line string variable, and a nested hash variable structured to reflect the template logic, i.e. its clusters. Both forms allow processing of only one template at a time.
Let's discuss the hash form first.
When this method (data form) is enabled, after a successful NextTemplate call, %::HashT will be filled with data according to the following rules.
4.1.1: Rule (a): non-cluster attributes
Simple (non-cluster) attributes of the template just read become keys of %::HashT, and the attribute values become the values associated with those keys. The names of the keys are converted to lowercase. For example, Handle is a simple, non-cluster attribute. After reading a template, $::HashT{'handle'} (mind the lower case!) will always give you the handle. For other simple attributes like Title, Abstract, Creation-Date (for Paper or Article templates), Name (Archive, Series), Template-Type and many others, it is exactly the same. Example:

    print "Paper: $::HashT{'title'} ($::HashT{'handle'})\n";
    ...
    print "Archive name: $::HashT{'name'}\n";

If $T is a reference to the hash, then you may use it to access the data from %::HashT while writing a bit less code. For example:

    # Assigning to the $T variable a value of
    # the reference to %::HashT
    $T = \%::HashT;
    ...
    # some rr::Open... function call is supposed here
    ...
    return if !&rr::NextTemplate;   # loading a new template
    # checking the template type:
    if ($T->{'template-type'} eq 'redif-paper 1.0') {
        ...
        print "Keywords: $T->{'keywords'}\n";
    } elsif ($T->{'template-type'} eq 'redif-archive 1.0') {
        ...
        print "Archive description: $T->{'description'}\n";
    }

This piece of code shows a reference to %::HashT being used instead of %::HashT itself. After reading the next template, the script processes it depending on its type.
4.1.2: Rule (b): clusters
Clusters of templates are repeatable. They may be repeated several times in one template, and the same attributes of each repeated cluster should not be mixed up, as they only make sense within their cluster. Each cluster of a template read by rr.pm will be represented as an array element. For example, a document's authors, referred to by the Author-Name attribute and more generally described by the Author-(*PERSON) cluster, will be represented in the hash %::HashT as elements of the @{ $::HashT{'author'} } array.
This means that the 'author' key of %::HashT will hold a reference to an array which has as many entries as there were authors. Each entry will be a reference to an independent hash containing that author's (*PERSON) cluster data.
For example,

    ...
    # assume a paper template has been successfully loaded
    print "The paper has ", $#{ $::HashT{'author'} } + 1, " author(s):\n";
    # now iterate through each author
    foreach $au ( @{ $::HashT{'author'} } ) {
        # now $au contains a reference to an author's data
        ...
        $au -> ...
        ...
    }

or

    # initializations
    $T = \%::HashT;
    $authors = $T->{'author'};
    # an effort to be correct in English: checking the number of authors
    if ($#$authors > 0) { $suffix = 's'; }
    else                { $suffix = '';  }
    print "Author$suffix:\n";
    # same as in previous example:
    foreach $au ( @$authors ) {
        ....
    }

4.1.3: Rule (c): attributes in clusters
Each cluster's data will also be coded as a hash with the cluster attributes as keys, similar to rule (a). Cluster attributes will have the cluster prefix (e.g. author-) stripped off. For instance, $::HashT{'author'}->[0]->{'name'} will give you the first author's name (the Author-Name attribute's value). Note that the -> arrows are here only for verbosity; Perl allows you to omit them between subscripts. If there is more than one author, then $::HashT{'author'}[1]{'postal'} will give the second author's address (i.e. the author-postal attribute in the second author cluster).
This rule is valid for all clusters at all levels. This means that if we have a cluster in a cluster (like a workplace-(*ORGANIZATION) cluster in a (*PERSON) cluster), then the outer cluster's own hash will point you to the second-level cluster hash. A long expression like $::HashT{'editor'}[1]{'workplace'}[0]{'postal'} gives the address of the first workplace of the second editor of a series.
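The nesting rules can be tried out on a hand-built structure. The sketch below does not use rr.pm: %template is a stand-in for %::HashT as it might look after reading a series template, and every name, address and handle in it is invented for illustration (including the 'name' key inside the workplace cluster, which is assumed by analogy with rule (c)).

```perl
use strict;
use warnings;

# Hand-built stand-in for %::HashT; all values are invented.
my %template = (
    'template-type' => 'redif-series 1.0',
    'handle'        => 'RePEc:aaa:wpaper',
    'editor'        => [
        {   'name'      => 'First Editor',
            'workplace' => [ { 'name' => 'Institute A', 'postal' => '1 Main Street' } ],
        },
        {   'name'      => 'Second Editor',
            'workplace' => [
                { 'name' => 'Institute B', 'postal' => '2 High Street' },
                { 'name' => 'Institute C', 'postal' => '3 Low Street' },
            ],
        },
    ],
);

# Collect "editor / workplace / postal" lines; missing clusters are skipped.
my @lines;
for my $ed ( @{ $template{'editor'} || [] } ) {
    for my $wp ( @{ $ed->{'workplace'} || [] } ) {
        push @lines, "$ed->{'name'} / $wp->{'name'} / $wp->{'postal'}";
    }
}
print "$_\n" for @lines;

# The long expression from the text: first workplace of the second editor.
my $postal = $template{'editor'}[1]{'workplace'}[0]{'postal'};
print "Second editor, first workplace: $postal\n";
```

The `|| []` guards are a design choice for the sketch: they let the loops run safely on templates that have no editor or workplace cluster at all.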
4.1.4: Other hash %::HashT elements
The hash will also contain some additional information that is local to your site. Upper-case keys are used for these purposes. At the moment the following information will be provided:
$::HashT{'FILENAME'} - the name of the file the template has been read from
$::HashT{'STARTFPOS'} - an integer giving the starting position of the template in the file
$::HashT{'BUFFER'} - this key is assigned a value only if the 'Buffer' option is turned on. It contains a multi-line string with the full text of the pre-processed ReDIF template just read. This is how the second form of data is accessed.
One more technical upper-case key is used at the sub-hash (cluster) level: PREFIX, e.g. $T->{'file'}[0]{'PREFIX'} or $T->{'author'}[0]{'PREFIX'}. This key gives you the prefix of the cluster's attributes: 'file-' and 'author-' respectively for the examples above. You may use the prefix and the cluster's hash keys to get the original attributes of the template by joining them into one string.
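Joining PREFIX with the cluster keys might look like the sketch below. It works on a hand-built cluster hash rather than on real rr.pm output; the helper original_attributes and the author data are hypothetical, and the rebuilt names come out in lower case because the hash keys are lower case.

```perl
use strict;
use warnings;

# Hand-built stand-in for one author cluster hash; values are invented.
my %author = (
    'PREFIX' => 'author-',
    'name'   => 'A. N. Author',
    'email'  => 'author@example.org',
);

# Rebuild "prefix + key" for every lower-case key. Upper-case keys such
# as PREFIX are technical and skipped. Keys whose value is a reference
# (nested clusters) are skipped too, to keep the sketch short.
sub original_attributes {
    my ($cluster) = @_;
    my $prefix = defined $cluster->{'PREFIX'} ? $cluster->{'PREFIX'} : '';
    my %out;
    for my $key ( keys %$cluster ) {
        next if $key =~ /^[A-Z]/;
        next if ref $cluster->{$key};
        $out{ $prefix . $key } = $cluster->{$key};
    }
    return \%out;
}

my $attrs = original_attributes( \%author );
print "$_: $attrs->{$_}\n" for sort keys %$attrs;
```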
More upper-case keys of %::HashT may be used for internal or other purposes in the future as software development goes on. So user scripts can use the upper-case keys of %::HashT listed above, but should ignore any others.
Rere-like output of rr.pm is an option which was primarily implemented for an easy upgrade from rere to rr.pm. You may nevertheless rely on it as much as you like - I don't think it is going to be modified in the future. With this method you get the whole template in one string variable: each line contains one attribute, each attribute is separated from its value by a ': ' combination, and the lines are separated from each other by a "\n" character, with no extra spaces, tabs or new-line characters. All multi-line values are converted to single-line ones. All attribute/value pairs are checked and pre-processed.
By default this data supply method is disabled. To enable it you need to turn on the 'Buffer' option in %rr::Options (before opening a file):

    $rr::Options{'Buffer'} = 1;
    ...
    &rr::OpenRDF ( $file ) || die;

You can switch this option between reading different files if you need, but I can't guarantee you positive results if you switch it while reading one file.
After a successful &rr::NextTemplate, you get the template in a string, as described above, in $::HashT{'BUFFER'}.
For example,

    print "\$::HashT\{'BUFFER'} = '$::HashT{'BUFFER'}' ;\n";

will produce the following output (for instance):

    $::HashT{'BUFFER'} = 'template-type: ReDIF-Series 1.0
    name: CEP Discussion Papers
    description: Discussion papers on Macroeconomics and Labour Economics
    keywords: Macroeconomics, Labour
    editor-name: Richard Layard
    publisher-name: Centre for Economic Performance and ESRC
    publisher-homepage: http://cep.lse.ac.uk/
    maintainer-name: Anita Bardhan-Roy
    maintainer-email: a.bardhan-roy@lse.ac.uk
    handle: RePEc:cep:cepdps' ;

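Because the buffer has one "attribute: value" pair per line, a script can split it back into pairs with ordinary Perl. The sketch below assumes exactly that line layout and does not call rr.pm; the helper buffer_pairs is hypothetical, and the sample buffer is an abridged copy of the output shown above.

```perl
use strict;
use warnings;

# Split a rere-like buffer into [attribute, value] pairs, keeping the
# order of lines and any repeated attributes.
sub buffer_pairs {
    my ($buffer) = @_;
    my @pairs;
    for my $line ( split /\n/, $buffer ) {
        # Attribute names contain no colon, so match up to the first one;
        # values may contain colons (URLs, handles).
        next unless $line =~ /^([^:]+):\s*(.*)$/;
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

my $buffer = join "\n",
    'template-type: ReDIF-Series 1.0',
    'name: CEP Discussion Papers',
    'publisher-homepage: http://cep.lse.ac.uk/',
    'handle: RePEc:cep:cepdps';

my @pairs = buffer_pairs($buffer);
printf "%-20s => %s\n", @$_ for @pairs;
```

A list of pairs is used instead of a hash so that repeated attributes, such as several author-name lines, are not overwritten.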
As a user of the rr.pm module, you can influence some aspects of the way it works. I have already mentioned several of the options. Here is the full description:
'HashT'
Default: enabled (1)
Meaning: This option sets whether to build the %::HashT variable from the template attributes and values. If it is enabled, the full template data (attributes and values) will be put into %::HashT. If disabled, only the 'FILENAME', 'STARTFPOS' and 'BUFFER' keys will have a value in %::HashT.
'Buffer'
Default: disabled (0)
Meaning: This option sets whether you want to get the full text of a template in $::HashT{'BUFFER'}.
You may enable both of these options, but I recommend choosing only what is necessary, to avoid performance losses.
'BufferEmpty'
Default: disabled (0)
Meaning: If 'Buffer' is enabled, this option sets how to treat empty attributes (those with a null value). If enabled, empty-value attributes will be included in $::HashT{'BUFFER'} and as lower-case $::HashT{...} attributes; otherwise they will be ignored as meaningless.
'ReadX-Attr'
Default: disabled (0)
Meaning: Whether to process and show to your script the attributes starting with the 'X-' sequence. If disabled, X- attributes will be ignored. If enabled, they will be included in $::HashT{'BUFFER'} and as lower-case $::HashT{...} attributes.
Other options exist only for special purposes and are employed by me in the scripts rech and rere; normally they should not be of interest to rr.pm users.
When you use the OpenDir and OpenDirTree functions for accessing a bunch of ReDIF files, you may set a filter for the templates. Such a filter guarantees that, while going through an opened data stream with NextTemplate, you will only get the templates which meet certain criteria. For example, you may want to choose templates by type: archive, series, paper, article, software and so on.
To execute such a search with a filter, you prepare a function that checks the criteria you need and returns the result. If the template is fine, it returns true (e.g. 1); if not, it returns zero or the undefined value.
This is absolutely simple. When you call the NextTemplate subroutine, it finds the next piece of data for you and then runs the filter you have set. If the filter says OK, then NextTemplate lets your programme process the template. If the filter refuses by returning a false value, NextTemplate looks for the next template to offer. So in fact, setting a filter may cause the module to read many templates in vain, wasting precious run-time resources. But some people may find it worthwhile from the point of view of programming effort: it is convenient, easy, etc.
Here is a small example of using the filter. It should make things clearer.

    # $RepecRemo = mirrored RePEc archives data directory

    # this is a filter function that checks a criterion
    sub articlefilter {
        return 1 if $T->{'template-type'} eq 'redif-article 1.0';
        return 0;
    }

    sub checkopendirtree {
        my ($f, $c);
        my $l = length ($RepecRemo) + 1;
        $T = \%::HashT;
        # executing a search with a filter here:
        $f = &rr::OpenDirTree ( "$RepecRemo/cre/", $l, \&articlefilter );
        print "\nOpenDirTree $RepecRemo/cre/ : found ", $f, " .RDF file entries\n";
        # processing ... (only article templates will get here)
        while ($c = &rr::NextTemplate) {
            print $T->{'FILENAME'}, ' : ', $T->{'handle'}, "\n";
            $count++;
        }
    }

This checkopendirtree subroutine will look for ReDIF files in the cre archive directory and its subdirectories, and will report the filename and handle of each article template found (if there are any).
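The cost of filtering mentioned earlier, that rejected templates are still read, can be seen in a small stand-alone model. This is not rr.pm's actual implementation, only a sketch of the behaviour described in this section: next_accepted, the fake stream and its handles are all hypothetical, and %T plays the role of %::HashT.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the filtering loop: keep pulling templates
# from $next until $filter accepts one or the stream ends.
sub next_accepted {
    my ($next, $filter) = @_;
    while ( $next->() ) {
        return 1 if !defined $filter || $filter->();
    }
    return 0;
}

# Fake stream of three templates; the handles are invented.
our %T;
my @stream = (
    { 'template-type' => 'redif-paper 1.0',   'handle' => 'RePEc:aaa:wpaper:1' },
    { 'template-type' => 'redif-article 1.0', 'handle' => 'RePEc:aaa:journl:1' },
    { 'template-type' => 'redif-paper 1.0',   'handle' => 'RePEc:aaa:wpaper:2' },
);
my $read = 0;    # how many templates were read, accepted or not
my $next = sub {
    return 0 unless @stream;
    %T = %{ shift @stream };
    $read++;
    return 1;
};
my $article_filter = sub { $T{'template-type'} eq 'redif-article 1.0' };

my @handles;
push @handles, $T{'handle'} while next_accepted( $next, $article_filter );
print "accepted: @handles (read $read templates)\n";
```

In this model one template is delivered to the caller while three are read, which is the trade-off between convenience and run time that the text describes.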