RePEc

Reading ReDIF files: the rr.pm module

Ivan Kurmanov

1: Historical introduction

rr.pm is a Perl module for reading ReDIF data. It originates from the rere script by Ivan Kurmanov, a script that read and checked a single ReDIF file. rere was pretty universal and easy to use, but it also had some disadvantages.

2: Module rr.pm

Now you may use an alternative way to read RePEc data: the module rr.pm, which was designed for this purpose. You set the options that control how you would like to get the results and then start reading; in most cases most options can be left at their default values. Reading starts when you call one of rr.pm's functions. To call a function from rr.pm, you first need to make Perl find the module with a use or require statement. You can read the data of a specific file, or you can ask rr.pm to search a single directory or a whole directory structure for ReDIF (.RDF) files and read all of them.

Then you can get the data, template by template, always only one template at a time. Only valid, checked and pre-processed ReDIF templates will be delivered to your programme. You call a special function each time you want to process the next template, including right at the start, when you process the first template.

You will not get the template data as the return value of the function. The template data will be prepared for you in a global hash variable in the main package namespace, called %::HashT. By checking this hash's keys and values you get the data. Apart from the template itself, the hash will contain some additional information: the filename the template has been read from, the file position the template started at, et cetera.

3: The functions provided by rr.pm

The module provides its ReDIF reading interface through four functions: OpenRDF, OpenDir, OpenDirTree and NextTemplate. The first three open a data source (of different types) and so initiate reading. NextTemplate iterates through the stream of data opened by an Open... function, allowing you to access the actual data, template by template.

3.1: OpenRDF

Use:
OpenRDF  FILENAME
OpenRDF  FILENAME, FILEPOS
OpenRDF  FILENAME, FILEPOS, SHOWFILENAME

This is a basic function, which initiates reading of a ReDIF file FILENAME.

It returns 1 if successful, 0 if not.

The FILEPOS parameter (an integer) is used to read the file from a specific position. This may be useful if you need to get a specific template as quickly as possible and you know its starting position. By default the file is read from the beginning.

SHOWFILENAME is how the file should be referred to (logically). For example, this will be the name in the $::HashT{'FILENAME'}. Imagine that you read a file called /home/tim/RePEc/bob/bobseri.rdf. You may want it to be referred to as just bob/bobseri.rdf, because /home/tim/RePEc/ is not important and will be the same for all ReDIF files being read.

&rr::OpenRDF( '/home/tim/RePEc/bob/bobseri.rdf', 0, 'bob/bobseri.rdf');

One may say it is strange and nonsense. I would not argue -- just don't use it, if you don't find it useful.

By default, the logical name is the same as physical FILENAME.

3.2: OpenDir

OpenDir  DIRECTORY
OpenDir  DIRECTORY, SHOWPREFIXLENGTH
OpenDir  DIRECTORY, SHOWPREFIXLENGTH, FILTERSUB

While the OpenRDF function starts a one-file data stream, this function opens a whole directory of RDF files for subsequent reading, template by template, without bothering the user with opening specific files. So it is a higher-level function. The exact filename a template has been read from can be taken from the $::HashT{'FILENAME'} variable at each template. Only files with the 'rdf' extension (case insensitive) are taken.

It returns the number of ReDIF files found if successful, 0 if not. You give the pathname of the directory to read in DIRECTORY.

The SHOWPREFIXLENGTH parameter serves the same purpose as the SHOWFILENAME parameter of OpenRDF, but its value sets how many of the leading characters of the physical path to strip as meaningless in order to get a nice logical name. So if you read, say, the directory /home/tim/RePEc/bob/dylan0 and you want the logical names of the files to be bob/dylan0/..., then give the length of /home/tim/RePEc/ as SHOWPREFIXLENGTH, e.g.:

rr::OpenDir ('/home/tim/RePEc/bob/dylan0' , length ('/home/tim/RePEc/') );

By default, the logical name is the same as physical full filename.
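
The effect of the prefix length can be pictured with plain Perl string functions. The following is only a sketch of the idea, not rr.pm's actual code: the helper logical_name and the file name dylan1.rdf are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of rr.pm): drop the first $prefix_length
# characters of a physical path to obtain the logical name.
sub logical_name {
    my ($physical, $prefix_length) = @_;
    return substr($physical, $prefix_length);
}

my $prefix = '/home/tim/RePEc/';
my $file   = '/home/tim/RePEc/bob/dylan0/dylan1.rdf';   # invented file name

print logical_name($file, length($prefix)), "\n";   # prints bob/dylan0/dylan1.rdf
```

Passing the result of length() rather than a hand-counted number keeps the call correct when the base directory changes.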

If the FILTERSUB parameter is specified, it sets a user-defined filter function that selects templates meeting specific criteria (for example, a specific type) from the whole stream. This feature is described below in a separate section.

3.3: OpenDirTree

OpenDirTree  DIRECTORY
OpenDirTree  DIRECTORY, SHOWPREFIX
OpenDirTree  DIRECTORY, SHOWPREFIX, FILTERSUB

This function is very much like the previous one, but it searches the whole sub-directory tree for RDF files. Similarly, it returns the number of ReDIF files found if successful, 0 if not.

The SHOWPREFIX and FILTERSUB parameters have the same meaning as SHOWPREFIXLENGTH and FILTERSUB for the OpenDir function.

3.4: NextTemplate

This function advances the current (previously opened) data stream to the next template. It returns 1 when successful, 0 if it fails. For a one-file stream opened with OpenRDF, a 0 returned from NextTemplate means that you have reached the end of the file; for the multi-file streams of OpenDir and OpenDirTree it means that you have passed the last correct template of the last ReDIF file found.

4: How to access the data

There are two ways of accessing the data in your scripts. You may choose to use one of them or both (or none) by setting appropriate options in the hash %rr::Options. The selected modes will influence the way reading goes internally and each way of accessing the data takes some run-time resources to prepare.

There are two major forms of data presentation: full text (rere-like) in one multi-line string variable, and a nested hash variable structured to reflect the template logic, i.e. its clusters. Both forms allow processing of only one template at a time.

Let's discuss the hash form first.

4.1: %::HashT structure

When this method (data form) is enabled, after a successful NextTemplate function, %::HashT will be filled up with data according to the following rules.

4.1.1: Rule (a): non-cluster attributes

Simple (non-cluster) attributes of the read template become keys of the %::HashT and these keys have the values associated with them. The names of keys are converted to lowercase. For example, Handle is a simple, non-cluster attribute. After reading a template, $::HashT{'handle'} (mind lower case!) will always give you the handle. For other simple attributes like Title, Abstract, Creation-Date (for Paper or Article templates), Name (Archive, Series), Template-Type and many others, it is exactly the same. Example:

print "Paper: $::HashT{'title'} ($::HashT{'handle'})\n";
...
print "Archive name: $::HashT{'name'}\n";

If $T is a reference to the hash, you may use it to access the data in %::HashT while writing a bit less code. For example:

  # Assigning to the $T variable a value of
  # the reference to %::HashT
  $T = \%::HashT;

  ...
  # some rr::Open... function call is supposed here
  ...

  return if !&rr::NextTemplate;  # loading a new template

  # checking the template type:

  if ($T->{'template-type'} eq 'redif-paper 1.0') {
      ...
      print "Keywords: $T->{'keywords'}\n";

  } elsif ($T->{'template-type'} eq 'redif-archive 1.0') {
      ...
      print "Archive description: $T->{'description'}\n";
  }

This piece of code shows an example of using a reference to %::HashT instead of %::HashT itself. After reading the next template, the script processes it depending on its type.

4.1.2: Rule (b): clusters

Clusters of templates are repeatable. A cluster may be repeated several times in one template, and the attributes of each repeated cluster should not be mixed up, as they only make sense within their cluster. Each cluster of a template read by rr.pm will be represented as an array element. For example, a document's authors, referred to by the Author-Name attribute and more generally described by the Author-(*PERSON) cluster, will be represented in the hash %::HashT as elements of the @{ $::HashT{'author'} } array.

This means that the 'author' key of %::HashT will hold a reference to an array, which has as many entries as there were authors. Each entry will point to an independent hash containing that author's (*PERSON) cluster data.

For example,

  ...
  # assume a paper template has been successfully loaded
  print "The paper has " , scalar @{ $::HashT{'author'} }, " author(s):\n";

  # now iterate through each author
  foreach $au ( @{ $::HashT{'author'} } ) {
        # now $au contains a reference to an author's data
        ... $au -> ...
        ...
  }

or
  # initializations
  $T = \%::HashT;
  $authors = $T->{'author'};

  # an effort to be correct in English: checking the number of authors
  if ($#$authors > 0) { $suffix = 's'; }
  else { $suffix = ''; }

  print "Author$suffix:\n";

  # same as in previous example:
  foreach $au ( @$authors ) {
     ....
  }

4.1.3: Rule (c): attributes in clusters

The data of each cluster will also be coded as a hash with the cluster attributes as keys, similar to rule (a). Cluster attributes will have the cluster prefix (e.g. author-) stripped off. For instance, $::HashT{'author'}->[0]->{'name'} will give you the first author's name (the Author-Name attribute's value). Note that the -> arrows between the subscripts are here for verbosity; Perl allows you to omit them. If there is more than one author, then $::HashT{'author'}[1]{'postal'} will give the second author's address (i.e. the author-postal attribute in the second author cluster).

This rule is valid for all clusters at all levels. This means that if we have a cluster within a cluster (like the workplace-(*ORGANIZATION) cluster in a (*PERSON) cluster), then the outer cluster's own hash will point you to the second-level cluster hash. A long expression like $::HashT{'editor'}[1]{'workplace'}[0]{'postal'} would make sense as the address of the first workplace of the second editor of a series.
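
Rules (a) to (c) can be tried out without rr.pm by building a hash of the same shape by hand. The sketch below does just that: every value in it is invented, and the workplace 'name' key is an assumption that follows rule (c) by analogy with 'postal'. It shows only how such a structure is traversed, not what rr.pm produces for any real template.

```perl
use strict;
use warnings;

# Hand-built stand-in for %::HashT after a successful NextTemplate.
# All values are invented for illustration.
%::HashT = (
    'template-type' => 'redif-series 1.0',
    'handle'        => 'RePEc:aaa:bbbbbb',
    'name'          => 'Example Series',
    'editor'        => [
        { 'name' => 'First Editor' },
        {
            'name'      => 'Second Editor',
            'workplace' => [
                { 'name' => 'Some Institute', 'postal' => '1 Example Street' },
            ],
        },
    ],
);

# Rule (a): simple attributes are plain lower-case keys.
print "Series: $::HashT{'name'} ($::HashT{'handle'})\n";

# Rule (b): each repeated cluster is one element of an array.
my $editors = $::HashT{'editor'};
print scalar(@$editors), " editor(s)\n";

# Rule (c): cluster attributes lose their prefix; nested clusters nest.
foreach my $ed (@$editors) {
    print "  $ed->{'name'}\n";
    # an editor without a workplace cluster simply has no 'workplace' key
    foreach my $wp ( @{ $ed->{'workplace'} || [] } ) {
        print "    works at $wp->{'name'}, $wp->{'postal'}\n";
    }
}

# The long expression from the text, applied to the invented data:
print $::HashT{'editor'}[1]{'workplace'}[0]{'postal'}, "\n";
```

The `|| []` guard matters in practice: a cluster that does not occur in a template leaves no key behind, and dereferencing an undefined value as an array would fail under strict.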

4.1.4: Other hash %::HashT elements

The hash will also contain some additional information that is local to your site. Upper-case keys are used for these purposes. At the moment the following information will be provided:

$::HashT{'FILENAME'} - the name of the file that the template has been read from

$::HashT{'STARTFPOS'} - an integer number, signifying the starting position of the template in the file

$::HashT{'BUFFER'} - this key will be assigned a value only if the 'Buffer' option is turned on. It will contain a multi-line string with the full text of the pre-processed ReDIF template just read. This is actually how the second form of data can be accessed.

One more technical upper-case key is used at the sub-hash (cluster) level: PREFIX, e.g. $T->{'file'}[0]{'PREFIX'} or $T->{'author'}[0]{'PREFIX'}. This key gives you the prefix of the cluster's attributes: 'file-' and 'author-' respectively for the examples above. You may join the prefix with the keys of the cluster's hash to get the original attribute names of the template.
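
Joining the prefix and the keys might look like the sketch below. The helper original_attributes and the sample cluster are hypothetical; the sketch assumes that upper-case keys are technical and that nested clusters appear as references, as described above. Note that the rebuilt names are lower case, because the hash keys are.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild "prefix + key" attribute names for one
# cluster hash of the kind found in %::HashT.
sub original_attributes {
    my ($cluster) = @_;
    my %attr;
    foreach my $key (keys %$cluster) {
        next if $key =~ /^[A-Z]+$/;      # skip technical keys such as PREFIX
        next if ref $cluster->{$key};    # skip nested clusters (array references)
        $attr{ $cluster->{'PREFIX'} . $key } = $cluster->{$key};
    }
    return \%attr;
}

# Invented cluster data for illustration.
my $author = {
    'PREFIX' => 'author-',
    'name'   => 'A. N. Author',
    'email'  => 'author@example.org',
};

my $attrs = original_attributes($author);
printf "%s: %s\n", $_, $attrs->{$_} for sort keys %$attrs;
```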

More upper-case keys of %::HashT may be used for internal or other reasons in the future as software development goes on. So user scripts can use the keys of %::HashT as listed above, but should ignore any other of them.

4.2: rere-like output

The rere-like output of rr.pm is an option which was primarily implemented for an easy upgrade from rere to rr.pm. You may nevertheless rely on it as much as you like - I don't think it is going to be modified in the future. With this method you get the whole template in one string variable: each line contains one attribute, each attribute is separated from its value by a ': ' combination, and the lines are separated from each other by a "\n" character, with no extra spaces, tabulation or new-line characters. All multi-line values are converted to single-line. All attribute/value pairs are checked and pre-processed.

By default this data supply method is disabled. To enable this method you need to turn on the 'Buffer' option of the %rr::Options (before opening a file):

  $rr::Options {'Buffer'} = 1;
  ...
  &rr::OpenRDF ( $file ) || die;

You can switch this option between reading different files if you need, but I can't guarantee you positive results if you switch it while reading one file.

After a successful &rr::NextTemplate, you get a template in a string as described above in a $::HashT{'BUFFER'}.

For example

print "\$::HashT\{'BUFFER'} = '$::HashT{'BUFFER'}' ;\n"; 

will produce the following output (for instance):

$::HashT{'BUFFER'} = 'template-type: ReDIF-Series 1.0
name: CEP Discussion Papers
description: Discussion papers on Macroeconomics and Labour Economics
keywords: Macroeconomics, Labour
editor-name: Richard Layard
publisher-name: Centre for Economic Performance and ESRC
publisher-homepage: http://cep.lse.ac.uk/
maintainer-name: Anita Bardhan-Roy
maintainer-email: a.bardhan-roy@lse.ac.uk
handle: RePEc:cep:cepdps' ;
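
Since each line of the buffer holds one attribute and its value separated by ': ', a script can take the string apart again with split. The helper buffer_pairs below is a hypothetical sketch, not part of rr.pm; the sample string is a shortened copy of the output above.

```perl
use strict;
use warnings;

# Hypothetical helper: split a rere-like buffer into [attribute, value] pairs.
sub buffer_pairs {
    my ($buffer) = @_;
    my @pairs;
    foreach my $line (split /\n/, $buffer) {
        # split on the first ': ' only, since a value may itself contain ': '
        my ($attr, $value) = split /: /, $line, 2;
        push @pairs, [ $attr, defined $value ? $value : '' ];
    }
    return @pairs;
}

my $buffer = "template-type: ReDIF-Series 1.0\n"
           . "name: CEP Discussion Papers\n"
           . "handle: RePEc:cep:cepdps";

foreach my $pair ( buffer_pairs($buffer) ) {
    print "$pair->[0] => $pair->[1]\n";
}
```

A list of pairs rather than a hash is used here because clusters may repeat, so the same attribute name can occur more than once in one template.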

5: rr.pm Options

As a user of the rr.pm module, you can influence some aspects of the way it works. I have already mentioned several of the options. Here is the full description:

'HashT'

Default: enabled (1)

Meaning: This option sets whether to build the %::HashT variable from the template attributes and values. If it is enabled, the full template data (attributes and values) will be put into %::HashT. If disabled, only the 'FILENAME', 'STARTFPOS' and 'BUFFER' keys will have a value in %::HashT.

'Buffer'

Default: disabled (0)

Meaning: This option sets whether you want to get the full text of a template in $::HashT{'BUFFER'}. By default, it is disabled.

You may enable both of these options, but I recommend that you choose only what is necessary, to avoid performance losses.

'BufferEmpty'

Default: disabled (0)

Meaning: If 'Buffer' is enabled, this option sets how to treat empty attributes (those with a null value). If enabled, empty-value attributes will be included in $::HashT{'BUFFER'} and as lower-case $::HashT{...} attributes; otherwise they will be ignored as meaningless.

'ReadX-Attr'

Default: disabled (0)

Meaning: Whether to process and show to your script the attributes whose names start with the 'X-' sequence. If disabled, X- attributes will be ignored. If enabled, they will be included in $::HashT{'BUFFER'} and as lower-case $::HashT{...} attributes.

Other options exist only for special reasons and are used by me in the scripts rech and rere; normally they should not be of interest to rr.pm users.

6: Search filters

When you use the OpenDir and OpenDirTree functions for accessing a bunch of ReDIF files, you may set a filter for the templates. Such a filter guarantees that while going through an opened data stream with NextTemplate, you will only get the templates which meet certain criteria. For example, you may want to choose templates by type: archive, series, paper, article, software and so on.

To execute such a search with a filter, you prepare a function that checks the criteria you need and returns the result. If the template is fine, it returns true (e.g. 1); if not, it returns zero or the undefined value.

This is absolutely simple. When you call the NextTemplate subroutine, it will find the next piece of data for you and then run the filter you've set. If the filter says OK, NextTemplate will allow your programme to process the template. If the filter refuses with a false value, NextTemplate will look for the next template to offer. So in fact, setting a filter may cause the module to read many templates in vain, wasting precious run-time resources. But some people may find it efficient from the programming point of view: convenient, easy, etc.

Here is a small example of using the filter. It should make things clearer.

 # $RepecRemo = mirrored RePEc archives data directory

 # this is a filter function that checks a criterion
 sub articlefilter {
     return 1 if $T->{'template-type'} eq 'redif-article 1.0';
     return 0;
 }

 sub checkopendirtree {
     my ($f, $c);
     my $l = length ("$RepecRemo") + 1;
     $T = \%::HashT;

     # executing a search with a filter here:
     $f = &rr::OpenDirTree ( "$RepecRemo/cre/", $l, \&articlefilter )  ;

     print "\nOpenDirTree $RepecRemo/cre/ : found ", $f ,
          " .RDF file entries\n" ;

     # processing ... (only article templates will get here)
     while ($c = &rr::NextTemplate ) {
        print $T->{'FILENAME'} , ' : ' , $T->{'handle'}, "\n";
        $count ++;
     }
 }

This checkopendirtree subroutine will search for ReDIF files in the cre archive directory and its subdirectories and will report the filename and handle of each article template found (if there are any).
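
The interplay between NextTemplate and a filter, as described in this section, can be modelled in a few lines that run without rr.pm. This is a sketch of the described behaviour only, not the module's implementation: next_template_model, the in-memory @stream and all handles in it are invented.

```perl
use strict;
use warnings;

# Invented stand-in for a stream of templates; rr.pm reads these from files.
my @stream = (
    { 'template-type' => 'redif-paper 1.0',   'handle' => 'RePEc:aaa:bbbbbb:1' },
    { 'template-type' => 'redif-article 1.0', 'handle' => 'RePEc:aaa:cccccc:1' },
    { 'template-type' => 'redif-article 1.0', 'handle' => 'RePEc:aaa:cccccc:2' },
);

# Hypothetical model of NextTemplate: load templates into %::HashT until
# the filter accepts one; return 0 when the stream is exhausted.
sub next_template_model {
    my ($filter) = @_;
    while ( my $template = shift @stream ) {
        %::HashT = %$template;
        return 1 if !$filter || $filter->();
    }
    return 0;
}

# A filter receives no arguments; it inspects %::HashT and returns true or false.
my $only_articles = sub { $::HashT{'template-type'} eq 'redif-article 1.0' };

my $count = 0;
while ( next_template_model($only_articles) ) {
    print "$::HashT{'handle'}\n";
    $count++;
}
print "$count article template(s)\n";
```

The model also makes the cost visible: the paper template is loaded in full before the filter rejects it, which is the wasted work mentioned above.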


Ivan Kurmanov <ivan@tm.minsk.by>