RePEc

Checking

Ivan Kurmanov

1: Instead of Introduction

This text describes the ideas and methods of checking data in RePEc. This first section introduces some basic RePEc concepts; it may be unnecessary for some readers but useful for newcomers. If you are comfortable with terms like ReDIF, template and handle, you may skip to the second section. Otherwise I recommend reading this section and then the two main documents about RePEc (mentioned below) before reading the rest of this text.

The following sections discuss the main aspects of data checking step by step.

There are several basic concepts about the RePEc system and its data organisation that you need to know when reading the following text.

The first is the archive and its structure, which is described in full detail in the Guildford protocol. Most of the data in RePEc is organised in archives.

An archive usually represents an independent information provider (an organisation), a node in the RePEc community network. Each archive may divide its data into several or many series by any criteria. An archive has a unique 3-letter code, which is used to identify it. Each series has its own 6-letter code, always unique within a given archive. Data about documents is put into series.

The next important thing is ReDIF and its details. All the data is stored in a special format called Research Documents Information Format (ReDIF). It is a well-defined and fairly mature format with its own syntax for writing templates. You should consult the ReDIF documentation for details if you are going to create or maintain data in RePEc. The specification is at http://netec.mcc.ac.uk/RePEc/ReDIF.html.

Each document has a separate corresponding record, which is called a template: it is the basic information unit in the RePEc system. There are different types of templates, for different purposes and for different types of documents.

A handle is a unique identifier of an information unit (a template) in the system. Each template has a handle. A handle is a sequence of non-space characters which can be used to refer to any document in RePEc without ambiguity. At least it should be, so there are defined rules for handles. A handle consists of several parts separated by colons. A handle always starts with "RePEc:". The rest of a handle depends on which type of template is referenced.

A correct handle of a document's template (e.g. a paper template) contains the string "RePEc", the archive's identifier, the series' identifier (unique inside the archive) and the document's identifier (unique inside the series). Template handles are assigned by information providers. This structure means that separate information providers, and different sources of ReDIF data within one provider organisation, can assign unique handles to their templates without any conflicts or negotiations, provided they follow the rules.
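The handle structure described above can be sketched in a few lines of Python. This is only an illustration, not code from RePEc: the names DocumentHandle and parse_document_handle are hypothetical, the 3-letter and 6-letter lengths are taken from the description of archive and series codes above, and allowing further colons inside the document identifier is an assumption.

```python
from typing import NamedTuple


class DocumentHandle(NamedTuple):
    """Parts of a document handle (hypothetical helper type)."""
    archive: str
    series: str
    document: str


def parse_document_handle(handle: str) -> DocumentHandle:
    """Split a handle of the form RePEc:<archive>:<series>:<document>."""
    if any(ch.isspace() for ch in handle):
        raise ValueError("a handle must not contain space characters")
    # Split on the first three colons only; anything after that is kept
    # as part of the document identifier (an assumption of this sketch).
    parts = handle.split(":", 3)
    if len(parts) != 4 or parts[0] != "RePEc":
        raise ValueError('a document handle must look like "RePEc:aaa:ssssss:id"')
    _, archive, series, document = parts
    if len(archive) != 3:
        raise ValueError("archive code must have 3 characters")
    if len(series) != 6:
        raise ValueError("series code must have 6 characters")
    if not document:
        raise ValueError("document identifier is empty")
    return DocumentHandle(archive, series, document)


if __name__ == "__main__":
    print(parse_document_handle("RePEc:xyz:abcdef:9901"))
```

The archive "xyz", series "abcdef" and document identifier "9901" are made-up values used only for the example.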

So these are some general principles of the data organisation and its hierarchy.

2: Why Should I Bother with Checking (my) Data?

The answer is: so that your RePEc archive is processed as valid and is correctly included in all general RePEc Web services, like WoPEc, BibEc and the search engine database; and, more generally, to make sure that your data is actually integrated into the dataset. Otherwise there is no sense in creating a RePEc archive and ReDIF data at all; it is a waste of effort. When a part of a RePEc archive does not conform to the rules (the Guildford protocol and the ReDIF data format), we cannot guarantee that the archive will be adequately represented and processed in the system. Automated machine-readability of the data is the key issue here.

Another problem with the data is the many cases of invalid or obsolete URLs that refer to electronic documents or other information. (This is a widespread problem across the internet, not only in RePEc.) If you imagine a user who finds a paper that he thinks is interesting and tries to download the document, but fails because of an incorrect URL, then you will probably understand why we think this is important. We cannot check all the information content of a template, but when it comes to electronic documents, we make a strong effort to avoid confusing our users with incorrect URLs for the documents presented in templates. So another direction for RePEc data checking is URL checking (for documents available electronically through the internet).

3: ReDIF Syntax Checking

The first stage of checking RePEc data is always the syntax check. It is done according to the ReDIF format (see the ReDIF documentation) and is executed every time a template is read, for any purpose, anywhere in our system. It is done automatically by the same (shared) mechanisms that are implemented in the rech script (see below). If the check finds an error in a template, that template is skipped and not processed, as if there were no template at all. If this happens to the archive, series or mirror templates of your RePEc archive, it may result in full or partial (one or several series) exclusion of your data from the services until a correct template is put in place. In the case of a mirror template, an error can cause trouble for the mirroring process.

The syntax check includes checking that all attributes used in a template are correct and, for certain attributes, checking the value of the attribute. Usually it is just a formal check that the value has the correct format. For example, an Author-Email attribute value which doesn't have a "@" character in it, or has a space character in the middle, will generate an error. But if, for instance, you specify an address which formally looks like a normal email address but is in fact obsolete due to some mail system changes in your organisation, the syntax check will not report an error. It will NOT try to check the actual validity of an email account.
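To make the idea of a purely formal check concrete, here is a minimal Python sketch of the two email rules mentioned above. It is not the code used by rech (which is a Perl script); the function name email_format_errors and the message texts are hypothetical, and the real check may apply further rules.

```python
from typing import List


def email_format_errors(value: str) -> List[str]:
    """Return formal problems found in an Author-Email value (sketch only)."""
    errors = []
    # Leading and trailing whitespace is ignored here (an assumption);
    # only a space "in the middle" counts as an error.
    stripped = value.strip()
    if "@" not in stripped:
        errors.append('missing "@" character')
    if any(ch.isspace() for ch in stripped):
        errors.append("contains a space character")
    return errors


if __name__ == "__main__":
    for candidate in ["ivan@tm.minsk.by", "ivan.tm.minsk.by", "ivan @tm.minsk.by"]:
        print(repr(candidate), email_format_errors(candidate))
```

Note that an obsolete but well-formed address passes this kind of check, exactly as described above: nothing here contacts a mail server.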

Similar checking rules apply to date values (creation-date, revision-date attributes), URL values (file-url attribute), file format values (file-format), handle values (handle and ...-handle, e.g. article-handle) and publication status values (publication-status).

Please note that at this stage checking handles is limited to checking the handle format. The idea of handles is that each template has a unique handle in the system, and we do check handles for uniqueness, but not at this level.

You can use the rech utility script, which is freely available for mirroring or downloading, to run a syntax check on your data.
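As an illustration of a format check on date values, the Python sketch below accepts only numeric dates written as yyyy, yyyy-mm or yyyy-mm-dd. That accepted form is my assumption about the ReDIF date type and should be verified against the ReDIF documentation; the function name is_valid_date_value is hypothetical and this is not rech's own code. Under this assumption a value such as "February 1998" (see the sample rech output at the end of this text) is rejected.

```python
import re
from datetime import date

# Assumed date form: yyyy, yyyy-mm or yyyy-mm-dd (check the ReDIF documentation).
_DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_valid_date_value(value: str) -> bool:
    """Formal check of a creation-date or revision-date value (sketch only)."""
    match = _DATE_PATTERN.fullmatch(value.strip())
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day defaults to 1, so only the given parts are tested.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False  # e.g. month 13 or 30 February
    return True


if __name__ == "__main__":
    for candidate in ["1998-02", "February 1998", "1998-02-30"]:
        print(repr(candidate), is_valid_date_value(candidate))
```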

4: Syntax Checking Script rech

It is a Perl 5 script (like all other scripts we use in our system) which can be found in the /all/soft/RePec/all/ subdirectory of your mirrored RePEc directory (.../remo/?). To run rech you need some other files in the archive "all", but if you mirror it correctly from the central RePEc machine, you don't have to do any further installation.

Use the "-h" or "--help" command line option to see brief help on usage.

Like other RePEc scripts, rech needs a correct REPECDIR environment variable to be set on the system, or needs to be started from that directory. An alternative is to specify your RePEc directory as an option: just add "--rdir <directory>" to the command line, replacing <directory> with the actual path. The program needs to know the RePEc directory in order to work.

There are several ways to check your files with rech. The first is to check all files in your RePEc archive directory (REPECDIR) and its subdirectories. This is the default if you run rech without any parameters other than the options which start with a dash character.

If you want to check individual .RDF files, or all ReDIF files in some directories other than your RePEc archive directory, use a different method: put the full pathnames of the files and/or directories on the command line when executing rech, separated by spaces and/or tabs. Each directory on the command line will be checked together with all its subdirectories, recursively.
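The way file and directory arguments are expanded can be pictured with the following Python sketch. It is not rech's actual code, and the function name iter_redif_files is hypothetical. Treating every file whose name ends in ".rdf" (in any letter case) as a ReDIF file is an assumption, as is passing explicitly named files through unchanged.

```python
import os
from typing import Iterable, Iterator


def iter_redif_files(paths: Iterable[str]) -> Iterator[str]:
    """Yield files to check: named files as given, directories recursively."""
    for path in paths:
        if os.path.isdir(path):
            for dirpath, dirnames, filenames in os.walk(path):
                dirnames.sort()  # visit subdirectories in a stable order
                for name in sorted(filenames):
                    # Assumption: ReDIF files are recognised by extension.
                    if name.lower().endswith(".rdf"):
                        yield os.path.join(dirpath, name)
        else:
            # An explicitly named file is passed through without filtering.
            yield path


if __name__ == "__main__":
    import sys
    for filename in iter_redif_files(sys.argv[1:]):
        print(filename)
```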

Rech loads a configuration file (by default <REPECDIR>/conf/rech.conf), in which you can set some parameters of the program's behaviour.

When rech finds a problem it reports it to standard output with the file name, the line number and an explanatory message. Unless the quoting mode is turned off (either by the -nq parameter or in the configuration file), rech will also show you the original lines of your RDF file (template), prefixed with the "> " character combination, so that you can see the lines with errors.

If you create and maintain your data with a plain text editor, rather than by automated conversion from a different source (format), rech may help you to correct the mistakes it finds in your data files. After each checked file which caused an error message, rech can execute an external command. You should set the command line and parameters in the configuration file. Two parameters can be passed to the external program: the name of the checked ReDIF file with errors and the name of a temporary error-log file. (When rech starts checking a file it creates a temporary file and writes all error messages with line numbers there.) So if your editor can open several windows (documents) simultaneously, as emacs can, you will be able to see all the error messages and correct the file easily.

Each archive's data is automatically and regularly checked with rech, and the results are put in corresponding files in the RePEc/anal/ directory on the RePEc central machine (netec.mcc.ac.uk). You may use these instead of running checks yourself. The rech report files are named xxx.rech, where xxx is the name of an archive.

5: Checking Handles and Archive Structure

This check is necessary to maintain the general integrity of the dataset and to detect mistakes in handle assignment. It is performed with the help of an SQL database, which contains some information about every correct template in the system. This database is updated on a regular basis (currently weekly). Each handle is checked.

First of all, the handle must match the archive and series in which the template was placed. For example, the handle of a paper template in archive "xyz" and series "abcdef" should start with the string "RePEc:xyz:abcdef:". Otherwise an error message will appear.
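In code, this location rule amounts to a prefix comparison. The Python sketch below is an illustration only; the function name handle_matches_location is hypothetical, and treating the comparison as case-sensitive is an assumption.

```python
def handle_matches_location(handle: str, archive: str, series: str) -> bool:
    """Check that a handle belongs to the archive and series it was found in."""
    # Case-sensitive comparison is an assumption of this sketch.
    return handle.startswith(f"RePEc:{archive}:{series}:")


if __name__ == "__main__":
    # A template found in archive "xyz", series "abcdef":
    print(handle_matches_location("RePEc:xyz:abcdef:9901", "xyz", "abcdef"))
    print(handle_matches_location("RePEc:xyz:ghijkl:9901", "xyz", "abcdef"))
```

The first call corresponds to a correctly placed template; the second would lead to an error message because the series part of the handle does not match.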

Each handle (of each template) is looked up in the database and the result is analysed. If the handle has previously appeared in some other file, or even in the same file, it is considered a duplicate and an error message is generated.
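The duplicate check can be sketched as follows. To keep the example self-contained, an in-memory Python dictionary stands in for the SQL database used by the real system; the function name find_duplicate_handles, the input record layout and the message wording are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One record per template: (handle, file name, line number).
Record = Tuple[str, str, int]


def find_duplicate_handles(records: Iterable[Record]) -> List[str]:
    """Report every handle that was already seen in an earlier record."""
    first_seen: Dict[str, Tuple[str, int]] = {}  # stand-in for the database
    messages = []
    for handle, filename, line in records:
        if handle in first_seen:
            prev_file, prev_line = first_seen[handle]
            messages.append(
                f"({filename}, {line}): duplicate handle {handle}, "
                f"first seen in ({prev_file}, {prev_line})"
            )
        else:
            first_seen[handle] = (filename, line)
    return messages


if __name__ == "__main__":
    sample = [
        ("RePEc:xyz:abcdef:9901", "a.rdf", 1),
        ("RePEc:xyz:abcdef:9902", "a.rdf", 20),
        ("RePEc:xyz:abcdef:9901", "b.rdf", 1),
    ]
    for message in find_duplicate_handles(sample):
        print(message)
```

A duplicate inside the same file is caught in the same way as one in a different file, since only the handle itself is used as the lookup key.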

All error messages, with filenames and line numbers, are put into a special file, separate for each archive. All other problems found are also written there. You can find it in .../RePEc/anal/<xxx>.rela on netec.mcc.ac.uk, where <xxx> is an archive identifier. Apart from error messages, this file should also contain a line with the archive's id and name, a list of all series with names and identifiers, and a final line with some statistics (some of which have internal meaning only; others are obvious and may be used to monitor your archive and its status).

If due to some problem your archive wasn't included in the database and other services, you will probably find a reference to the cause here. An archive should have one and only one valid archive template and at least one correct series template.

6: URLs Control

[This section is yet to be written.]

Here is an example of rech output:


 - In </fmg/fmgsps> found file <sp0099.rdf>

> template-type: ReDIF-Paper 1.0
> title: The Inflation Target Five Years On
> author-name: Mervyn King
> author-workplace-name: Bank of England
> author-workplace-postal: Bank of England, Threadneedle Street, London, EC2R 8AH
> classification-jel: 
> creation-date: February 1998

Error: (sp0099.rdf, 12): Invalid value "February 1998" of type <date> (attr: creation-date, eval)


Ivan Kurmanov <ivan@tm.minsk.by>