This document is available in PDF format for US letter size paper and for A4 size paper.
It gives details of how the ReDIF format is being encoded in a special specification file. This file is normally called "redif.spec". This documentation explains how to make changes to a ReDIF specification. One can make enhancements to an existing specification or design a new specification from scratch.
Before trying to understand the specification of ReDIF, one should understand the ReDIF data format documentation well, because this document is like a dark side of the moon for ReDIF:it is better look at the bright side first.
The ReDIF specification file contains a ReDIF specification in a custom-built format called the "SPEcification FOrmat for Redif", abbreviated spefor. The spefor file is loaded each time any data reading starts. It defines:
Each cluster and template has a finite set of allowed attributes, some of which may be required to appear in every template of the type. An attribute in a template or a cluster must have a defined (complete) name and may have some additional properties set. The additional properties of an attribute are: type of value, the permission to omit and the prohibition to repeat the attribute in a template.
The spefor syntax borrows some features of Perl code syntax. Comment lines
start with the hash char #
symbol. Repetitions of whitespace (i.e. blanks
and tabulation characters) are ignored.
Spefor files make extensive use of blocks of text.
Blocks start with the open curly brace {
character and
end with the close curly brace }
character.
Unlike in C or Perl, each block closes with a line consisting of a
single }
character, not taking spaces and tab chars into account.
This means that in the following example:
A {block's code here
} B {other data or code
}
block A
is correctly closed, but the block B
is not. The
difference is a newline character before closing brace }
.
This imposes a limitation on the contents of each block: you can
not put a line with single }
inside a block. Be
careful with Perl code in blocks.
The ReDIF specification is specified by a number of definitions. In other words, definitions are the main structural elements of the spefor file.
A definition states what is being defined and how it will be named and then gives the contents of the definition in a block. The most common syntax is
def_type
=
name
block
'\n'
'\n'
designates a newline character here.
In some cases, you will also see definitions of the syntax
def_type
=
name
[/
parameter
] block
'\n'
The item in square brackets is optional here.
There are three types of definitions. Therefore def_type
variable can take
only three values. These are template
, cluster
and type
. These
values are case-insensitive.
A template type definition lists attributes of a template and sets
their properties. The name
variable of the definition line gives
the template type name in full, like ReDIF-Paper 1.0
for
example. All its attributes are listed inside the definition
block
. Each attribute is listed on a separate line. Each
non-empty line defines one attribute of the template. Its syntax is
attribute_name
[:
[type
][:
[subtype
][:
[required_flag
][:
[non_repeatable_flag
]]]]]
Whitespace characters are not allowed inside this string. They are allowed at the beginning or end of the string. The colon character is a delimiter here. The only part of the line which is required is the attribute_name . type
and subtype have several different uses for different cases, but in general they specify content properties of the attribute. In most cases only type
is given, the subtype parameter has less usage.
The meaning of the required_flag
is to answer the following
question: Is this attribute necessary to consider template or cluster
valid or it is optional? The value 1
in the required_flag
means that the attribute is required.
The meaning of non_repeatable_flag
is to answer the following
question: Can this attribute appear several times in one template or
cluster? By default attributes can be repeated, but if you set
non_repeatable_flag
to 1
, then the attribute will become
non-repeatable.
A special format exists for key attributes of template and cluster
types. If in the required_flag
field of the attribute definition there
is an asterisk '*
' (instead of 1
or anything else), then this attribute
will become the key attribute of the template or the cluster.
In all other cases required_flag and non_repeatable_flag may contain any digits, but normal value for it is 1 or nothing by default.
An example of a valid template type definition:
template = ReDIF-Person 1.0 { Template-Type:::* Handle:::1:1 Name-Full:::1:1 Name-First Name-Last WorkPlace:cluster:Organization:: Email Fax Postal Phone Homepage:URL # Homepage-Publications:URL # Classification-Jel }This example defines the template type
ReDIF-Person 1.0
. The attribute
Template-type
is the key attribute of the template, the Handle
and
Name-Full
attributes are required and non-repeatable. The attribute
Workplace
is of type cluster
and subtype Organization
. The
attribute Homepage
is of type URL
. Lines with
Homepage-Publications
and Classification-Jel
attributes are
commented out. They are not included into the template type.
Clusters are groups of attributes. They are defined outside the template types and may be used as a group in various templates to build composite elements. For example consider the following cluster definition
cluster = address { Country:::* state: county zipcode city:::1:1 street }
Similarly to template type definitions, a cluster type definition
contains a list of attribute names with their properties. This
particular example (above) defines a cluster type named address
.
Elements of a cluster type (attributes) can not be used directly in any template. To use a cluster in some template type, we need to introduce it there through an ordinary attribute. The attributes of the cluster then can appear in the data being prefixed by that attribute and a dash.
For example, this means that if we introduce a cluster of type address
into some template through an attribute Home
then following data lines
may be valid inside such templates:
Home-Country: U.S.A. Home-State: CA Home-City: Santa Barbara
More detailed explanation about how to introduce a cluster into a template will be found in one of the following sections.
Once defined a cluster type can be used in any template or even cluster type. The scope of the cluster type definition's is global and clusters can be nested.
When we list attributes in a template or a cluster type definition
we can specify each attribute as having a certain type
.
There is one special type of attributes which will be described right in
the next section. It is the type cluster
.
cluster
If the type
property of an attribute line takes value cluster
,
then the subtype
shall point to a specific cluster type. For example if we
have defined cluster type course
, then in a template type definition
for a teacher we can put following attribute
line:
teaching:cluster:course
In this case the attribute teaching
in the template type will
represent a cluster of type course
. Attributes of this cluster
type may appear in any template of this type, being prefixed by
teaching-
. For example, look at a template
Template-Type: ReDIF-Teacher 1.0 Name: Thomas Krichel Teaching-Course-Name: Dynamic Macroeconomics Teaching-Course-Class-Size: 25 Teaching-Course-Name: Open Economy Macroeconomics Teaching-Course-Class-Size: 20 Handle: RePEc:per:teachr:T_Krichel
Such a template is possible if the cluster type course
has been defined as
cluster = course { course-name:::* course-class-size }
Exactly the same mechanism can be used to include clusters into other clusters, i.e. clusters can be nested.
All we dealt with up to this moment are the ReDIF template structures. Having defined a template structure is important for parsing and validating data, but sometimes more deep checking of data is desired. And this is what this section is about.
We have shown how an attribute specified in a template type or a
cluster type definition can bring a cluster into a corresponding
structure. We use a special value of cluster
of the type
property of an attribute for that.
In other cases, the type of an attribute is used to specify the attribute's value data type. These types are types of value that this attribute may take. In most cases this property is empty which means that no certain data type has been assigned, and any value is possible.
For example, we can define a data type date
for our date-related
attributes and then declare those attributes as having value type date
.
This will impose some checking and processing rules of the date
data
type on all appearances of the attributes or, to be more correct, on the
values of the attributes. To set those checking and processing rules we
use the Perl programming language. We will be more elaborate on that
subject in the following sections.
Value types define rules by which the attribute values will be processed when a ReDIF data file is being read or checked. Attributes may be associated with a value data type through a cluster or template type definition, as it has been shown in the previous sections.
The value data type definition syntax is
type =
type_name
[/
type_aspect
] block
'\n'
The type_aspect specifies how to interpret the definition's content block. This parameter may take three different values
For example, let us look at a following definition:
type = handle / check-regex { ^RePEc\:[a-zA-Z]{3}(\:[a-zA-Z0-9]{6}(\:[^\s\n]+)?)?$ }
It defines a check-regex
aspect of data type handle
. As a result,
all values of handle
type attributes will be checked to match the
Perl regular expression in the braces above.
Thus the check-regex
aspect allows to set the syntax of attribute values in
the form of a Perl regular expression. Values that don't match the Perl regular
expression would be rejected and the corresponding template will be invalid.
The check-eval
aspect allows to set checking routine code for
values of the data type in Perl language. The code is evaluated at
run-time. Depending on the return value (or last expression, evaluated
in the block), the attribute is accepted or rejected.
In the code one has access to the value which we need to be check by a scalar variable $value . Let's look at an example
type = pubstat / check-eval { if ( $value !~ /^Forthcoming/i or $value !~ /^Published/i ) { return 0; } else { return 1; } }This type is for the Publication-Status attribute, whose value must start with one of
Forthcoming
and Published
case-insensitive.
Here again let us remind that one cannot use single closing brace
}
on a line inside a block. So if we write the same piece in a
different way
type = pubstat / check-eval { if ( $value !~ /^Forthcoming/i or $value !~ /^Published/i ) { return 0; } else { return 1; } }this will cause an error at spefor file (redif.spec) reading.
There is another one interesting feature in check-eval aspect of types. An author or maintainer of data often wants to know why this or that template is rejected. The rech data checking script for this purpose uses so-called messages.
A message has a text (content) and a status. A status is one of several levels: from a debugging note (lowest) to a fatal error message (highest). But the two most often used levels are: a warning and an error. For further technical details please consult rech, rr.pm and messages.pm Perl files.
If during reading a template somewhere appears a message with status of error or higher, then the template will be rejected. Depending on the invocation of rech, the message content may be printed to the standard output.
Should a type checking aspect check-regex or check-eval
executed for a particular value
produce a negative result, the parsing software will generate a
message of a general form, stating that value XXX of type
YYY is invalid
or alike.
If one wants to make for clever and friendly checking for your data, he may want to change this kind of message to a different one, showing exactly what is the problem of the data. To make this in check-eval aspect definition you may yourself initiate a message. To initiate a message you simply call a function msg. But to exclude error duplication, if you initiate an error message in a check-eval block, be sure to return a positive value at the end.
The msg subroutine accepts the message text as the first parameter, and a number - the status level as the second parameter.
The status level of 2 corresponds to a warning, 3 - to an error.
For example, the code in the following type definition may generate an error message when the corresponding value is incorrect.
type = integer / check-eval { if ($value =~ /^\d+$/) { msg "($file, $line): Value <$value> is not an integer", 3; return 1; # as if it was OK ;} }Thus function msg sends a message. The $file and $line variables will be equal to the data filename and the line number position in the file.
Apart from figuring out and reporting errors, you can also give user a warning about data. Let's look into another example
type = URL / check-eval { if ( $value =~ /^(ftp|http):\/\//i and $value !~ /^(ftp|http):\/\// ) { msg "($file, $line): URL capitalization is bad <$value>", 2; $value =~ s/^Ftp/ftp/i; $value =~ s/^Http/http/i; ;} $value =~ /^(URL:)? (ftp|http|gopher)\: \/\/[a-zA-Z0-9\_\.\-]+ (\:[0-9]+)? (\/?|[\~\&\%\^\@\!\'\"\`\(\)\|\#\:\;\/A-Za-z\-\_0-9\?\=\*\\\+\,\.]+)$/x; }
In the first part of this check-eval we check the capitalization of the starting part of a URL (protocol, like http or ftp). The line
msg "($file, $line): URL capitalization is bad \"$value\"", 2;
creates a message with status 2 (Warning). The template will not be ignored. Then we correct the capitalization of the value and go further. In the finishing part of the block we check the value to match a big regular expression (Perl5 extended syntax - note "/x" in the end). The logical value calculated in the result of this matching operator will trigger an error if the value does not match.
Another useful function that a ReDIF parsing software (which is behind rech, rere or rr.pm) may perform is in to convert attribute values from the form in which the author has written it to some other (possibly, standardized) form.
Historically within the RePEc project, a need for this feature arose
with the use of the subject classification codes defined by the
Journal of Economic Literature, the JEL codes. For example, we could
do (and actually did) so, that a sample value of JEL classification
attribute a21; b3, X3
would be automagically converted to A21 B3
at the data reading
stage, `on the fly'. This conversion includes: change to the uppercase
letters, separation by single space characters, and ignoring
(filtering out) the invalid code X3
.
It is implemented using a preproc
value data type aspect. It
works very much like check-eval
. The preproc
type block is
eval
-uated in Perl while reading data, but the resulting
(returned) value does not matter.
Consider another following example:
type = date / preproc { return if $value eq ''; if ($value !~ /-/ ) { $value =~ s/^(\d{4})(\d{2})$/$1-$2/; $value =~ s/^(\d{4})(\d{2})(\d{2})/$1-$2-$3/; } if ($value !~ /^\d{4}(-\d\d){0,2}$/ ) { if ($rr::Options{'MessageOut'}) { include_attrline( ); msg "($file, $line): A bad date value format: \"$value\"", 2; } else { $value = '' ; } return 1; ;} }
This example is for attributes of the date type. Dates are
converted to a standard form or cleared to ''
, if original value is
of incorrect format. But if this code is executed in the data checking
mode (for example, in rech script) which is checked by
$rr::Options{'MessageOut'}
then a warning message is generated.
Also note a function call before creating a message:
include_attrline( );
This is necessary for correct rech output if you generate messages in
preproc
.
Just to clear some possible questions, here we have a brief introduction about how the ReDIF data is being parsed and how the spefor file is being used.
When a ReDIF parsing engine behind rr.pm, rere and rech reads an attribute, it checks some control values i.e. it asks several questions about an attribute. If any of a question leads to a negative result, an error is being created. And such an error causes the whole template to be treated as invalid. The template is ignored and the processing program (the parser's client) cannot see it.
The first thing checked about attribute is whether the attribute is correctly placed. Is it allowed for the current template and/or cluster type(s)? If the attribute is repeated, is it allowed to be repeated here?
If the attribute fits well into the template structure (does not violate the template type), the parser then checks the value of the attribute.
The engine then checks what type of attribute it is (as it is defined
in the template or cluster type). If its type is cluster
, then
the engine opens a new cluster. If it is of some data type, that
type's preproc
aspect is checked and executed (eval
-uated), if
it has been defined in spefor. Then check-regex
is matched if
defined. Next step is check-eval
eval
-uation, you guessed,
if defined.
If at the check-regex
or the check-eval
stage a negative result has
been received, then the engine generates an error and stops further
processing of the template. Actually the same happens if an error appears
in the pre-proc
aspect.
When a new Template-type
attribute is encountered or an end of file is
reached, the template is checked to contain all required attributes of
the template type. Processing the template goes on successfully only if
this check returns positive result. After that the template is considered
to be complete and valid.