Brisbane Protocol

by Thomas Krichel

version of 2007–08–15

0. Introduction

This is the Brisbane Protocol proposal. Thomas Krichel started to work on it on 2007–07–03.

This version is obsolete. Here is the current version.

1. Motivation

This proposal aims to regulate the integration of documents furnished by institutional repositories (IRs) into RePEc.

So far this process has been discussed with confused terminology and logic. This document seeks to clarify both.

To Thomas Krichel the IR development remains politically unpleasant, because it robs academic authors of a fundamental freedom to publish their work where they see fit.

Having said that, RePEc aims to be an inclusive project that does not take sides in political arguments. Therefore a standard procedure to incorporate IR contents into subject-based aggregators appears to be useful. Thomas Krichel is not aware of a standard that the RePEc community could readily adopt.

The terms of discussion here abstract from RePEc. The protocol aims to be generic to all sorts of subject-based collections that would like to include contents from IRs.

2. Definitions and theory

A document has fundamental qualities in itself. Such qualities include author names, title, and abstract. Data about the publishing process is not part of the fundamental qualities.

Two documents are called versions of each other if they contain the same fundamental characteristics.

As a basic set of non-fundamental characteristics, every document has one publisher and one series.

In the publication process documents are grouped into series. Series have a perceived value, i.e. there are high-quality and low-quality series. Authors gain reputation from their document being included in high-quality series.

Grouping of documents into series is part of the publication process. Usually, only one version of a document is part of a given series. But there may be several versions belonging to different series.

Publishers, as understood in this document, hold the power of inclusion of the document in a series. This is different from the broad meaning of the term that suggests that publishers are individuals or organizations that make documents public.

In principle, a pure IR should not be considered a publisher. All documents that are deposited have a series nature that is outside the IR and that the IR wishes to advertise. However, an IR may also be used as an agent for a publisher to store data in.

An “ird” is a document held in an institutional repository.

Every ird has a mode. There are two modes, “internal” and “external”. Usually groups of documents have the same mode.

An ird is internal if it is the definitive version of a document in its series. An ird is external otherwise.

A couple of examples may help to illustrate this. Assume an author publishes an article in a journal. She uploads a copy of this article into the IR. This is an external ird. On the other hand, a department in an institution may publish a series of reports. It may delegate the technical infrastructure of this process to the IR. In that case all irds that belong to the published series are internal irds.

The distinction between internal and external irds is important because of the reputation attached to different series.

A subject-based collection (henceforth: subac) holds a collection of metadata of interest to some group.

The provenance of documents is a key issue in a subac. A subac must have a list of sources, from which it compiles its own archives. A source is something like an IR, for example, that supports its own metadata rules, collection principles and metadata exposure protocols. An archive is something that is in the format expected by the subac and that supports an identification scheme that the subac can work with.

If a source has not been purpose-built for the subac, there has to be some selection from the source into the subac's archive. If the source is large, harvesting the whole source to build the archive is a problematic proposition.

Therefore sources have to be registered within the subac. If the subac registers subdivisions of a source, these also have to be registered. The way this is done depends on the subac. In the section “Registration of IRs with RePEc” we have a proposal for RePEc.

General implementation

We assume that the IR implements version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting, henceforth OAI-PMH.

If a source needs to make records available to a subac, it places them in an OAI-PMH set with the same name as the identifier of the collection within the subac. If that identifier cannot be used as a set name, it is URI encoded. Therefore, when a subac harvesting agent accesses a collection from the IR, it first uses the id of the collection as the subac knows it. If this fails to retrieve any record, it uses the URI encoding of the identifier.
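The fallback described above can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up here, "URI encoded" is taken to mean percent-encoding of every reserved character, and the base URL in the usage note is a placeholder, not a real endpoint.

```python
from urllib.parse import quote, urlencode


def candidate_set_names(collection_id):
    """Return the set names a harvester would try, in order.

    First the collection id as the subac knows it, then (if different)
    its percent-encoded form. Treating "URI encoded" as full
    percent-encoding is an assumption of this sketch.
    """
    encoded = quote(collection_id, safe="")
    names = [collection_id]
    if encoded != collection_id:
        names.append(encoded)
    return names


def list_records_url(base_url, set_name, metadata_prefix="amf"):
    """Build an OAI-PMH ListRecords request URL for one set name."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_name}
    )
    return f"{base_url}?{query}"
```

A harvesting agent would loop over candidate_set_names("RePEc:ner:tilbir"), request list_records_url("http://oai.example.org", name) for each name, and stop at the first response that contains records.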

All metadata uses the AMF format.

OAI identifiers, as used in OAI-PMH requests and responses, must be identical to the AMF record identifier, if such an identifier is provided.

Internal irds

Internal irds are made available using conventional AMF notation.

Every record in a set must have an identifier that starts with the set identifier. Thus, a record for an internal ird can only be in one set.
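The prefix rule lends itself to a mechanical check. A minimal sketch in Python, assuming record identifiers are available as plain strings; the function name is hypothetical and the check is a literal reading of "starts with the set identifier":

```python
def records_outside_set(set_id, record_ids):
    """Return the record identifiers that do not start with the set identifier."""
    return [rid for rid in record_ids if not rid.startswith(set_id)]
```

An empty result means every record in the set satisfies the rule.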

External irds

A fully external ird is simply represented by its external handle. In this case, the metadata supplied concerns solely the files provided by the IR.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
  <text id="RePEc:ner:irsenr">
   <file><!-- full text data -->
    <url>...</url>
    <function>...</function>
    <format>...</format>
    <restriction>...</restriction>
   </file>
   <isversionof>
    <text ref="RePEc:sur:surrec:9801"/>
   </isversionof>
  </text>
</amf>

Here, the target document is represented by its handle in the subac only. Note the usage of ref=, rather than id=.

However, given the potential volatility of handles within the subac's handle structure, it is better to populate the record with metadata that is already in the subac at the time the item is captured. Such data can be gathered through the subac's OAI interface, if such an interface exists. In RePEc's case, records would be found with http://oai.repec.openlib.org?verb=GetRecord&metadataPrefix=amf&identifier=handle where handle is the handle of the item described. From that response, the element /OAI-PMH/GetRecord/record/metadata/amf/text has to be extracted, and its id attribute has to be renamed to ref.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
  <text id="RePEc:ner:irsenr">
   <file><!-- full text data -->
    <url>...</url>
    <function>...</function>
    <format>...</format>
    <restriction>...</restriction>
   </file>
   <isversionof>
    <text ref="RePEc:sur:surrec:9801">
     <type>preprint</type>
     <title>Growing at Different Rates</title>
     <abstract>We examine a two country world. ....</abstract>
     <date event="created">1998-04</date>
     <classification xsi:type="jel1991">E62 H54 F43</classification>
     <file>
      <url>http://www.econ.surrey.ac.uk/discussion_papers/RePEc/sur/surrec/surrec9801.pdf</url>
      <format>application/pdf</format>
     </file>
     <hasauthor>
      <person ref="RePEc:per:1965-06-05:THOMAS_KRICHEL">
       <name>Thomas Krichel</name>
       <email>krichel@openlib.org</email>
      </person>
     </hasauthor>
    </text>
   </isversionof>
  </text>
</amf>
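The step of turning a harvested subac record into a referenced noun can be sketched in Python with the standard library. This is a sketch under assumptions: the response is taken to follow the usual OAI-PMH 2.0 layout (OAI-PMH, GetRecord, record, metadata), the function name is made up, and the sample response is a shortened, hand-written illustration rather than real output of the RePEc interface.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
AMF_NS = "http://amf.openlib.org"

# Hand-written sample of a GetRecord response (illustrative only).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
 <GetRecord><record><metadata>
  <amf xmlns="http://amf.openlib.org">
   <text id="RePEc:sur:surrec:9801">
    <title>Growing at Different Rates</title>
   </text>
  </amf>
 </metadata></record></GetRecord>
</OAI-PMH>"""


def text_noun_as_reference(oai_response_xml):
    """Extract the AMF text noun and rename its id attribute to ref."""
    root = ET.fromstring(oai_response_xml)
    path = (
        f"{{{OAI_NS}}}GetRecord/{{{OAI_NS}}}record/"
        f"{{{OAI_NS}}}metadata/{{{AMF_NS}}}amf/{{{AMF_NS}}}text"
    )
    text = root.find(path)
    if text is None:
        raise ValueError("no AMF text noun found in the response")
    handle = text.attrib.pop("id", None)
    if handle is not None:
        text.set("ref", handle)
    return text


noun = text_noun_as_reference(SAMPLE_RESPONSE)
```

The returned element can then be appended inside the isversionof adjective of the IR's own record.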

If there is no subac record known for the item, the data can be added in an anonymous AMF noun, which does not carry an id attribute.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
  <text id="RePEc:ner:irsenr:atehuxaoeu">
   <file><!-- full text data -->
    <url>...</url>
    <function>...</function>
    <format>...</format>
    <restriction>...</restriction>
   </file>
   <isversionof>
    <text>
     <type>article</type>
     <title>On Doctors, Mechanics, and Computer Specialists: The Economics of Credence Goods</title>
     <abstract>Most of us need...</abstract>
     <serial>
      <issue>1</issue>
      <issuedate>2006</issuedate>
      <volume>44</volume>
      <issue>March</issue>
      <journaltitle>Journal of Economic Literature</journaltitle>
     </serial>
     <hasauthor>
      <person>
       <name>Uwe Dulleck</name>
      </person>
     </hasauthor>
     <hasauthor>
      <person>
       <name>Rudolf Kerschbamer</name>
      </person>
     </hasauthor>
     </text>
   </isversionof>
  </text>
</amf>

Inter

Series data

Here is a concordance table from ReDIF to AMF, using XPath notation.

ReDIF field               AMF XPath
Name                      /amf/collection/title
Handle                    /amf/collection/@id
Maintainer-Email          /amf/collection/hasmaintainer/person/email
Provider-(ORGANIZATION*)  /amf/collection/haspublisher/organization
Maintainer-Phone          /amf/collection/hasmaintainer/person/phone
Maintainer-Fax            /amf/collection/hasmaintainer/person/fax
Maintainer-Name           /amf/collection/hasmaintainer/person/name
Description               /amf/collection/description
Editor-(PERSON*)          /amf/collection/haseditor/person
Followup                  /amf/collection/isreplacedby/collection/@id
Predecessor               /amf/collection/replaces/collection/@id

Elements that are not in AMF are presumably of negligible value.
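To make the concordance concrete, here is a minimal Python sketch that applies part of it to a ReDIF series template given as a dictionary. It is an illustration only: the function name is made up, the cluster fields (Provider-*, Editor-*) and the Followup and Predecessor links are left out for brevity, and the AMF namespace is omitted.

```python
import xml.etree.ElementTree as ET

# ReDIF field -> path below the collection noun; "@id" marks the id attribute.
CONCORDANCE = {
    "Name": "title",
    "Handle": "@id",
    "Description": "description",
    "Maintainer-Name": "hasmaintainer/person/name",
    "Maintainer-Email": "hasmaintainer/person/email",
    "Maintainer-Phone": "hasmaintainer/person/phone",
    "Maintainer-Fax": "hasmaintainer/person/fax",
}


def redif_series_to_amf(fields):
    """Build an AMF collection element from a dict of ReDIF fields."""
    collection = ET.Element("collection")
    for key, value in fields.items():
        path = CONCORDANCE.get(key)
        if path is None:
            continue  # fields without an AMF counterpart are dropped
        if path == "@id":
            collection.set("id", value)
            continue
        node = collection
        for step in path.split("/"):
            child = node.find(step)
            if child is None:
                child = ET.SubElement(node, step)  # create missing levels once
            node = child
        node.text = value
    return collection
```

Calling ET.tostring(redif_series_to_amf(fields), encoding="unicode") gives the serialized collection noun; all maintainer fields end up under a single hasmaintainer/person element.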

Papers data for external irds

Registration of IRs with RePEc

An OAI-based RePEc archive (henceforth: obra) is a set of data that is available to RePEc as an archive.

Registration of IRs uses the special address ftp://all.repec.org/oai

There RePEc makes available a set of files. All names of files that contain IR access data have the ending .amf.xml. Each file describes an obra. The name of the file starts with the handle of the obra, without the leading RePEc:.
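The naming rule can be written down as a small Python sketch. It assumes the simplest reading, namely that nothing sits between the shortened handle and the .amf.xml ending; the function names are hypothetical.

```python
REGISTRY_URL = "ftp://all.repec.org/oai"  # registration address given in the text
HANDLE_PREFIX = "RePEc:"
FILE_ENDING = ".amf.xml"


def obra_file_name(handle):
    """Return the registration file name for an obra handle such as RePEc:ner."""
    if not handle.startswith(HANDLE_PREFIX):
        raise ValueError(f"not a RePEc handle: {handle!r}")
    return handle[len(HANDLE_PREFIX):] + FILE_ENDING


def is_obra_file(file_name):
    """Tell whether a file in the registry contains IR access data."""
    return file_name.endswith(FILE_ENDING)
```

For the archive RePEc:ner used in the following example, this reading yields the file name ner.amf.xml.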

Each file contains one AMF collection noun describing the archive, which nests other collection nouns describing the series. To fix ideas, let the AMF data provide information about archive RePEc:ner, belonging to NEREUS.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
  <collection id="RePEc:ner">
   <title>NEREUS</title>
   <homepage>http://www.nereus.info</homepage>
   <url>http://oai.nereus.info?verb=GetRecord&amp;identifier=RePEc:ner&amp;metadataPrefix=amf</url>
   <hasmaintainer>
    <person>
     <email>info@nereus.org</email>
    </person>
   </hasmaintainer>
   <haspart>
    <collection id="RePEc:ner:tilbir">
     <title>Tilburg RePEc IR papers</title>
     <url>http://oai.nereus.info?verb=ListRecords&amp;metadataPrefix=amf&amp;set=RePEc:ner:tilbir</url>
    </collection>
   </haspart>
  </collection>
</amf>

The URL adjective (in AMF terms) of the main collection noun must be an OAI-PMH request URL which, in its OAI payload, delivers the entire AMF record for the archive, including all the series.

The URL adjectives of the series point to ListRecords requests that deliver, in their OAI payloads, the records for all the ids in the series.
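A harvester needs to read the archive URL and the per-series ListRecords URLs out of such a registration file. The following Python sketch shows one way to do this with the standard library. The function name and the sample data (including the host oai.example.org) are assumptions made for illustration; note that ampersands in URLs must be written as &amp;amp; for the file to parse as XML.

```python
import xml.etree.ElementTree as ET

AMF = "{http://amf.openlib.org}"

# Hand-written sample registration file (illustrative only).
SAMPLE_REGISTRATION = """<amf xmlns="http://amf.openlib.org">
 <collection id="RePEc:ner">
  <url>http://oai.example.org?verb=GetRecord&amp;identifier=RePEc:ner&amp;metadataPrefix=amf</url>
  <haspart>
   <collection id="RePEc:ner:tilbir">
    <url>http://oai.example.org?verb=ListRecords&amp;metadataPrefix=amf&amp;set=RePEc:ner:tilbir</url>
   </collection>
  </haspart>
 </collection>
</amf>"""


def read_registration(registration_xml):
    """Return (archive handle, archive URL, {series handle: ListRecords URL})."""
    root = ET.fromstring(registration_xml)
    archive = root.find(f"{AMF}collection")
    if archive is None:
        raise ValueError("no archive collection noun found")
    series_urls = {}
    for series in archive.findall(f"{AMF}haspart/{AMF}collection"):
        series_urls[series.get("id")] = series.findtext(f"{AMF}url")
    return archive.get("id"), archive.findtext(f"{AMF}url"), series_urls


archive_id, archive_url, series_urls = read_registration(SAMPLE_REGISTRATION)
```

Each value in series_urls can be requested independently, so the series of one archive may be served by different OAI interfaces.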

In this allocation of series, the series do not need to be catered for by one OAI interface. In fact, series can be contained in a variety of OAI interfaces. The key is that the collection handles have to be prefix

