Brisbane Protocol

by Thomas Krichel

version of 2007–08–31

0. Introduction

This is the Brisbane Protocol proposal. Thomas Krichel started to work on it on 2007–07–03.

Other, obsolete versions are

version of 2007–07–29
version of 2007–08–15

1. Motivation

This proposal aims to regulate the integration of documents furnished by institutional repositories (IRs) into RePEc.

So far, this process has been discussed using confused terminology and logic. This document seeks to clarify both.

The discussion here abstracts from RePEc. The protocol is meant to be generic to all sorts of subject-based collections that would like to include contents from IRs, because a standard procedure for incorporating IR contents into subject-based aggregators appears to be useful. Thomas Krichel is not aware of an existing standard that the RePEc community could readily adopt.

2. Definitions and theory

A subject-based collection (henceforth: subac) holds a collection of metadata of interest to some group.

The provenance of documents is a key issue in a subac. A subac must have a list of sources, from which it compiles its own divisions. Here a source is something like an IR, for example, that supports its own metadata rules, collection principles and metadata exposure protocols. A division is something that is in the format expected by the subac and that supports an identification scheme that the subac can work with.

In addition, any subac known to man makes a logical partitioning of documents into series. Usually series are partitions of divisions.

An IR makes documents available. Any such document is called an ird in the following.

Every ird has a mode. There are two modes: “internal” and “external”.

An IR may group its documents into different series. Then, a subac will group the IR into a division, and series of the IR into series of the division that represents the IR.

An ird is internal if it is the definitive version of a document in its series. An ird is external otherwise.

A couple of examples may help to illustrate this. Assume an author publishes an article in a journal. She uploads a copy of this article into the IR. This is an external ird. On the other hand, a department in an institution may publish a series of reports. It may delegate the technical infrastructure of this process to the IR. In that case all irds that belong to the published series are internal irds.

An external ird can have two presentations. These presentations are called freestanding and boundstanding.

In a freestanding representation, as made in a user interface, an external ird can be presented on its own. The page of the external ird says, for example, that it is part of the papers uploaded to the IR, under the series that the subac considers appropriate for it. The ird may be related to a document in the subac.

In a boundstanding representation, the ird is part of another subac record. As far as the subac is concerned, it has no representation of its own. The metadata from the boundstanding external ird has to be related to a record in the subac to be meaningful.

General implementation

We assume that the IR implements version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting, henceforth OAI-PMH.

If a source needs to make records available to a subac, it places them in an OAI-PMH set with the same name as the identifier of a division or series within the subac. If that identifier cannot be used as a set name, it is URI encoded. Therefore, when a subac harvesting agent accesses a collection from the IR, it first uses the id of the collection as the subac knows it. If this fails to retrieve any record, it uses the URI encoding of the identifier.
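The two-step lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "URI encoded" is taken to mean percent-encoding of every reserved character, and the actual network access is passed in as a `fetch` callable so that the logic stays independent of any HTTP library.

```python
from urllib.parse import quote, urlencode


def candidate_set_names(collection_id):
    """Return the set names to try, in order: the plain id, then its URI encoding."""
    encoded = quote(collection_id, safe="")  # assumption: encode every reserved character
    names = [collection_id]
    if encoded != collection_id:
        names.append(encoded)
    return names


def list_records_url(base_url, set_name, metadata_prefix="amf"):
    """Build an OAI-PMH ListRecords request for one set."""
    query = urlencode(
        {"verb": "ListRecords", "metadataPrefix": metadata_prefix, "set": set_name}
    )
    return base_url + "?" + query


def harvest(base_url, collection_id, fetch):
    """Try the plain id first; fall back to the URI-encoded id if nothing comes back.

    `fetch` is a hypothetical callable that takes a URL and returns a list of records.
    """
    for name in candidate_set_names(collection_id):
        records = fetch(list_records_url(base_url, name))
        if records:
            return name, records
    return None, []
```

Note that the set name is escaped once more when it is placed in the query string. That second escaping belongs to the HTTP request and is separate from the URI encoding of the set name itself.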

All metadata uses the AMF format.

OAI identifiers, as used in OAI-PMH requests and responses, must be identical to the AMF record identifier, if such an identifier is provided.

Internal irds

Internal irds are made available using conventional AMF notation.

Every record in a set must have an identifier that starts with the set identifier. Thus, a record for an internal ird can only be in one set.
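A harvesting agent could check the two identifier rules stated so far with a few lines of code. The sketch below is an assumption about how such a check might look; the function name and the wording of the messages are hypothetical and not part of the protocol.

```python
def check_record(oai_identifier, amf_id, set_id):
    """Return a list of rule violations for one harvested record.

    oai_identifier: identifier from the OAI-PMH record header
    amf_id:         id attribute of the AMF noun, or None if the noun is anonymous
    set_id:         identifier of the set the record was harvested from
    """
    problems = []
    # Rule 1: the OAI identifier must equal the AMF id, if an AMF id is given.
    if amf_id is not None and oai_identifier != amf_id:
        problems.append("OAI identifier differs from AMF record identifier")
    # Rule 2: the record identifier must start with the set identifier.
    if not oai_identifier.startswith(set_id):
        problems.append("record identifier does not start with set identifier")
    return problems
```

An empty list means that the record passes both checks.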

External irds

A boundstanding external ird is simply represented by its external handle. In this case, the metadata supplied concerns solely the files provided by the IR.

Here, the target document is represented by its handle in the subac only. Note the usage of ref=, rather than id=.

However, given the potential volatility of handles within the subac's handle structure, it is better to populate the record with metadata that is already in the subac at the time the item is captured. Such data can be gathered through the subac's OAI interface, if such an interface exists. In RePEc's case, records would be found with http://oai.repec.openlib.org?verb=GetRecord&metadataPrefix=amf&identifier=handle where handle is the handle of the item described. In that response, the element /OAI-PMH/GetRecord/record/metadata/amf/text has to be found, and its id attribute has to be replaced with a ref attribute of the same value.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
<text id="RePEc:ner:irsenr">
   <file>
    <url>...</url>
    <function>...</function>
    <format>...</format>
    <restriction>...</restriction>
   </file>
   <isversionof>
    <text ref="RePEc:sur:surrec:9801">
     <type>preprint</type>
     <title>Growing at Different Rates</title>
     <abstract>We examine a two country world. ....</abstract>
     <date event="created">1998-04</date>
     <classification xsi:type="jel1991">E62 H54 F43</classification>
     <file>
      <url>http://www.econ.surrey.ac.uk/discussion_papers/RePEc/sur/surrec/surrec9801.pdf</url>
      <format>application/pdf</format>
     </file>
     <hasauthor>
      <person ref="RePEc:per:1965-06-05:THOMAS_KRICHEL">
       <name>Thomas Krichel</name>
       <email>krichel@openlib.org</email>
      </person>
     </hasauthor>
    </text>
   </isversionof>
</text>
</amf>
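The step of turning the subac's own record into a reference can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function name is hypothetical, the OAI-PMH 2.0 namespace is assumed for the response envelope, and the text noun is searched for below the metadata element wherever that element is nested in the response.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"  # OAI-PMH 2.0 response namespace
AMF_NS = "http://amf.openlib.org"                # AMF namespace as used in this document


def text_as_reference(oai_response_xml):
    """Find the AMF text noun in a GetRecord response and turn its id into a ref.

    Returns the modified element, or None if the response holds no text noun.
    """
    root = ET.fromstring(oai_response_xml)
    path = ".//{%s}metadata/{%s}amf/{%s}text" % (OAI_NS, AMF_NS, AMF_NS)
    text = root.find(path)
    if text is None:
        return None
    handle = text.attrib.pop("id", None)  # remove id= ...
    if handle is not None:
        text.set("ref", handle)           # ... and carry the same handle as ref=
    return text
```

The returned element can then be placed inside the isversionof verb of the record that describes the files held by the IR.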

If there is no subac record known for the item, the data can be added in an anonymous AMF noun, which does not carry an id attribute.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
<text id="RePEc:ner:irsenr:atehuxaoeu">
   <file>
    <url>...</url>
    <function>...</function>
    <format>...</format>
    <restriction>...</restriction>
   </file>
   <isversionof>
    <text>
     <type>article</type>
     <title>On Doctors, Mechanics, and Computer Specialists: The Economics of Credence Goods</title>
     <abstract>Most of us need...</abstract>
     <serial>
      <issue>1</issue>
      <issuedate>2006</issuedate>
      <volume>44</volume>
      <issue>March</issue>
      <journaltitle>Journal of Economic Literature</journaltitle>
     </serial>
     <hasauthor>
      <person>
       <name>Uwe Dulleck</name>
      </person>
     </hasauthor>
     <hasauthor>
      <person>
       <name>Rudolf Kerschbamer</name>
      </person>
     </hasauthor>
    </text>
   </isversionof>
</text>
</amf>

Series data

Here is a concordance table from ReDIF to AMF, using XPath notation:

Name: /amf/collection/title
Handle: /amf/collection/@id
Maintainer-Email: /amf/collection/hasmaintainer/person/email
Provider-(ORGANIZATION*): /amf/collection/haspublisher/organization
Maintainer-Phone: /amf/collection/hasmaintainer/person/phone
Maintainer-Fax: /amf/collection/hasmaintainer/person/fax
Maintainer-Name: /amf/collection/hasmaintainer/person/name
Description: /amf/collection/description
Editor-(PERSON*): /amf/collection/haseditor/person
Followup: /amf/collection/isreplacedby/collection/@id
Predecessor: /amf/collection/replaces/collection/@id

Elements that are not in AMF are presumably of negligible value.
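To illustrate how the concordance might be applied, the following sketch converts the simple fields of a ReDIF series template into an AMF collection noun. It is an assumption-laden example: the function name is hypothetical, the ReDIF template is taken to be already parsed into a dictionary, the cluster fields (Provider-*, Editor-*) and the Followup and Predecessor links are left out, and the AMF namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# ReDIF field -> path below <collection>; a leading "@" marks an attribute.
CONCORDANCE = {
    "Name": "title",
    "Handle": "@id",
    "Description": "description",
    "Maintainer-Name": "hasmaintainer/person/name",
    "Maintainer-Email": "hasmaintainer/person/email",
    "Maintainer-Phone": "hasmaintainer/person/phone",
    "Maintainer-Fax": "hasmaintainer/person/fax",
}


def redif_series_to_amf(fields):
    """Build an AMF collection element from a dict of ReDIF series fields."""
    collection = ET.Element("collection")
    for key, value in fields.items():
        path = CONCORDANCE.get(key)
        if path is None:
            continue  # field has no counterpart in this sketch, skip it
        if path.startswith("@"):
            collection.set(path[1:], value)
            continue
        node = collection
        for step in path.split("/"):
            child = node.find(step)
            if child is None:  # create intermediate nodes only once
                child = ET.SubElement(node, step)
            node = child
        node.text = value
    return collection
```

Because intermediate nodes are reused, all Maintainer-* fields end up in a single hasmaintainer/person structure.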

Papers data for external irds

Registration of IRs with RePEc

An OAI-based RePEc archive (henceforth: obra) is a set of data that is made available to RePEc as an archive through an OAI-PMH interface.

Registration of IRs uses the special address ftp://all.repec.org/oai

There RePEc makes available a set of files. All names of files that contain IR access data have the ending .amf.xml. Each file describes an obra. The name of the file starts with the handle of the obra, without the leading RePEc:.

Each file contains one AMF collection noun describing the archive, which nests other collection nouns describing the series. To fix ideas, let the AMF data provide information about archive RePEc:ner, belonging to NEREUS.

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
<collection id="RePEc:ner">
   <title>NEREUS</title>
   <homepage>http://www.nereus.info</homepage>
   <url>http://oai.nereus.info?verb=GetRecord&amp;identifier=RePEc:ner&amp;metadataPrefix=amf</url>
   <hasmaintainer>
    <person>
     <email>info@nereus.org</email>
    </person>
   </hasmaintainer>
   <haspart>
    <collection id="RePEc:ner:tilbir">
     <title>Tilburg RePEc IR papers</title>
     <url>http://oai.nereus.info?verb=ListRecords&amp;metadataPrefix=amf&amp;set=RePEc:ner:tilbir</url>
    </collection>
   </haspart>
</collection>
</amf>

The URL adjective (in AMF terms) of the main collection noun must be an OAI-PMH URL which, in its OAI payload, delivers the entire AMF record for the archive, including all the series.

The URL adjectives of the series point to ListRecords requests that deliver, in their OAI payloads, the records for all ids in the series.
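A harvesting agent has to read such a registration file before it can do anything else. The sketch below shows one way to extract the archive URL and the ListRecords URL of each series. The function name is hypothetical, and the code assumes that the file is well-formed XML, in particular that ampersands in URLs are written as &amp;amp;.

```python
import xml.etree.ElementTree as ET

AMF = "{http://amf.openlib.org}"  # AMF namespace in ElementTree notation


def read_registration(amf_xml):
    """Return (archive handle, archive URL, {series handle: ListRecords URL})."""
    root = ET.fromstring(amf_xml)
    archive = root.find(AMF + "collection")
    series = {}
    # Series are collection nouns nested in the archive through haspart.
    for part in archive.findall(AMF + "haspart/" + AMF + "collection"):
        series[part.get("id")] = part.findtext(AMF + "url")
    return archive.get("id"), archive.findtext(AMF + "url"), series
```

The XML parser turns &amp;amp; back into a plain ampersand, so the URLs returned can be requested as they are.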

In this allocation of series, the series do not need to be catered for by one OAI interface. In fact, series can be contained in a variety of OAI interfaces. The key is that the collection handles have to be prefix