Brisbane Protocol

by Thomas Krichel

version of 2007-07-29

0. Introduction

This is the Brisbane Protocol proposal. Thomas Krichel started work on it on 2007-07-03.

This version is obsolete. Here is the current version.

1. Motivation

This proposal aims to regulate the integration of documents furnished by institutional repositories (IRs) into RePEc.

So far, this process has been discussed with confused terminology and logic. This document seeks to clarify both.

To Thomas Krichel, the development of IRs remains politically unpleasant, because it robs academic authors of a fundamental freedom: to publish their work where they see fit.

Having said that, RePEc aims to be an inclusive project that does not take sides in political arguments. Therefore

The terms of discussion here abstract from RePEc. The protocol is meant to be generic, so that it applies to any subject-based collection that would like to include content from IRs.

2. Definitions and theory

A document has authors and publishers.

In the publication process, documents are grouped into series. Series have a perceived value, i.e. there are high-quality and low-quality series. Authors gain reputation from having their documents included in high-quality series.

Grouping documents into series is part of the publication process. Usually, only one version of a document is part of a given series. But there may be several versions, each belonging to a different series.

Publishers, as understood in this document, hold the power to include a document in a series. This is different from the broad meaning of the term, which suggests that publishers are individuals or organizations that make documents public.

An IR is not a publisher. However, an IR may act as an agent of a publisher for certain parts of its contents.

An “ird” is a document held in an institutional repository.

Every ird has a mode. There are two modes, “internal” and “external”. Usually, groups of documents share the same mode.

An ird is internal if it is the definitive version of a document in its series. An ird is external otherwise.

A couple of examples may help to illustrate this. Assume an author publishes an article in a journal. She uploads a copy of this article into the IR. This is an external ird. On the other hand, a department in an institution may publish a series of reports. It may delegate the technical infrastructure of this process to the IR. In that case, all irds that belong to the published series are internal irds.

The distinction between internal and external irds is important because of the reputation that is attached to different series. Therefore, it has to be preserved.
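The definitions above can be sketched as a small data model. This is only an illustration: the class names, field names, titles and series names are hypothetical and not part of the proposal. The one rule taken from the text is that an ird is internal exactly when it is the definitive version of the document in its series.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    INTERNAL = "internal"
    EXTERNAL = "external"


@dataclass(frozen=True)
class Ird:
    """A document held in an institutional repository (hypothetical model)."""
    title: str
    series: str
    # True when the IR copy is the definitive version of the document in its series.
    definitive_in_series: bool

    @property
    def mode(self) -> Mode:
        # Internal if definitive in its series, external otherwise.
        return Mode.INTERNAL if self.definitive_in_series else Mode.EXTERNAL


# The two examples from the text, with made-up titles and series names.
journal_copy = Ird("An article", "Some Journal", definitive_in_series=False)
department_report = Ird("A report", "Department Reports", definitive_in_series=True)
```

Here `journal_copy.mode` is `Mode.EXTERNAL` and `department_report.mode` is `Mode.INTERNAL`.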

A subject-based collection (henceforth: subac) holds a collection of metadata of interest to some group.

The provenance of documents is a key issue in a subac. A subac must have a list of sources, from which it compiles its own archives. A source is, for example, an IR. An archive is something that is in the format expected by the subac and that supports an identification scheme that the subac can work with.

If a source has not been purpose-built for the subac, some selection from the source into the subac's archive has to take place. If the source is large, harvesting the whole source to build the archive is problematic.

Therefore, sources have to be registered within the subac. If the subac registers subdivisions of a source, these also have to be registered. The way this is done depends on the subac. The section “Registration of IRs with RePEc” contains a proposal for RePEc.

General implementation

We assume that the IR implements version 2.0 of the Open Archives Initiative Protocol for Metadata Harvesting, henceforth OAI-PMH.

If a source needs to make records available to a subac, it places them in an OAI-PMH set whose name is the identifier that the collection has within the subac. If that identifier cannot be used as a set name, it is URI-encoded. Therefore, when a subac harvesting agent accesses a collection from the IR, it first uses the id of the collection as the subac knows it. If this fails to retrieve any record, it uses the URI encoding of the identifier.
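The lookup order described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the `fetch` callable and the base URL used in the usage note are hypothetical, the metadata prefix `amf` is taken from the example URLs in this document, and "URI encoding" is assumed to mean percent-encoding of every reserved character.

```python
from typing import Callable, List
from urllib.parse import quote, urlencode


def candidate_set_names(collection_id: str) -> List[str]:
    """Set names to try, in order: the plain id, then its URI encoding."""
    encoded = quote(collection_id, safe="")  # assumption: percent-encode everything reserved
    names = [collection_id]
    if encoded != collection_id:
        names.append(encoded)
    return names


def list_records_url(base_url: str, set_name: str) -> str:
    """Build an OAI-PMH ListRecords request for one set.

    urlencode() escapes the set name once more for transport in the query
    string, so an already URI-encoded set name appears double-encoded in the URL.
    """
    query = urlencode({"verb": "ListRecords", "metadataPrefix": "amf", "set": set_name})
    return f"{base_url}?{query}"


def harvest(base_url: str, collection_id: str,
            fetch: Callable[[str], List[str]]) -> List[str]:
    """Try each candidate set name until one returns records.

    `fetch` is a hypothetical callable that performs the request and returns
    the harvested records (an empty list if there are none).
    """
    for name in candidate_set_names(collection_id):
        records = fetch(list_records_url(base_url, name))
        if records:
            return records
    return []
```

For the collection id `RePEc:ner:tilbir`, the agent would first ask for the set `RePEc:ner:tilbir` and then, if no record comes back, for the set `RePEc%3Aner%3Atilbir`. Passing the network access in as `fetch` keeps the fallback logic testable without a live repository.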

All metadata uses the AMF format.

OAI identifiers, as used in OAI-PMH requests and responses, must be identical to the AMF record identifier, if such an identifier is provided.

Internal irds

Internal irds are made available using conventional AMF notation.

Every record in a set must have an identifier that starts with the set identifier. Thus, a record for an internal ird can only be in one set.
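A harvesting agent could check this prefix rule along the following lines. The sketch makes two assumptions that go beyond the text: identifiers are colon-separated, as RePEc handles are, so the match is made on the set identifier followed by a colon, and when an archive and one of its series both match, the longer identifier is taken as the set. The function name and the sample record identifier are hypothetical.

```python
from typing import List, Optional


def owning_set(record_id: str, set_ids: List[str]) -> Optional[str]:
    """Return the set a record belongs to, or None if no set id is a prefix.

    Matching on set_id + ":" (an assumption) avoids treating
    "RePEc:ner:tilbirx:1" as a member of "RePEc:ner:tilbir".
    """
    matches = [s for s in set_ids if record_id.startswith(s + ":")]
    # If nested ids match (archive and series), prefer the most specific one.
    return max(matches, key=len, default=None)


sets = ["RePEc:ner", "RePEc:ner:tilbir"]
record = "RePEc:ner:tilbir:12345"  # made-up record identifier
series = owning_set(record, sets)
```

With these inputs, `series` is `"RePEc:ner:tilbir"`. A record whose identifier matches no registered set yields `None` and could be rejected by the agent.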

Registration of IRs with RePEc

Registration of IRs uses the special address ftp://all.repec.org/oai

There, RePEc makes available a set of files. All names of files that contain IR access data have the ending .amf.xml. Each file describes one RePEc archive provided by a collection of IRs. The name of each file starts with the handle of the archive, without the leading RePEc:.

Each file contains one AMF collection noun describing the archive, which nests other collection nouns describing the series. To fix ideas, let the AMF data provide information about archive RePEc:ner, belonging to NEREUS.
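As an illustration of the naming rule, the following sketch derives a file name from an archive handle. It assumes that the name consists of nothing but the handle (without the leading RePEc:) followed by the ending; the text only requires that the name starts with the handle and ends in .amf.xml. The function name is hypothetical.

```python
def registration_file_name(archive_handle: str) -> str:
    """Map an archive handle such as "RePEc:ner" to a registration file name."""
    prefix = "RePEc:"
    if not archive_handle.startswith(prefix):
        raise ValueError(f"not a RePEc archive handle: {archive_handle!r}")
    # Assumption: nothing sits between the bare handle and the ending.
    return archive_handle[len(prefix):] + ".amf.xml"
```

Under this assumption, the archive RePEc:ner would be described in a file named ner.amf.xml.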

<amf xmlns="http://amf.openlib.org"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd">
<collection id="RePEc:ner">
   <title>NEREUS</title>
   <homepage>http://www.nereus.info</homepage>
   <url>http://oai.nereus.info?verb=GetRecord&amp;identifier=RePEc:ner&amp;metadataPrefix=amf</url>
   <hasmaintainer>
    <person>
     <email>info@nereus.org</email>
    </person>
   </hasmaintainer>
   <haspart>
    <collection id="RePEc:ner:tilbir">
     <title>Tilburg RePEc IR papers</title>
     <url>http://oai.nereus.info?verb=ListRecords&amp;metadataPrefix=amf&amp;set=RePEc:ner:tilbir</url>
    </collection>
   </haspart>
</collection>
</amf>

The URL adjective (in AMF terms) of the main collection noun must be an OAI-PMH URL which, in its OAI payload, delivers the entire AMF record for the archive, including all the series.

The URL adjectives of the series point to ListRecords requests that deliver, in their OAI payloads, the records for all ids in the series.
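A harvesting agent could read such a registration file roughly as follows. This is a sketch, not a reference implementation: the function name is hypothetical, and the embedded sample is a trimmed, well-formed variant of the NEREUS example, with ampersands escaped as XML requires and with the OAI-PMH argument names `identifier` and `metadataPrefix`.

```python
import xml.etree.ElementTree as ET
from typing import Dict

AMF_NS = {"amf": "http://amf.openlib.org"}

# Trimmed variant of the NEREUS example, made well-formed for parsing.
REGISTRATION = """<amf xmlns="http://amf.openlib.org">
<collection id="RePEc:ner">
   <title>NEREUS</title>
   <url>http://oai.nereus.info?verb=GetRecord&amp;identifier=RePEc:ner&amp;metadataPrefix=amf</url>
   <haspart>
    <collection id="RePEc:ner:tilbir">
     <title>Tilburg RePEc IR papers</title>
     <url>http://oai.nereus.info?verb=ListRecords&amp;metadataPrefix=amf&amp;set=RePEc:ner:tilbir</url>
    </collection>
   </haspart>
</collection>
</amf>"""


def series_urls(amf_xml: str) -> Dict[str, str]:
    """Map each series id in a registration file to its ListRecords URL."""
    root = ET.fromstring(amf_xml)
    archive = root.find("amf:collection", AMF_NS)
    if archive is None:
        return {}
    urls = {}
    # Series are collection nouns nested under the archive's haspart verb.
    for series in archive.findall("amf:haspart/amf:collection", AMF_NS):
        url = series.findtext("amf:url", default="", namespaces=AMF_NS).strip()
        if url:
            urls[series.get("id")] = url
    return urls


harvest_targets = series_urls(REGISTRATION)
```

For the sample above, `harvest_targets` maps `RePEc:ner:tilbir` to its ListRecords URL, with the escaped ampersands decoded by the XML parser.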