RePEc, an Open Library for Economics

2000-03-24

Thomas Krichel
Palmer School of Library and Information Science
Long Island University
720 Northern Boulevard
Brookville, New York 11548-1300
USA
Thomas.Krichel@liu.edu

Abstract

After arXiv.org, the RePEc economics library offers the second-largest source of freely downloadable scientific preprints in the world. RePEc has a different business model and a different content coverage than arXiv.org. This paper addresses both differences.

As far as the business model is concerned, RePEc is an instance of a concept that I call the "Open Library". An Open Library is open in two ways. It is open for contribution (third parties can add to it), and it is open for implementation (many user services may be created). Conventional libraries--including most digital libraries--are closed in both directions.

As far as the content coverage is concerned, RePEc seeks to build a relational dataset about scholarly resources and other content relating to these resources. This basically means the identification of all authors, all papers and all institutions that work in economics. Such an ambitious project can only be achieved if the cost to collect data is decentralized and low, and if the benefits to supply data are large. The Open Library provides a framework where these conditions are fulfilled.

This paper is available in PDF for A4 paper, and for letter size paper.

1: Introduction

Digital or digitisable data is supplied by publishers, to be consumed by readers. Reports of research results in research "papers" form the bulk of academic digital or digitisable data, and I will refer to these as documents in what follows.

In this chapter I am not concerned with the demand for documents, nor am I concerned with the supply of documents. Instead, I focus on the supply of information about documents. For some documents, holding detailed information about the document is as good as holding the document itself. This is typically the case when the document can be accessed on the Internet without any access restriction. Such a document will be called a public access document. Collecting data about documents is therefore particularly relevant for public access documents.

The main idea that is brought forward in this paper is the "Open Library". Basically, an open library is a collaborative framework for the supply and usage of document data. Stated in this way the idea of the open library is quite trivial. To fully appreciate the concept, it is useful to study one open library in more detail. My example is the RePEc dataset about Economics. In Section 2, I introduce RePEc as a document data collection. In Section 3, I push the RePEc idea further. I discuss the extension of RePEc that allows one to describe the discipline, rather than simply the documents that are produced by the members of the discipline. In Section 4, I make an attempt to define the open library more precisely. The example of RePEc demonstrates the relevance of the open library concept. I conclude the paper in Section 5.

The efforts of which RePEc is the result go back to 1992. I deliberately stayed away from a description of the history of the work to concentrate on the current status. Therefore, insufficient attribution is given to the people who have earned historic merits by contributing to the RePEc effort. See Krichel (1997) for an account of the early history of the NetEc projects. These can be regarded as precursors of RePEc.

2: The RePEc document dataset

2.1: Origin and motivation of RePEc

A scholarly communication system brings together producers and consumers of documents. For the majority of the documents, the producers do not receive a monetary reward. Their effort is compensated through a wide circulation of the document and a mark of peer approval for it. Dissemination and peer approval are the key functions of scholarly communication.

Scholarly communication in economics has largely been journal-based. Peer review plays a crucial role. Thorough peer review is expensive in time. According to Trivedi (1993), it is common that a paper takes over three years from submission to publication in an academic journal, not counting rejections. From informal evidence, slowly rising publication delays have stabilized in the past few years as journal editors have fought hard to cut down on what have been perceived to be intolerable delays.

Researchers at the cutting edge cannot rely solely on journals to keep abreast of the frontiers of research. Prepublication through discussion papers or conference proceedings is now commonplace. Access to this informally disseminated research is often limited to a small number of readers. It relies on the good will of active researchers to disseminate their work. Since good will is in short supply, insider circles are common.

This time gap between informal distribution and formal publication can only fundamentally be resolved by reforming the quality control process. The inconvenience resulting from the delay can however be reduced by improving the efficiency of the informal communication system. This is the initial motivation behind the RePEc project. Its traditional emphasis has been on documents that have not gone through peer review channels. Thus RePEc is essentially a scholarly dissemination system on the Internet. It is independent of the quality review process.

2.2: Towards an Internet-based scholarly dissemination system

The Internet is a cost-effective means for scholarly dissemination. Many economics researchers and their institutions have established web sites. However, they are not alone in offering pages on the Web. The Web has grown to an extent that the standard Internet search engines only cover a fraction of the Web, and that fraction is decreasing over time (Lawrence and Giles 1999). Since much economics research uses common terms such as "growth", "investment" or "money", it is likely that a subject search on the entire Web would yield an enormous number of hits. There would be no practical way to find which pages contain economics research. Due to this low signal-to-noise ratio, the Web per se does not provide an efficient mechanism for scholarly dissemination. An additional classifying scheme is required to segregate references to materials of interest to the economics profession.

The most important type of material relevant to scholarly dissemination is the research paper. One way to organize this type of material has been demonstrated by the arXiv.org preprint archive, founded in 1991 by Paul Ginsparg of the Los Alamos National Laboratory with an initial subject area in high energy physics. Authors use that archive to upload papers, which remain stored there. ArXiv.org has now assembled over 150,000 papers, covering a broad subject range of mathematics, physics and computer science, but concentrating on the original subject area. An attempt has been made to emulate the arXiv.org system in economics with the "Economics Working Paper Archive" (EconWPA) based at Washington University in St. Louis. Its success has been limited. There are a number of potential reasons:

  • Economists do not issue preprints as individuals; rather, economics departments and research organizations issue working papers.
  • Economists use a wider variety of document formatting tools than physicists. This reduces the functionality of online archiving and makes it more difficult to construct a good archive.
  • Generally, economists are not known for sophisticated practices in computer literacy and as such, they are more likely to encounter significant problems with uploading procedures.
  • There is considerable confusion as to the implications of networked pre-publication on a centralized, high-visibility system for subsequent publication in journals.
  • Economics research is not confined to university departments and research institutes. There are a number of government bodies--central banks, statistical institutes, and others--that contribute a significant amount of research in the field. These bodies, by virtue of their size, have more rigid organizational structures. This makes the coordination required for central research paper dissemination more difficult.

    An ideal system should combine the decentralized nature of the Web, the centralized nature of the arXiv.org archive, and a zero price to end users. I discuss these three requirements in turn.

    The system must have a decentralized storage of documents. To illustrate, let us consider the alternative scenario. This would be one where all documents within a certain scope, say within a discipline, would be held on one centralized system. Such a system would not be ideal for two reasons. First, those authors who are rejected by that system would have no alternative publication venue. Since economics is a contested discipline, this is not ideal. Second, the storage and description of documents is costly. The centralized system may levy a charge on contributors to cover its cost. However, since it enjoys a monopoly, it is likely to use this position to extract a rent from authors. This would not be ideal.

    The centralized nature of the arXiv.org system is the ability to have a one-stop-shop where all the papers relevant to economics are accessible. Again, to show that this is a requirement for an ideal dissemination system, imagine a situation where that would not be the case. In that case, the search for documents would be difficult and therefore the distribution of documents not optimal.

    To explain why the end-user access to the dissemination system should be free, it is useful to refer to the distinction between trade authors and esoteric authors, as done by Harnad (1995). Authors of academic documents are esoteric authors rather than trade authors. They do not expect payments for the written work; instead, they are chiefly interested in reaching an audience of other esoteric authors and, to a lesser extent, the public at large. Therefore the authors are interested in wide dissemination. If a tollgate to the dissemination system is set up, then the system falls short of an ideal one.

    Having established the three criteria for an ideal system, let me turn to the problem of implementing it. The first and third objectives could be accomplished if departments and research centers allow for public access to their documents on the Internet. But for the second, we need a library to hold an organized catalog. The library would collect what is known as "metadata": data about documents that are available using Internet protocols. There is no incentive for any single institution to bear the cost of establishing a comprehensive metadata collection, without external subsidy. However, since every institution will benefit from participation in such an effort, we may solve this incentive problem by creating a virtual collection via a network of linked metadata archives. This network is open in the sense that persons and organizations can join by contributing data about their work. It is also open in the sense that user services can be created from it. This double openness promotes a positive feedback effect. The larger the collection's usage, the more effective it is as a dissemination tool, and thus more authors and their institutions join as participation is open. The larger the collection, the more useful it becomes for researchers. This leads to more usage.

    Bringing a system to such a scale is a difficult challenge. Man is an animal of habit. Scholarly communication systems have evolved over time. Academic careers are directly dependent on the results of scholarly communication. Therefore, change in this area is slow because it involves important aspects of the lives of those who are the potential implementors of the change. A scholarly dissemination system on the Internet is more likely to succeed if it enhances current practice rather than replacing it. The distribution of informal research papers in the past has been based on institutions issuing working papers. These are circulated through exchange arrangements. RePEc is a way to organize this process on the Internet.

    2.3: The architecture of RePEc

    RePEc can be understood as a decentralized academic publishing system for the economics discipline. RePEc allows researchers' departments and research institutes to participate in a decentralized archival scheme which makes information about the documents that they publish accessible via the Internet. Individual researchers may also openly contribute, but they are encouraged to use EconWPA.

    Each contributor needs to maintain their own collection of data using a set of standardized templates. Such a collection of templates is called an "archive". An archive operates on an anonymous ftp server or a Web server controlled by the archive provider. Each archive provider has total control over the contents of its archive. There is no need to transmit documents elsewhere. The archive management retains the liberty to post revisions or to withdraw a document.

    2.3.1: An example archive

    Let us look at an example. The archive of the OECD is at http://www.oecd.org/eco/RePEc/oed/. In that directory we find two files. The first is oedarch.rdf:

    Template-Type: ReDIF-Archive 1.0
    Handle: RePEc:oed
    Name: OECD Economics Department
    Maintainer-Email: eco.contact@oecd.org
    Description: The working papers of the 
     Economics Department of the OECD
    URL: http://www.oecd.org/eco/RePEc/oed
    

    This file gives basic characteristics about the archive. It associates a handle with it, gives an email address for the maintainer, and most importantly, provides the URL where the archive is located. This archive file gives no indication about the contents of the archive. The contents list is in a second file, oedseri.rdf:

    Template-type: ReDIF-Series 1.0
    Name: OECD Economics Department working
      papers
    Type: ReDIF-Paper
    Provider-Name: OECD Economics Department
    Provider-Homepage: 
     http://www.oecd.org/eco/eco/
    Maintainer-Email: eco.contact@oecd.org
    Handle: RePEc:oed:oecdec
    

    This file lists the content as a series of papers. It associates some provider and maintainer data with the series, and it associates a handle with the series. The format that both files follow is called ReDIF. It is a purpose-built metadata format. Appendix 2 discusses technical aspects of the ReDIF metadata format that is used by RePEc. See Krichel (2000) for the complete documentation of ReDIF.
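
    The two files above share a simple layout: one field per line, with indented lines continuing the previous value. As a minimal sketch, and not the official ReDIF parser, the following Python function reads such a template into a list of (field, value) pairs. The function name parse_redif and the rule that any line beginning with whitespace is a continuation are assumptions made for illustration; Krichel (2000) remains the authoritative description of the format.

```python
def parse_redif(text):
    """Parse one ReDIF template into a list of (field, value) pairs.

    Assumed simplification: a field starts at the beginning of a line
    as "Name: value"; a line that begins with whitespace continues the
    value of the previous field.
    """
    fields = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        if raw[0].isspace() and fields:
            # continuation line: join it to the previous value
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + raw.strip()).strip())
        else:
            # split at the first colon only, so URLs and handles survive
            name, _, value = raw.partition(":")
            fields.append((name.strip(), value.strip()))
    return fields


archive = """Template-Type: ReDIF-Archive 1.0
Handle: RePEc:oed
Name: OECD Economics Department
Maintainer-Email: eco.contact@oecd.org
Description: The working papers of the
 Economics Department of the OECD
URL: http://www.oecd.org/eco/RePEc/oed
"""

# An archive template has no repeated fields, so a dict is convenient here.
record = dict(parse_redif(archive))
print(record["Handle"])
print(record["Description"])
```

    The function returns a list of pairs rather than a dictionary because some fields, such as Author-Name in a paper template, may occur more than once and their order matters.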

    The documents themselves are also described in ReDIF. The location of the paper descriptions is found by appending the last component of the series handle to the URL of the archive, i.e. at http://www.oecd.org/eco/RePEc/oed/oecdec. This directory contains ReDIF descriptions of documents. It may also contain the full text of documents. It is up to the archive to decide whether to store the full text of documents inside or outside the archive. If the document is available online--inside or outside the archive--a link may be provided to the place where the paper may be downloaded. Note that the document need not be the full text of an academic paper; it may also be an ancillary file, e.g. a dataset or a computer program.
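
    The way the series directory is located can be sketched in a few lines of Python. The function name series_directory is hypothetical, and generalizing from the single OECD example to all archives is an assumption of this sketch.

```python
def series_directory(archive_url, series_handle):
    """Return the directory URL that holds a series' document templates.

    Follows the OECD example in the text: the last component of the
    series handle is appended to the archive URL. Treating this as a
    general rule is an assumption.
    """
    series_code = series_handle.split(":")[-1]
    return archive_url.rstrip("/") + "/" + series_code


print(series_directory("http://www.oecd.org/eco/RePEc/oed", "RePEc:oed:oecdec"))
```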

    Participation does not imply that the documents are freely available. Thus, a number of journals have also permitted their contents to be listed in RePEc. If the user's institution has made the requisite arrangements with publishers (e.g. JSTOR for back issues of Econometrica or Journal of Applied Econometrics), RePEc will contain links to directly access the documents.

    2.3.2: Using the data on archives

    One way to make use of the data would be to have a web page that lists all the available archives, and allow users to navigate through the archives in search of documents that they may be interested in. However, that would be quite a primitive way to access the data. First, the data as shown in the ReDIF form is not itself hyperlinked. Second, there is no search facility, no filtering of contents, etc.

    Providing convenient access for users is the concern not of the archives but of the user services. User services render the RePEc data in a form that makes it convenient for a user. User services are operated by members of the RePEc community, libraries, research projects etc. Each service has its own name. There is no "official" RePEc user service. A list of services at the time of writing may be found in Appendix 1.

    User services are free to use RePEc data in whatever way they see fit, as long as they observe the copyright statement for RePEc. This statement places some constraints on the usage of RePEc data:

    
    You are free to do whatever you want with 
    this data collected on the archives that
    are described here,  provided that you
    
    (a) Don't charge for it or include it in
        a service or product that is not free
        of charge.
    
    (b) When displaying the contents of a 
        template (or part of a template) the 
        following fields must be shown if
        they are present in the template: 
        Title, Author-Name, File-Restriction
         and Copyright (if present).
    
    (c) You must contribute to RePEc by 
        maintaining an archive that actively 
        contributes material to RePEc.
    
    (d) You do not contravene any copyright
        statement found in any of the 
        participating archives.
    
    

    Within the constraints of that copyright statement, user services are free to provide all, or only a subset of, the RePEc data. For example, one service may only show papers that are available electronically, another may restrict the choice to act as a quality filter. In this way services implement constraints on the data, whether they be availability constraints or quality constraints.
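
    To make the idea of a service-side constraint concrete, here is a small Python sketch of an availability filter combined with a display routine that honours clause (b) of the statement. It assumes that each template has already been read into a list of (field, value) pairs; the names is_downloadable and render, and the two sample records, are invented for illustration.

```python
# Fields that clause (b) of the statement requires to be shown if present.
REQUIRED_FIELDS = ("Title", "Author-Name", "File-Restriction", "Copyright")


def is_downloadable(template):
    """Availability filter: keep templates that carry a File-URL field."""
    return any(name == "File-URL" for name, _ in template)


def render(template, extra_fields=()):
    """Render a template as text lines.

    The required fields are always shown when present; a service may
    choose additional fields to display.
    """
    wanted = set(REQUIRED_FIELDS) | set(extra_fields)
    return [f"{name}: {value}" for name, value in template if name in wanted]


# Two made-up records, for illustration only.
templates = [
    [("Title", "A hypothetical paper"), ("Author-Name", "A. Author"),
     ("File-URL", "http://example.org/paper.pdf")],
    [("Title", "Another hypothetical paper"), ("Author-Name", "B. Author")],
]

online_only = [t for t in templates if is_downloadable(t)]
for template in online_only:
    print("\n".join(render(template)))
```

    A quality filter would have the same shape: only the predicate passed to the list comprehension changes.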

    The RePEc data may not be sold or incorporated into a product that is sold. Therefore all RePEc services are free. User services compete through quality rather than price. All RePEc archives benefit from simultaneous inclusion in all services. This leads to a breadth of dissemination that a proprietary system cannot afford.

    2.3.3: Building user services

    The provision of a user service usually starts with putting frequently updated copies of RePEc archives on a single computer system. This maintenance of a frequently updated copy of archives is called "mirroring". Everything contained in an archive may be mirrored. For example, if a document is in the archive, it may be mirrored. If the archive management does not wish the document to be mirrored, it can store it outside the archive. The advantage of this remote storage is that the archive maintainer will get a complete set of access logs to the file. The disadvantage is that every request for the file will have to be served from the local archive rather than from the RePEc site that the user is accessing.

    An obvious way to organize the mirroring process overall would be to mirror the data of all archives to a central location. This central location would in turn be mirrored to the other RePEc sites. The founders of RePEc did not adopt that solution because it would be quite vulnerable to mistakes at the central site. Instead each site installs the mirroring software and mirrors "on its own", so to speak. Not all of them adopt the same frequency of updating. Some may update daily, while some may only update weekly. One disadvantage of this system is that it is not known how long it takes for a new item to be propagated through the system.
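
    The independent mirroring described above can be illustrated with a toy model in Python. Dictionaries mapping a file path to its content stand in for an archive and for each site's local copy; the function name mirror, the file names and the removal of withdrawn files are assumptions of this sketch, not a description of the actual mirroring software.

```python
def mirror(archive, local_copy):
    """Bring local_copy in line with archive; return the paths that changed.

    Both arguments map a file path to its text content.
    """
    changed = []
    for path, content in archive.items():
        if local_copy.get(path) != content:
            local_copy[path] = content  # new or revised file
            changed.append(path)
    for path in list(local_copy):
        if path not in archive:  # withdrawn by the archive maintainer
            del local_copy[path]
            changed.append(path)
    return sorted(changed)


archive = {"oedarch.rdf": "v1", "oedseri.rdf": "v1"}
daily_site, weekly_site = {}, {}

# Both sites start with a full copy.
mirror(archive, daily_site)
mirror(archive, weekly_site)

# A new item appears in the archive (hypothetical file name).
archive["oecdec/paper1.rdf"] = "v1"

# The daily site runs again and picks it up; the weekly site has not run yet.
print(mirror(archive, daily_site))
print("oecdec/paper1.rdf" in weekly_site)
```

    Because each site calls the routine on its own schedule, the time at which a new item becomes visible differs from site to site, which is the propagation uncertainty mentioned above.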

    2.4: The documents available through RePEc

    Over 160 archives in 25 countries currently participate in RePEc, some of them representing several institutions. Over 100 universities contribute their working papers, including U.S. institutions such as Berkeley, Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State, UCLA, and Virginia. The RePEc collection also contains information on all NBER Working Papers, the CEPR Discussion Papers, the contents of the Fed in Print database of the US Federal Reserve, and complete paper series from the IMF, World Bank and OECD, as well as the contributions of many other research centers worldwide. Last, but not least, RePEc also includes the holdings of EconWPA. In total, at the time of writing in March 2001, over 37,000 items are downloadable.

    The bibliographic templates describing each item currently provide for papers, articles, and software components. The article templates are used to fully describe published articles. They are currently in use by the Canadian Journal of Economics, Econometrica, the Federal Reserve Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and the RAND Journal of Economics. These are only a few of the participating journals. Participation does not imply that the articles are freely available.

    The RePEc collection of metadata also contains links to several hundred "software components"--functions, procedures, or code fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and RATS languages, as well as code in FORTRAN, C and Perl. The ability to catalog and describe software components affords users of these languages the ability to search for code applicable to their problem--even if it is written in a different language. Software archives that are restricted to one language, such as those maintained by individual software vendors or volunteers, do not share that breadth. Since many programs in high-level languages may be readily translated from, say, GAUSS to MATLAB, this breadth may be very welcome to the user.

    3: The ReDIF metadata

    From the material that we have covered in the previous section, we can draw a simple model of RePEc as

    Many archives ===> One dataset ===> Many services

    The term "RePEc" was initially an acronym; it stands for Research Papers in Economics. In fact the term should now be treated as a literal, because RePEc is about more than the description of resources. It is probably best to say that RePEc is a relational database about economics as a discipline.

    One possible interpretation of the term "discipline" is given by Karlsson and Krichel (1999). They have come up with a model of the discipline, as consisting essentially of four elements arranged in a table:

    resource    collection
    person      institution

    A few words may help to understand that table. A "resource" is essentially any output of academic activity: a research document, a dataset, a computer program, or anything else that an academic person would claim authorship for. A "collection" is a logical grouping of resources. For example the act of peer review may be represented by a resource being included in a collection. A "person" is a physical person or a corporate body who acts as a physical person in the context of RePEc.

    These data collectively form a relational database that not only describes papers, but also the authors who write them, the institutions where the authors work, and so on. All this data is encoded in the ReDIF metadata format. I illustrate this in Subsection 3.2 and Subsection 3.3 for the institutional and the personal data, respectively.

    3.1: A closer look at the contents

    To understand the basics of ReDIF it is best to start with an example. Here is a--carefully selected--piece of ReDIF data at ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.rdf: (I suppress the Abstract: field to conserve space.)

    Template-Type: ReDIF-Paper 1.0
    Title: Dynamic Aspect of Growth and Fiscal 
     Policy
    Author-Name: Thomas Krichel 
    Author-Person: 
     RePEc:per:1965-06-05:thomas_krichel
    Author-Email: T.Krichel@surrey.ac.uk 
    Author-Name: Paul Levine 
    Author-Email: P.Levine@surrey.ac.uk 
    Author-WorkPlace-Name: University of Surrey
    Classification-JEL: C61; E21; E23; E62; O41 
    File-URL: ftp://www.econ.surrey.ac.uk/pub/
     RePEc/sur/surrec/surrec9601.pdf 
    File-Format: application/pdf
    Creation-Date: 199603 
    Revision-Date: 199711 
    Handle: RePEc:sur:surrec:9601
    
    When we look at this record, the ReDIF data resembles a standard bibliographical format, with authors, title etc. The only thing that appears a bit mysterious here is the "Author-Person" field. This field quotes a handle that is known to RePEc. This handle leads to a record maintained at ftp://netec.mcc.ac.uk/pub/RePEc/per/pers/RePEc_per_1965-06-05_THOMAS_KRICHEL.rdf: (I leave out a few fields to conserve space.)

    Template-Type: ReDIF-Person 1.0 
    Name-Full: KRICHEL, THOMAS 
    Name-First: THOMAS 
    Name-Last: KRICHEL 
    Postal: 1 Martyr Court
     10 Martyr Road 
     Guildford GU1 4LF 
     England
    Email: t.krichel@surrey.ac.uk
    Homepage: http://openlib.org/home/krichel
    Workplace-Institution: RePEc:edi:desuruk
    Author-Paper: RePEc:sur:surrec:9801
    Author-Paper: RePEc:sur:surrec:9702
    Author-Paper: RePEc:sur:surrec:9601
    Author-Paper: RePEc:rpc:rdfdoc:concepts
    Author-Paper: RePEc:rpc:rdfdoc:ReDIF
    Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL
    

    In this record, we have the handles of documents that the person has written. This record allows user services to list all the papers by a given author. This is obviously useful when we want to find papers that one particular author has written. It is also useful to have a central record of the person's contact details. This eliminates the need to update the relevant data elements on every document record. In fact the record on the paper template may be considered as the historical record that was valid at the time when the paper was written, whereas the address in the person template is the one that is currently valid.

    In the person template, we find another RePEc identifier in the "Workplace-Institution" field. This points to another record at ftp://crefe.dse.uqam.ca/pub/RePEc/edi/inst/desuruk.rdf that describes the institution:

    Template-Type: ReDIF-Institution 1.0 
    Primary-Name: University of Surrey
    Primary-Location: Guildford
    Secondary-Name: Department of Economics
    Secondary-Phone: (01483) 259380
    Secondary-Email: economics@surrey.ac.uk
    Secondary-Fax: (01483) 259548
    Secondary-Postal: Guildford, Surrey GU2 5XH
    Secondary-Homepage: 
     http://www.econ.surrey.ac.uk/
    Handle: RePEc:edi:desuruk
    

    It would take us too far here to discuss this record in more detail. It is probably more interesting to know where these records come from.
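
    The three records are tied together only by their handles, which is what makes the dataset relational. The following Python sketch follows the links from a paper to its registered author and on to the author's workplace. The records are cut down to the fields needed here, and the helper names lookup and author_affiliations are hypothetical. Note that the paper and the person record above spell the person handle with different capitalization; treating handles as case-insensitive is an assumption of this sketch.

```python
# Each record is reduced to the few fields needed for the illustration.
records = [
    {"Template-Type": "ReDIF-Paper 1.0",
     "Handle": "RePEc:sur:surrec:9601",
     "Author-Person": ["RePEc:per:1965-06-05:thomas_krichel"]},
    {"Template-Type": "ReDIF-Person 1.0",
     "Handle": "RePEc:per:1965-06-05:THOMAS_KRICHEL",
     "Name-Full": "KRICHEL, THOMAS",
     "Workplace-Institution": "RePEc:edi:desuruk"},
    {"Template-Type": "ReDIF-Institution 1.0",
     "Handle": "RePEc:edi:desuruk",
     "Primary-Name": "University of Surrey",
     "Secondary-Name": "Department of Economics"},
]

# Index by lower-cased handle (assumption: handles are case-insensitive).
by_handle = {r["Handle"].lower(): r for r in records}


def lookup(handle):
    """Return the record for a handle, or None if it is unknown."""
    return by_handle.get(handle.lower())


def author_affiliations(paper_handle):
    """Follow paper -> person -> institution; return (name, workplace) pairs."""
    paper = lookup(paper_handle)
    if paper is None:
        return []
    result = []
    for person_handle in paper.get("Author-Person", []):
        person = lookup(person_handle)
        if person is None:
            continue  # author not registered
        institution = lookup(person.get("Workplace-Institution", ""))
        workplace = institution["Primary-Name"] if institution else None
        result.append((person["Name-Full"], workplace))
    return result


print(author_affiliations("RePEc:sur:surrec:9601"))
```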

    3.2: Institutional registration

    The registration of institutions is accomplished through the EDIRC project. The acronym stands for "Economics Departments, Institutions and Research Centers". This dataset has been compiled by Christian Zimmermann, an Associate Professor of Economics at Université du Québec à Montréal, on his own account, as a public service to the economics profession. The initial intention was to compile a directory of all economics departments that have a web presence. Since many departments now have a web presence, a large number are registered, about 5,000 at the time of this writing. All these records are included in RePEc. For all institutions, data on their homepage is available, as well as postal and telephone information. For some, there is even data on their main area of work. Thus it is possible to find a list of institutions where--for example--a lot of work in labor economics is being done. At the moment, EDIRC is mainly linked to the rest of the RePEc data through the HoPEc personal registration service. Other links are possible, but are rarely used.

    3.3: Personal registration

    HoPEc has a different organization from EDIRC. It is impossible for a single academic to register all persons who are active in economics. One possible approach would be to ask archives to register people who work at their institution. This will make archive maintainers' work more complicated, but the overall maintenance effort will be smaller once all authors are registered. However, authors move between archives, and many have work that appears in different archives. To date, there is no satisfactory way to deal with moving authors. For this reason, the author registration is carried out using a centralized system.

    A person who is registered with HoPEc is identified by a string that is usually close to the person's name and by a date that is significant to the registrant. HoPEc suggests the birth date but any other date will do as long as the person can remember it. When registrants work with the service, they first supply some personal information. The data that is requested is mainly the name, the URL of the registrant's homepage, and the email address. Registrants are free to enter data about their academic interests--using the Journal of Economic Literature Classification Scheme--and the EDIRC handle of their primary affiliation.
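
    As an illustration of how a name and a date can be combined into an identifier, the Python sketch below builds a string in the style of the person handle shown in Subsection 3.1. The pattern (upper-cased name joined by underscores, date in year-month-day form) is inferred from that single example and is an assumption, not a documented HoPEc rule; the function name person_handle is hypothetical.

```python
from datetime import date


def person_handle(name, significant_date):
    """Build a handle in the style of RePEc:per:1965-06-05:THOMAS_KRICHEL.

    The pattern is inferred from one example in the text and is an
    assumption of this sketch.
    """
    name_part = "_".join(name.upper().split())
    return f"RePEc:per:{significant_date.isoformat()}:{name_part}"


print(person_handle("Thomas Krichel", date(1965, 6, 5)))
```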

    When the registrant has entered this data, the second step is to create associations between the record of the registrant and the document data that is contained in RePEc. The most common association is the authorship of a paper. However, other associations are possible, for example the editorship of a series. The registration service then looks up the name of the registrant in the RePEc document database. The registrant can then decide which potential associations are relevant. The authentication methods are weak. HoPEc relies on honesty.
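
    The lookup step can be sketched as a simple name match over the document records. How HoPEc actually matches names is not described here, so the whole-name, case-insensitive comparison below is an assumption, as are the function name candidate_papers and the second, made-up record.

```python
def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def candidate_papers(registrant_name, papers):
    """Return handles of papers with an author name matching the registrant.

    The result is only a list of suggestions; the registrant decides
    which of them are relevant.
    """
    wanted = normalize(registrant_name)
    hits = []
    for paper in papers:
        if wanted in [normalize(n) for n in paper["Author-Name"]]:
            hits.append(paper["Handle"])
    return hits


papers = [
    {"Handle": "RePEc:sur:surrec:9601",
     "Author-Name": ["Thomas Krichel", "Paul Levine"]},
    {"Handle": "RePEc:xxx:series:0001",  # made-up record
     "Author-Name": ["A. N. Other"]},
]

suggested = candidate_papers("thomas krichel", papers)
print(suggested)
```

    Nothing in this step verifies that the registrant really is the author, which mirrors the weak authentication noted above.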

    There are several significant problems that a service like HoPEc faces. First, since there is no historical precedent for such a service, it is not easy to communicate the raison d'être of the service to a potential registrant. Second, some people think that they need to register in order to use RePEc services. While this delivers valuable information about who is interested in using RePEc services--or more precisely who is too dumb to grasp that these services do not require registration--it clutters the database with records of limited usefulness. Last but by no means least, there are all kinds of privacy issues involved in the composition of such a dataset. For example, Sune Karlsson has informed me that setting up a database such as HoPEc would be illegal in Sweden.

    To summarize, HoPEc provides information about persons' identity, affiliation and research interests and links these data with resource descriptions in RePEc. This makes it possible to identify persons and update their metadata in a timely and cost-efficient way. These data could also fruitfully be employed for other purposes, such as maintaining membership data for scholarly societies or lists of conference participants. It is hoped that the HoPEc data will be used as a shared pool of common personal data.

    4: The open library

    This section of the chapter is somewhat more theoretical. It sets out a body of thought that is built on the experience of RePEc. It is an attempt to find a general theory that could apply in a wide set of circumstances in which similar systems are desirable. I call this general concept the "open library". The parallel to the "open source" concept is intentional. It is therefore useful to review the open source concept first.

    4.1: The open source concept

    There is no official and formal definition of what the term "open source" means. The Open Source Initiative offers an elegant introduction to the idea:

    The basic idea behind open source is very simple. When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs. And this can happen at a speed that, if one is used to the slow pace of conventional software development, seems astonishing.

    We in the open source community have learned that this rapid evolutionary process produces better software than the traditional closed model, in which only a very few programmers can see the source and everybody else must blindly use an opaque block of bits.

    Open source software imposes no restrictions on the distribution of its source code. The source code is the code that is required to build a running version of the software. As long as users have no access to the source code, they may be able to use a running version of the software, but they cannot change the way that the software behaves. Changing that behavior involves changing the source code and rebuilding the running version of the software from it. Since building the software from the source code is quite straightforward, software that has freely available source code is essentially free.

    4.2: Open Source and open library

    The open source movement claims that building software in an open, collaborative way--enabled by the sharing of source code--allows software to be built better and faster. The open library concept is an attempt to apply the concept of open source to a library setting. We start off with the RePEc experience.

    Within the confines of RePEc as a document collection, it is unrealistic to expect a free distribution of document source code. Such source code is, for example, the word processor file of an academic paper. If such source code were available for others to change, then the ownership of the intellectual property in the document would be dissolved. Since intellectual property over scientific ideas is crucial in the academic reward system, it is unlikely that such source code distribution will take place. Within the confines of RePEc's institutional and personal collections, there is no such source code that could be freely shared.

    To apply the open source principle to RePEc we must conceptualize RePEc as a collection of metadata. The term "metadata" literally means data about data. Strictly speaking its use is inappropriate in the context of RePEc because some of the objects of description of RePEc are not data but physical objects. However, I will continue to use the term metadata for the kind of data that is collected by RePEc.

    In terms of the language adopted by the open source concept, the metadata record is the "source code". The way the metadata record is rendered in the user interface is the "software" as used by the end user. We can then define the open library as a collection of metadata records that has a few special properties.
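
    The analogy can be illustrated with a minimal sketch. It assumes a record held as a simple mapping of field names to values; the handle, the field names and the two rendering functions are hypothetical and are not taken from any RePEc specification. The same record (the "source code") is rendered by two independent user services (the "software").

```python
# Hypothetical record: the handle and field names are made up for illustration.
record = {
    "handle": "example:paper:0001",
    "title": "An Illustrative Working Paper",
    "author": "A. N. Author",
    "year": "2000",
}


def render_citation(rec):
    """One possible user service: a plain-text citation line."""
    return f'{rec["author"]} ({rec["year"]}). {rec["title"]}.'


def render_listing(rec):
    """A second, independent user service: a short listing entry."""
    return f'[{rec["handle"]}] {rec["title"]}'


citation = render_citation(record)
listing = render_listing(record)
print(citation)
print(listing)
```

    Neither rendering function changes the record. Anyone with access to the record could write a third service without coordinating with the authors of the first two.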

    4.3: The definition of the open library

    An open library can be defined as follows. An open library is a collection of metadata records that has the following characteristics:

  • Every record is identified by a unique handle. This requirement distinguishes the library from an archive. It allows for every record to be addressed in an unambiguous way. This is important if links between records are to be established.

  • Records have a homogeneous syntax of field names and field values. This requirement constrains the open library to appear like a database. If this requirement were not present, all public access pages on the Web would form an open library. Note that this requirement does not constrain the open library to contain a homogeneous record format.

  • The documentation of the record format is available for online public access. For example, a collection encoded in MARC format would not qualify as an open library because access to the documentation of MARC is restricted. Without this requirement the cost of acquiring the documentation would be an obstacle to participation.

  • The collection is accessible on a public access computer system. This is the precondition to allow for the construction of user services. Note that user services may not necessarily be open to public access.

  • The collection can be contributed to without monetary cost. There are of course non-monetary costs of contributing to the open library. However, the general principle is that there is no need to pay for either contributing to or using the library. The copyright status of data in an open library should be subject to further research.
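
    The first two characteristics lend themselves to a mechanical check. The following is a minimal sketch, assuming a purely illustrative syntax of one "field-name: value" pair per line; the syntax, the field names and the handles are assumptions and do not reproduce the actual RePEc record format. It parses records of two different types and rejects missing or duplicate handles.

```python
import re

# Assumed syntax for illustration only: one "field-name: value" pair per line.
FIELD_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(\S.*)$")


def parse_record(text):
    """Parse a record into a dict; raise ValueError on a malformed line."""
    fields = {}
    for line in text.strip().splitlines():
        match = FIELD_LINE.match(line.strip())
        if match is None:
            raise ValueError(f"not a 'name: value' line: {line!r}")
        fields[match.group(1).lower()] = match.group(2).strip()
    return fields


def build_collection(texts):
    """Index records by handle, rejecting missing or duplicate handles."""
    collection = {}
    for text in texts:
        rec = parse_record(text)
        handle = rec.get("handle")
        if handle is None:
            raise ValueError("record has no handle")
        if handle in collection:
            raise ValueError(f"duplicate handle: {handle}")
        collection[handle] = rec
    return collection


# Two record types share one syntax but not one record format.
paper = "Handle: example:paper:0001\nTitle: An Illustrative Working Paper"
person = "Handle: example:person:0001\nName: A. N. Author"
collection = build_collection([paper, person])
```

    The paper record and the person record have different fields, yet both pass through the same parser. This is the sense in which a homogeneous syntax does not imply a homogeneous record format.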

    4.4: The open library and the Open Archive

    Stimulated by work of Van de Sompel, Krichel, Nelson et al. (2000), there have been recent moves towards improving the interoperability of e-print archives such as arXiv.org, NCSTRL and RePEc. This work is now called the Open Archive Initiative (OAI), a term coined by Stevan Harnad. The basic business model proposed by the OAI is very close to that of the RePEc project. In particular, the open archive technical protocols allow for the separation between data provision and data implementation that is a key feature of the open library model, as pioneered by RePEc since 1997. In addition, because of their ability to transport multiple metadata sets, the open archive protocols allow for several open libraries to be established on one physical system.

    4.5: The conceptual challenge raised by the open library

    The open library as defined in Subsection 4.3 may be a relatively obvious concept. It certainly is not an elaborate intellectual edifice. Nevertheless, the open library idea raises some interesting conceptual challenges.

    4.5.1: Supply of information

    To me as a newcomer to the Library and Information Studies (LIS) discipline, there appears to be a tradition of emphasizing the behavior of the user who demands information rather than that of the publisher--I use the word here in its widest sense--who supplies it. I presume this orientation comes from the tradition that almost all bibliographic data were sold by commercial or not-for-profit vendors, just like the documents that they describe. Libraries then see their role as intermediaries between the commercial supply and the general public. In that scenario, libraries take the supply of documents and metadata as given.

    The open library proposes to build new supply chains for metadata. If all libraries contribute metadata about objects that are local to them--what that means would have to be defined--then a large open library can be built.

    An open library will only be as good as the data that contributors will give to it. It is therefore important that research be conducted on what data contributors are able to contribute; on how to provide documentation that the contributor can understand; and on understanding a contributor's motivation.

    4.5.2: Digital updatability

    For a long time the library profession has purchased material that is essentially static. It may be subject to physical decay, but the content that it carries is immutable. Digital resources appeared en masse only a few years ago. These resources may be changed at any time. The change from static to dynamic resources is a major challenge for the LIS profession. Naturally the inclination has been to demand that digital resources be like non-digital resources in all but their physical medium. The debate on digital preservation is a result of that demand. Thus the dynamic nature of digital metadata has been seen more as a threat than as an opportunity. The open library is more concerned with digital updatability than with digital preservation.

    4.5.3: Metadata quality control

    In the case of a decentralized dataset, an important problem is to maintain metadata quality. Some elements of metadata quality cannot be controlled by a computer, but others can. For example, each record must have a structure of fields and values associated with these fields to be interoperable with other records. In some cases the field value only makes sense if it has a certain syntax. This is the case, for example, with an email address. One way to achieve quality control is through the use of relational metadata. Each record has an identifier. Records can use the identifiers of other records. It is then possible to update elements of the dataset in an independent way. It is also quite trivial to check if the handle referenced in one record corresponds to a valid handle in the dataset. Highly controllable metadata systems are an important research concern that is related to the open library concept.
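
    Both kinds of automatic check mentioned above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the field names "email" and "author-handles" are hypothetical, the handles are made up, and the e-mail pattern is deliberately loose rather than a full address grammar.

```python
import re

# Deliberately loose e-mail pattern; an assumption, not a full address grammar.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_dataset(records):
    """Return a list of (handle, problem) pairs found in the dataset.

    `records` maps each handle to a dict of fields. The field names
    "email" and "author-handles" are hypothetical.
    """
    problems = []
    for handle, rec in records.items():
        # Syntax check: the value only makes sense in a certain form.
        email = rec.get("email")
        if email is not None and not EMAIL.match(email):
            problems.append((handle, f"malformed email: {email}"))
        # Relational check: every referenced handle must exist.
        for ref in rec.get("author-handles", []):
            if ref not in records:
                problems.append((handle, f"dangling reference: {ref}"))
    return problems


records = {
    "example:person:0001": {"name": "A. N. Author", "email": "author@example.org"},
    "example:person:0002": {"name": "B. Writer", "email": "not-an-address"},
    "example:paper:0001": {
        "title": "An Illustrative Working Paper",
        "author-handles": ["example:person:0001", "example:person:0003"],
    },
}
problems = check_dataset(records)
for handle, problem in problems:
    print(handle, problem)
```

    Because the paper record stores only the handles of its authors, a person can correct an email address in one place without the paper record being touched. The check for dangling references is a simple membership test on the set of handles.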

    5: Conclusions

    To my knowledge, Richard Stallman was the pioneer of open source software. He founded the GNU project in 1984 to write a free operating system to replace Unix. At the time few people believed that such an operating system would ever come about. The same may hold for my audience today, when I am calling for an open library. But remember that in the late 1990s the open source movement basically realized Stallman's dream.

    Building GNU took a long time. But the obstacles facing the open source movement are much more daunting than the obstacles facing the open library movement:

  • The structural complexity of the operating system of a modern computer is much higher than the structural complexity of a metadata collection.
  • Computer programming is a highly profitable activity for the individual who is capable of doing it; therefore the opportunity cost of participating in what is essentially an unpaid activity is much higher. These costs are much lower for the academic or the academic librarian who would participate in an open library construction.
  • There is a network effect that arises when the open library has reached a critical mass. At some stage the cost of providing data is much smaller than the benefit--in terms of more efficient dissemination--of contributing data. When that stage is reached, the open library can grow without external public or private subsidy.

    It remains to be seen how much of an inroad the open library concept will make.


    The work discussed here has received financial support from the Joint Information Systems Committee of the UK Higher Education Funding Councils through its Electronic Library Programme. A version of this paper was presented at the PEAK conference at the University of Michigan on 2000-03-24. I am grateful to Ivan V. Kurmanov for comments on that version. In March 2001, I revised and updated the paper following suggestions by Jeffrey K. MacKie-Mason and Emily Walshe. Neither of them bears responsibility for any remaining errors.

    Appendix