Economics and usage of digital libraries: byting the bullet
13. RePEc, an Open Library for Economics [†]

Thomas Krichel

arXiv.org, the eprints archive founded by Paul Ginsparg at the Los Alamos National Laboratory, continues to be the leading provider of free eprints in the world. Its subject focus is Physics, Mathematics and Computer Science. There is no evidence supporting the idea that similar collections can be built for other subject areas. This chapter is concerned with an alternative approach as exemplified by the RePEc digital library for economics. RePEc has a different business model and a different content coverage than arXiv.org. This chapter addresses both differences.

As far as the business model is concerned, RePEc is an instance of an "Open Library". Such a library is open in two ways. It is open for contribution (third parties can add to it), and it is open for implementation (many user services may be created). Conventional libraries, including most digital libraries, are closed in both senses.

As far as the content coverage is concerned, the term RePEc stands for Research Papers in Economics. However, RePEc has a broader mission. It seeks to build a relational dataset about scholarly resources and other content relating to these resources. The dataset would identify all authors, papers and institutions that work in Economics. Such an ambitious project can only be achieved if the cost to collect data is decentralized and low, and if the benefits of supplying data are large. The Open Library provides a framework where these conditions are fulfilled.

13.1 Introduction

In this chapter I am not concerned with the demand for documents, nor am I concerned with the supply of documents. [1] Instead, I focus on the supply of information about documents. For some documents, holding detailed information about the document is as good as holding the document itself. This is typically the case when the document can be accessed on the Internet without any access restriction. Such a document will be called a public access document. Collecting data about documents is therefore particularly relevant for public access documents.

The main idea brought forward in this paper is the "Open Library". Basically, an open library is a collaborative framework for the supply and usage of document data. Stated in this way the idea of the open library seems quite trivial. To fully appreciate the concept, it is useful to study one open library in more detail. My example is the RePEc dataset about Economics. In Section 13.2, I introduce RePEc as a document data collection. In Section 13.3, I push the RePEc idea further. I discuss the extension of RePEc that allows one to describe the discipline, rather than simply the documents that are produced by the members of the discipline. In Section 13.4, I make an attempt to define the open library more precisely. The example of RePEc demonstrates the relevance of the open library concept. I conclude the paper in Section 13.5.

The efforts of which RePEc is the result go back to 1992. I deliberately stayed away from a description of the history of the work to concentrate on the current status. Therefore, insufficient attribution is given to the people who have contributed to the RePEc effort. See Krichel (1997) for an account of the early history of the NetEc projects. These can be regarded as precursors of RePEc.

13.2 The RePEc document dataset
Origin and motivation of RePEc

A scholarly communication system brings together producers and consumers of documents. For the majority of the documents, the producers do not receive a monetary reward. Their effort is compensated through a wide circulation of the document and peer approval of it. Dissemination and peer approval are the key functions of scholarly communication.

Scholarly communication in Economics has largely been journal-based. Peer review plays a crucial role. Thorough peer review is expensive in time. According to Trivedi (1993), a paper commonly takes over three years from submission to publication in an academic journal, not counting rejections. From informal evidence, slowly rising publication delays have been curbed in the past few years as journal editors have fought hard to cut down on what have been perceived to be intolerable delays.

Researchers at the cutting edge cannot rely solely on journals to keep abreast of the frontiers of research. Prepublication through discussion papers or conference proceedings is now commonplace. Access to this informally-disseminated research is often limited to a small number of readers. It relies on the good will of active researchers to disseminate their work. Since good will is in short supply, insider circles are common.

This time gap between informal distribution and formal publication can only fundamentally be resolved by reforming the quality control process. The inconvenience resulting from the delay can, however, be reduced by improving the efficiency of the informal communication system. This is the initial motivation behind the RePEc project. Its traditional emphasis has been on documents that have not gone through peer review channels. Thus RePEc is essentially a scholarly dissemination system, independent of the quality review process, on the Internet.

Towards an Internet-based scholarly dissemination system

The Internet is a cost-effective means for scholarly dissemination. Many economics researchers and their institutions have established web sites. However, they are not alone in offering pages on the Web. The Web has grown to an extent that the standard Internet search engines only cover a fraction of the Web, and that fraction is decreasing over time (Lawrence and Giles, 1999). Since much of economics research uses common terms such as "growth", "investment" or "money", a subject search on the entire Web is likely to yield an enormous number of hits. There is no practical way to find which pages contain economics research. Due to this low signal-to-noise ratio, the Web per se does not provide an efficient mechanism for scholarly dissemination. An additional classifying scheme is required to segregate references to materials of interest to the economics profession.

The most important type of material relevant to scholarly dissemination is the research paper. One way to organize this type of material has been demonstrated by the arXiv.org preprint archive, founded in 1991 by Paul Ginsparg of the Los Alamos National Laboratory, with an initial subject area in high energy physics. Authors use that archive to upload papers that are stored there. ArXiv.org has now assembled over 150,000 papers, covering a broad subject range of mathematics, physics and computer science, but concentrating on the original subject area. An attempt has been made to emulate the arXiv.org system in economics with the "Economics Working Paper Archive" (EconWPA) based at Washington University in St. Louis, but success has been limited. There are a number of potential reasons:

  • Economists do not issue preprints as individuals; rather, economics departments and research organizations issue working papers.

  • Economists use a wider variety of document formatting tools than physicists. This reduces the functionality of online archiving and makes it more difficult to construct a good archive.

  • Generally, economists are not known for sophisticated practices in computer literacy and are more likely to encounter significant problems with uploading procedures.

  • There is considerable confusion as to the implications of networked pre-publication on a centralized, high-visibility system for the publication in journals.

  • Economics research is not confined to university departments and research institutes. There are a number of government bodies—central banks, statistical institutes, and others—which contribute a significant amount of research in the field. These bodies, by virtue of their size, have more rigid organizational structures. This makes the coordination required for the central dissemination of research more difficult.

An ideal system should combine the decentralized nature of the Web, the centralized nature of the arXiv.org archive, and a zero price to end users. I discuss these three requirements in turn.

The system must have decentralized storage of documents. To illustrate, let us consider the alternative scenario. This would be one where all documents within a certain scope, say within a discipline, would be held on one centralized system. Such a system would not be ideal for three reasons. First, those authors who are rejected by that system would have no alternative publication venue. Since Economics is a contested discipline, this is not ideal. Second, the storage and description of documents is costly. The centralized system may levy a charge on contributors to cover its cost. However, since it enjoys a monopoly, it is likely to use this position to extract rent from authors. This would not be ideal.

On the other hand, we need access points to the documents, both for the usage of the documents by end users and for the monitoring of this usage. These activities are best conducted when a centralized document storage is available, such as the one that arXiv.org affords. Otherwise the economics papers become lost in the complete contents of the web and their usage is recorded in the web logs of many servers. Such usage logs are private to the management of the web servers. They cannot be used to monitor usage.

To explain why the end-user access to the dissemination system should be free, it is useful to refer to Harnad's distinction between trade authors and esoteric authors (1995a). Authors of academic documents are esoteric authors rather than trade authors. They do not expect payments for the written work; instead, they are chiefly interested in reaching an audience of other esoteric authors, and to a lesser extent, the public at large. Therefore the authors are interested in wide dissemination. If a tollgate to the dissemination system is established, then the system will fall short of ideal.

Having established the three criteria for an ideal system, let me turn to the problem of implementing it. The first and third objectives could be accomplished if departments and research centers allow public access to their documents on the Internet. But for the second, we need a library to hold an organized catalog. The library would collect what is known as "metadata": data about documents that are available using Internet protocols. There is no incentive for any single institution to bear the cost of establishing a comprehensive metadata collection, without external subsidy. However, since every institution will benefit from participation in such an effort, we may solve this incentive problem by creating a virtual collection via a network of linked metadata archives. This network is open in the sense that persons and organizations can join by contributing data about their work. It is also open in the sense that user services can be created from it. This double openness promotes a positive feedback effect. The larger the collection's usage, the more effective it is as a dissemination tool, thus encouraging more authors and their institutions to join, as participation is open. The larger the collection, the more useful it becomes for researchers, which leads to even more usage.

Bringing a system to such a scale is a difficult challenge. Change in the area of scholarly communication has been slow, because academic careers are directly dependent on its results. Change is most likely to be driven from within. Therefore, a scholarly dissemination system on the Internet is more likely to succeed if it enhances current practice, without a threat to replace it. In the past, the distribution of informal research papers has been based on institutions issuing working papers. These are circulated through exchange arrangements. RePEc is a way to organize this process on the Internet.

The architecture of RePEc

RePEc can be understood as a decentralized academic publishing system for the economics discipline. RePEc allows researchers' departments and research institutes to participate in a decentralized archival scheme which makes information about the documents that they publish accessible via the Internet. Individual researchers may also openly contribute, but they are encouraged to use EconWPA.

Each contributor needs to maintain a separate collection of data using a set of standardized templates. Such a collection of templates is called an "archive". An archive operates on an anonymous ftp server or a Web server controlled by the archive provider. Each archive provider has total control over the contents of its archive. There is no need to transmit documents elsewhere. The archive provider retains the liberty to post revisions or to withdraw a document.

An example archive. Let us look at an example. The archive of the OECD is at http://web.archive.org/web/20010829193045/http://www.oecd.org/eco/RePEc/oed/ . In that directory we find two files. The first is oedarch.rdf:

Template-Type: ReDIF-Archive 1.0
Handle: RePEc:oed
Name: OECD Economics Department
Maintainer-Email: eco.contact@oecd.org
URL: http://www.oecd.org/eco/RePEc/oed

This file gives basic characteristics about the archive. It associates a handle with it, gives an email address for the maintainer, and most importantly, provides the URL where the archive is located. This archive file gives no indication about the contents of the archive. The contents list is in a second file, oedseri.rdf:

Template-type: ReDIF-Series 1.0
Name: OECD Economics Department working papers
Type: ReDIF-Paper
Provider-Name: OECD Economics Department
Provider-Homepage: http://www.oecd.org/eco/eco/
Maintainer-Email: eco.contact@oecd.org
Handle: RePEc:oed:oecdec

This file lists the content as a series of papers. It associates some provider and maintainer data with the series, and it associates a handle with the series. The format that both files follow is called ReDIF. It is a purpose-built metadata format. Appendix B discusses technical aspects of the ReDIF metadata format that is used by RePEc. See Krichel (2000) for the complete documentation of ReDIF.
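The line-oriented structure of the two files above can be illustrated with a minimal sketch in Python. The function name parse_redif is hypothetical, and the rules it applies (a record begins at each Template-Type field, field names are compared without regard to case, and a line that does not start with a field name continues the previous value) are assumptions read off the examples in this chapter. The authoritative syntax is the one documented in Krichel (2000).

```python
import re
from typing import List, Tuple

Record = List[Tuple[str, str]]

# A field line looks like "Name: value". Requiring whitespace (or the end of
# the line) after the colon keeps a bare URL such as "http://..." from being
# mistaken for a field.
FIELD_LINE = re.compile(r"([A-Za-z][A-Za-z0-9-]*):(?:\s+(.*))?$")


def parse_redif(text: str) -> List[Record]:
    """Split ReDIF-style text into records, each a list of (field, value)."""
    records: List[Record] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match:
            name, value = match.group(1), match.group(2) or ""
            if name.lower() == "template-type":
                records.append([])  # assumed: this field opens a new record
            if records:
                records[-1].append((name, value))
        elif records and records[-1]:
            # Assumed: any other line continues the previous field's value.
            field, previous = records[-1][-1]
            records[-1][-1] = (field, (previous + " " + line).strip())
    return records
```

Keeping each record as an ordered list of pairs, rather than a dictionary, preserves repeated fields such as several Author-Name lines in one template.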

The documents themselves are also described in ReDIF. The location of the paper description is found through appending the handle to the URL of the archive, i.e. at http://web.archive.org/web/20010627025821/www.oecd.org/eco/RePEc/oed/oecdec/ . This directory contains ReDIF descriptions of documents. It may also contain the full text of documents. It is up to the archive to decide whether to store the full text of documents inside or outside the archive. If the document is available online, inside or outside the archive, a link may be provided to the place where the paper may be downloaded. Note that the document may not only be the full text of an academic paper, but it may also be an ancillary file, e.g. a dataset or a computer program.
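The step from a series handle to the directory of paper descriptions can be sketched as follows. The function name series_directory_url is hypothetical, and the rule that the last colon-separated component of the series handle names the directory is my reading of the OECD example, not a quotation from the ReDIF documentation.

```python
def series_directory_url(archive_url: str, series_handle: str) -> str:
    """Return the directory assumed to hold a series' paper descriptions.

    Assumption of this sketch: the last component of the series handle
    (e.g. "oecdec" in "RePEc:oed:oecdec") is a directory below the
    archive URL.
    """
    series_code = series_handle.split(":")[-1]
    return archive_url.rstrip("/") + "/" + series_code + "/"
```

With the archive URL and series handle from the two OECD files, this yields the directory under http://www.oecd.org/eco/RePEc/oed that ends in oecdec/.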

Participation does not imply that the documents are freely available. Thus, a number of journals have also permitted their contents to be listed in RePEc. If the person's institution has made the requisite arrangements with publishers (e.g. JSTOR for back issues of Econometrica or Journal of Applied Econometrics), RePEc will contain links to directly access the documents.

Using the data on archives. One way to make use of the data would be to have a web page that lists all the available archives, and allow users to navigate the archives searching for documents of interest. However, that would be a primitive way to access the data. First, the data as shown in the ReDIF form is not itself hyperlinked. Second, there is no search facility nor filtering of contents.

Providing services that allow for convenient access is not a concern for the archives, but for user services. User services render the RePEc data in a form that makes it convenient for a user. User services are operated by members of the RePEc community, libraries, research projects etc. Each service has its own name. There is no "official" RePEc user service. A list of services in operation at the time of writing may be found in Appendix A.

User services are free to use RePEc data in whatever way they see fit, as long as they observe the copyright statement for RePEc. This statement places some constraints on the usage of RePEc data:

You are free to do whatever you want with this data collected on the archives that are described here, provided that you
(a) Don't charge for it or include it in a service or product that is not free of charge.
(b) When displaying the contents of a template (or part of a template) the following fields must be shown if they are present in the template: Title, Author-Name, File-Restriction and Copyright (if present).
(c) You must contribute to RePEc by maintaining an archive that actively contributes material to RePEc.
(d) You do not contravene any copyright statement found in any of the participating archives.

Within the constraints of that copyright statement, user services are free to provide all or any portion of the RePEc data. Individual user services may place further constraints on the data, such as quality or availability filters.
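Clause (b) of the statement lends itself to a small illustration. The sketch below shows how a user service might choose the fields to display while always keeping the four fields that the statement names. The function name fields_to_display and the representation of a record as a list of (field, value) pairs are assumptions made for this example; no actual user service is described here.

```python
from typing import List, Tuple

# Fields that clause (b) of the statement requires a service to show
# whenever they are present in the template.
MANDATORY_FIELDS = ("title", "author-name", "file-restriction", "copyright")


def fields_to_display(record: List[Tuple[str, str]],
                      wanted: List[str]) -> List[Tuple[str, str]]:
    """Return the service's chosen fields plus the mandatory ones.

    Field names are compared without regard to case (an assumption), and
    the original order of the record is preserved.
    """
    keep = {name.lower() for name in wanted} | set(MANDATORY_FIELDS)
    return [(name, value) for name, value in record if name.lower() in keep]
```

A service that only wishes to show handles would therefore still display the title and the author names of each record.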

Because all RePEc services must be free, user services compete through quality rather than price. All RePEc archives benefit from simultaneous inclusion in all services. This leads to an efficient dissemination that a proprietary system cannot afford.

Building user services. The provision of a user service usually starts with putting frequently updated copies of RePEc archives on a single computer system. This maintenance of a frequently updated copy of archives is called "mirroring". Everything contained in an archive may be mirrored. For example, if a document is in the archive, it may be mirrored. If the archive management does not wish the document to be mirrored, it can store it outside the archive. The advantage of this remote storage is that the archive maintainer will get a complete set of access logs to the file. The disadvantage is that every request for the file will have to be served from the local archive rather than from the RePEc site that the user is accessing.

An obvious way to organize the mirroring process overall would be to mirror the data of all archives to a central location. This central location would in turn be mirrored to the other RePEc sites. The founders of RePEc did not adopt that solution because it would be quite vulnerable to mistakes at the central site. Instead, each site installs the mirroring software and mirrors its own data. Not all sites adopt the same frequency of updating. Some may update daily, while some may only update weekly. A disadvantage of this system is that it is not known how long it takes for a new item to be propagated through the system.

The documents available through RePEc

Over 160 archives, some of them representing several institutions, in 25 countries currently participate in RePEc. Over 100 universities contribute their working papers, including U.S. institutions such as Berkeley, Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State, UCLA, and Virginia. The RePEc collection also contains information on all NBER Working Papers, the CEPR Discussion Papers, the contents of the Fed in Print database of the US Federal Reserve, and complete paper series from the IMF, World Bank and OECD, as well as the contributions of many other research centers worldwide. RePEc also includes the holdings of EconWPA. In total, at the time of writing in March 2001, over 37,000 items are downloadable.

The bibliographic templates describing each item currently provide for papers, articles, and software components. The article templates are used to fully describe published articles. They are currently in use by the Canadian Journal of Economics, Econometrica, the Federal Reserve Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and the RAND Journal of Economics. These are only a few of the participating journals.

The RePEc collection of metadata also contains links to several hundred "software components"—functions, procedures, or code fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and RATS languages, as well as code in FORTRAN, C and Perl. The ability to catalog and describe software components affords users of these languages the ability to search for code applicable to their problem—even if it is written in a different language. Software archives that are restricted to one language, such as those maintained by individual software vendors or volunteers, do not share that breadth. Since many programs in high-level languages may be readily translated from, say, GAUSS to MATLAB, this breadth may be very welcome to the user.

13.3 The ReDIF metadata

From the material that we have covered in the previous section, we can draw a simple organizational model of RePEc as:

Many archives ⇒ One dataset ⇒ Many services

Let us turn from the organization of RePEc to its contents. RePEc is about more than the description of resources. It is probably best to say that RePEc is a relational database about economics as a discipline.

One possible interpretation of the term "discipline" is given by Karlsson and Krichel (1999). They have come up with a model of a discipline as consisting of four elements arranged in a table:

resource      collection
person        institution

A few words may help to understand that table. A "resource" is any output of academic activity: a research document, a dataset, a computer program, or anything else that an academic person would claim authorship for. A "collection" is a logical grouping of resources. For example, one collection might be comprised of all articles that have undergone the peer review process. A "person" is a physical person; a person may also be a corporate body acting as a physical person in the context of RePEc.

These data collectively form a relational database describing not only the papers, but also the authors who write them, the institutions where the authors work, and so on. All this data is encoded in the ReDIF metadata format, as illustrated in the following examples.

A closer look at the contents

To understand the basics of ReDIF it is best to start with an example. Here is a piece of ReDIF data at http://www.econ.surrey.ac.uk/discussion_papers/RePEC/sur/surrec/surrec9601.pdf :   [2]

Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Person: RePEc:per:1965-06-05:thomas_krichel
Author-Email: T.Krichel@surrey.ac.uk
Author-Name: Paul Levine
Author-Email: P.Levine@surrey.ac.uk
Author-WorkPlace-Name: University of Surrey
Classification-JEL: C61; E21; E23; E62; O41
File-URL: ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.pdf
File-Format: application/pdf
Creation-Date: 199603
Revision-Date: 199711
Handle: RePEc:sur:surrec:9601

When we look at this record, the ReDIF data resembles a standard bibliographical format, with authors, title etc. The only thing that appears a bit mysterious here is the "Author-Person" field. This field quotes a handle that is known to RePEc. This handle leads to a record maintained at a RePEc handle server. [3]

Template-Type: ReDIF-Person 1.0
Name-Full: KRICHEL, THOMAS
Name-First: THOMAS
Name-Last: KRICHEL
Postal: 1 Martyr Court
10 Martyr Road
Guildford GU1 4LF
England
Email: t.krichel@surrey.ac.uk
Homepage: http://openlib.org/home/krichel
Workplace-Institution: RePEc:edi:desuruk
Author-Paper: RePEc:sur:surrec:9801
Author-Paper: RePEc:sur:surrec:9702
Author-Paper: RePEc:sur:surrec:9601
Author-Paper: RePEc:rpc:rdfdoc:concepts
Author-Paper: RePEc:rpc:rdfdoc:ReDIF
Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL

In this record, we have the handles of documents that the person has written. This record will allow user services to list all the papers by a given author. This is obviously useful when we want to find papers that one particular author has written. It is also useful to have a central record of the person's contact details. This eliminates the need to update the relevant data elements on every document record. In fact the record on the paper template may be considered as the historical record that is valid at the time when the paper was written, but the address in the person template is the one that is currently valid.
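The way a user service might follow the Author-Paper handles of a person record can be sketched in a few lines of Python. The function name titles_by_author and the representation of a record as a mapping from field names to lists of values are assumptions of this example, and so is the decision to skip handles for which no paper record is at hand.

```python
from typing import Dict, List

Record = Dict[str, List[str]]  # field name -> all values of that field


def titles_by_author(person: Record,
                     papers_by_handle: Dict[str, Record]) -> List[str]:
    """List the titles of the papers that a person record points to."""
    titles: List[str] = []
    for handle in person.get("Author-Paper", []):
        paper = papers_by_handle.get(handle)
        if paper is None:
            # Assumed behavior: a handle without a known paper record
            # (for example, one not yet mirrored) is silently skipped.
            continue
        titles.extend(paper.get("Title", []))
    return titles
```

Because the link is made through handles rather than through name strings, the list does not depend on how the author's name is spelled on each paper.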

In the person template, we find another RePEc identifier in the "Workplace-Institution" field. This points to a record that describes the institution, stored at another RePEc handle server.

Template-Type: ReDIF-Institution 1.0
Primary-Name: University of Surrey
Primary-Location: Guildford
Secondary-Name: Department of Economics
Secondary-Phone: (01483) 259380
Secondary-Email: economics@surrey.ac.uk
Secondary-Fax: (01483) 259548
Secondary-Postal: Guildford, Surrey GU2 5XH
Secondary-Homepage: http://www.econ.surrey.ac.uk/
Handle: RePEc:edi:desuruk

The information in this record is self-explanatory. Less apparent is the origin of these records.

Institutional registration

The registration of institutions is accomplished through the Economics Departments, Institutions and Research Centers (EDIRC) project, compiled by Christian Zimmermann, an Associate Professor of Economics at Université du Québec à Montréal on his own account, as a public service to the economics profession. The initial intention was to compile a directory of all economics departments that have a web presence. Many departments have a web presence now; about 5,000 of them are registered at the time of this writing. All these records are included in RePEc. For each institution, data on its homepage is available, as well as postal and telephone information. For some, there is even data on the main area of work. Thus it is possible to find a list of institutions where, for example, a lot of work in labor economics is being done. At the moment, EDIRC is mainly linked to the rest of the RePEc data through the HoPEc [4] personal registration service. Other links are possible, but are rarely used.

Personal registration

HoPEc has a different organization from EDIRC. It is impossible for a single academic to register all persons who are active in Economics. One possible approach would be to ask archives to register people who work at the related institution. This will make archive maintainers' work more complicated, but the overall maintenance effort will be smaller once all current authors are registered. However, authors move between archives, and many have work that appears in different archives. To date, there is no satisfactory way to deal with moving authors. For this reason, the author registration is carried out using a centralized system.

A person who is registered with HoPEc is identified by a string that is usually close to the person's name and by a date that is significant to the registrant. HoPEc suggests the birth date but any other date will do as long as the person can remember it. When registrants work with the service, they first supply such personal information as the name, the URL of the registrant's homepage, and the email address. Registrants are free to enter data about their academic interests—using the Journal of Economic Literature Classification Scheme—and the EDIRC handle of their primary affiliation.

When the registrant has entered this data, the second step is to create associations between the record of the registrant and the document data that is contained in RePEc. The most common association is the authorship of a paper; however, other associations are possible, for example the editorship of a series. The registration service then looks up the name of the registrant in the RePEc document database. The registrant can then decide which potential associations are relevant. Because authentication methods are weak, HoPEc relies on honesty.
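The lookup step can be illustrated by a minimal sketch. The chapter does not describe the matching rule that HoPEc actually uses, so the rule below (compare names as unordered sets of lower-cased name parts) is purely an assumption, as are the function name candidate_papers and the input mapping from paper handles to author names. The result is only a list of candidates; as described above, the registrant decides which of them are relevant.

```python
from typing import Dict, FrozenSet, List


def _name_key(name: str) -> FrozenSet[str]:
    """Reduce a name to its lower-cased parts, ignoring order and punctuation."""
    return frozenset(name.replace(",", " ").replace(".", " ").lower().split())


def candidate_papers(registrant_name: str,
                     author_names_by_handle: Dict[str, List[str]]) -> List[str]:
    """Return handles of papers with an author name resembling the registrant's.

    Hypothetical matching rule: two names match when they consist of the
    same parts, so "KRICHEL, THOMAS" matches "Thomas Krichel".
    """
    wanted = _name_key(registrant_name)
    return sorted(
        handle
        for handle, names in author_names_by_handle.items()
        if any(_name_key(n) == wanted for n in names)
    )
```

A rule this loose will propose papers by namesakes, which is one reason why the final decision is left to the registrant.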

There are several significant problems that a service like HoPEc faces. First, since there is no historical precedent for such a service, it is not easy to communicate the raison d'être of the service to a potential registrant. Some people think that they need to register in order to use RePEc services. While this delivers data about who is interested in using RePEc services (and to whom we have been unsuccessful in communicating that these services are free), it clutters the database with records of limited usefulness. Last but by no means least, there are all kinds of privacy issues involved in the composition of such a dataset.

To summarize, HoPEc provides information about a person's identity, affiliation and research interests, and links these data with resource descriptions in RePEc. This allows the identification of a person and the maintenance of related metadata in a timely and cost-efficient way. These data could fruitfully be employed for other purposes, such as maintaining membership data for scholarly societies or lists of conference participants.

13.4 The open library

This section attempts to find a general theory applicable to a wide set of circumstances in which systems similar to RePEc are desirable. I call this general concept the open library. The parallel to the open source concept is intentional. It is therefore useful to review the open source concept first.

The open source concept

There is no official and formal definition of what the term "open source" means. On the website of the Open Source Initiative at http://opensource.org/ an elegant introduction to the idea is found:

The basic idea behind open source is very simple. When programmers can read, redistribute, and modify the source code for a piece of software, the software evolves. People improve it, people adapt it, people fix bugs. And this can happen at a speed that, if one is used to the slow pace of conventional software development, seems astonishing.
We in the open source community have learned that this rapid evolutionary process produces better software than the traditional closed model, in which only a very few programmers can see the source and everybody else must blindly use an opaque block of bits.

Open source software imposes no restrictions on the distribution of the source code required to build a running version of the software. As long as users have no access to the source code, they may be able to use a running version of the software, but they cannot change the way that the software behaves. The latter involves changing the source code and rebuilding the running version of the software from the source code. Since building the software out of the source code is quite straightforward, software that has a freely available source code is essentially free.

Open Source and open library

The open source movement claims that the building of software in an open, collaborative way—enabled by the sharing of the source code—allows software to be built better and faster. The open library concept is an attempt to apply the concept of the open source to a library setting. We start off with the RePEc experience.

Within the confines of RePEc as a document collection, it is unrealistic to expect free distribution of a document's source code. Such a source code is, for example, the word processor file of an academic paper. If such a source code were available for others to change, then the ownership of the intellectual property in the document would be dissolved. Since intellectual ownership over scientific ideas is crucial in the academic reward system, it is unlikely that such source code distribution will take place. Within the confines of RePEc's institutional and personal collection, there is no such source code that could be freely shared.

To apply the open source principle to RePEc we must conceptualize RePEc as a collection of data. In terms of the language adopted by the open source concept, the individual data record is the "source code". The way the data record is rendered in the user interface is the "software" as used by the end user. We can then define the open library as a collection of data records that has a few special properties.

The definition of the open library

An open library is a collection of data records that has the following characteristics:

  • Every record is identified by a unique handle. This requirement distinguishes the library from an archive. It allows for every record to be addressed in an unambiguous way. This is important if links between records are to be established.

  • The syntax in all records of field names and field values is homogeneous. This constraint causes the open library to appear like a database. If this requirement were not present, all public access pages on the Web would form an open library. Note that this requirement does not constrain the open library to contain a homogeneous record format.

  • The documentation of the record format is available for online public access. For example, a collection encoded in MARC format would not qualify as an open library because access to the documentation of MARC is restricted. Without this requirement the cost of acquiring the documentation would be an obstacle to participation.

  • The collection is accessible on a public access computer system. This is the precondition to allow for the construction of user services. Note that user services may not necessarily be open to public access.

  • Contributing to the collection is without monetary cost. There are of course non-monetary costs to contribute to the open library. However the general principle is that there is no need to pay for either contributing or using the library. The copyright status of data in an open library should be subject to further research.

The open library and the Open Archive

Stimulated by work of Van de Sompel, Krichel, Nelson, et al. (2000), there have been recent moves towards improving the interoperability of e-print archives such as arXiv.org, NCSTRL, and RePEc. This work is now called the Open Archives Initiative (OAI), see http://www.OpenArchives.org . The basic business model proposed by the OAI is very close to that of the RePEc project. In particular, the open archive technical protocols allow data provision to be separated from data implementation, a key feature of the open library model as pioneered by RePEc since 1997. In addition, because of their ability to transport multiple data sets, the open archive protocols allow for several open libraries to be established on one physical system.

The conceptual challenge raised by the open library

The open library as defined in Subsection 13.4 may be a relatively obvious concept. It certainly is not an elaborate intellectual edifice. Nevertheless, the open library idea raises some interesting conceptual challenges.

Supply of information. To me as a newcomer to the Library and Information Studies (LIS) discipline, there appears to be a tradition of emphasizing the behavior of the user who demands information rather than the publisher (I use the word here in its widest sense) who supplies it. I presume this orientation comes from the tradition that almost all bibliographic data were sold by commercial or not-for-profit vendors, just like the documents that they describe. Libraries then see their role as intermediaries between the commercial supply and the general public. In that scenario, libraries take the supply of documents and data as given.

The open library proposes to build new supply chains for data. If all libraries contribute metadata—data about data—about objects that are local to them—what that means would have to be defined—then a large open library can be built.

An open library will only be as good as the data that contributors give to it. It is therefore important that research be conducted on what data contributors are able to contribute; on how to provide documentation that the contributor can understand; and on understanding a contributor's motivation.

Digital updatability. For a long time, libraries could only purchase material that is essentially static. It might decay physically, but the content is immutable. The advent of digital resources provoked a debate. Because they may be changed at any time, digital resources may be used for more than the preservation of ideas. Traditionally inclined libraries have demanded that digital resources be like non-digital resources in all but appearance, and view the mutability of digital data more as a threat than as an opportunity. The open library, however, is more concerned with digital updatability than with preservation. Clearly, this transition from static to dynamic resources poses a major challenge to the LIS profession.

Metadata quality control. In the case of a decentralized dataset, an important problem is to maintain metadata quality. Some elements of metadata quality can be controlled by a computer. For example, each record must utilize a structure of fields and values associated with these fields to be interoperable with other records. In some cases the field value only makes sense if it has a certain syntax. This is the case, for example, with an email address. One way to achieve quality control is through the use of relational metadata. Each record has an identifier. Records can use the identifiers of other records. It is then possible to update elements of the dataset in an independent way. It is also simple to check if the handle referenced in one record corresponds to a valid handle in the dataset. Highly controllable metadata systems are an important research concern related to the open library concept.
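The two machine checks mentioned above, field syntax and handle references, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RePEc's actual software: the record layout (dictionaries with a "handle" field), the field names "author-person" and "email", the sample handles and the deliberately loose email pattern are all hypothetical.

```python
import re

# Deliberately loose pattern (an assumption for illustration, not a full address grammar).
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def check_records(records, reference_fields=("author-person",), email_fields=("email",)):
    """Return (handle, problem) pairs found in a list of record dictionaries."""
    problems = []
    # Every handle present in the dataset; references must point into this set.
    known_handles = {record["handle"] for record in records}
    for record in records:
        handle = record["handle"]
        # Relational check: each referenced handle must exist in the dataset.
        for field in reference_fields:
            for target in record.get(field, []):
                if target not in known_handles:
                    problems.append((handle, f"{field} points to unknown handle {target}"))
        # Syntax check: an email field must look like an email address.
        for field in email_fields:
            value = record.get(field)
            if value is not None and EMAIL_PATTERN.fullmatch(value) is None:
                problems.append((handle, f"{field} is not a valid email address: {value}"))
    return problems


# Hypothetical sample data: one bad email address and one dangling reference.
records = [
    {"handle": "example:person:p1", "email": "jane@example.org"},
    {"handle": "example:person:p2", "email": "not-an-address"},
    {"handle": "example:paper:1", "author-person": ["example:person:p1", "example:person:p9"]},
]
problems = check_records(records)
for handle, problem in problems:
    print(handle, "-", problem)
```

Because each record is checked only against the set of known handles, a person record can be updated independently of the paper records that refer to it.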

13.5 Conclusions

To my knowledge, Richard Stallman was the pioneer of open source software. In 1984, when he founded the GNU ("GNU's Not Unix") project to write a free operating system to replace Unix, few people believed that such an operating system would come about. Building GNU took a long time, but in the late 1990s, the open source movement basically realized Stallman's dream. My call for an open library may face similar skepticism, but the obstacles it faces are fewer and less daunting than those faced by the open source movement:

  • The operating system of a modern computer is far more complex than a metadata collection.

  • Computer programming is a highly profitable activity for the individual capable of doing it; therefore the opportunity cost of participating in what is essentially an unpaid activity is much higher. These costs are much lower for the academic or the academic librarian who would participate in an open library construction.

  • A network effect arises when the open library has reached a critical mass. At some stage the cost of providing data is much smaller than the benefit—in terms of more efficient dissemination—of contributing data. When that stage is reached, the open library can grow without external public or private subsidy.

It remains to be seen how great an inroad the open library concept will make into the library community.

Appendix A: The main user services   [5]

BibEc at http://netec.mcc.ac.uk/bibec.html & WoPEc at http://netec.mcc.ac.uk/wopec.html provide static HTML pages for all working papers that are only available in print (BibEc) and all papers that are available electronically (WoPEc). Both datasets use the same search engines. There are three search engines: a full-text WAIS engine, a fielded search engine based on the mySQL relational database and a ROADS fielded search engine. The mySQL database is also used for the control of the relational components in the RePEc dataset. BibEc and WoPEc are based at Manchester Computing, with mirrors in Japan and the United States.

EDIRC at http://edirc.repec.org/ provides web pages that represent the complete institutional information in RePEc.

IDEAS at http://ideas.repec.org/templates.html provides an Excite index of static html pages that represent all paper, article and software templates. This is by far the most popular RePEc user interface.

NEP: New Economics Papers at http://nep.repec.org/ is a set of reports on new additions of papers to RePEc. Each report is edited by subject specialists who receive information on all new additions and then select the papers that are relevant to the subject of the report. These subject specialists are PhD students and junior researchers, who work as volunteers. As of 14 March 2000, there were 2753 different email addresses subscribed to at least one list.

The Tilburg University Working Papers & Research Memoranda service was at http://www.kub.nl/~dbi/demomate/repref.htm, but is now closed. The interface is archived at http://web.archive.org/web/20010305214804/cwis.kub.nl/~dbi/demomate/repref.htm

socionet at http://socionet.ru is a server in Russian. Its maintainers also provide archival facilities for Russian contributors.

INOMICS at http://www.inomics.com/ not only provides an index of RePEc data but also allows simultaneous searches in indices of other Web pages related to Economics.

HoPEc at http://authors.repec.org/ provides a personal registration service for authors and allows searches for personal data.

Appendix B: The ReDIF metadata format

The ReDIF metadata format is inspired by Deutsch et al. (1994), commonly known as the IAFA templates. In particular, it borrows the idea of clusters from the draft:

There are certain classes of data elements, such as contact information, which occur every time an individual, group or organization needs to be described. Such data as names, telephone numbers, postal and email addresses etc. fall into this category. To avoid repeating these common elements explicitly in every template below, we define "clusters" which can then be referred to in a shorthand manner in the actual template definitions.

ReDIF takes a slightly different approach to clusters. A cluster is a group of fields that jointly describe a repeatable attribute of the resource. This is best understood by an example. A paper may have several authors. For each author we may have several fields of interest: name, email address, homepage, etc. If we have several authors then we have several such groups of attributes. In addition, each author may be affiliated with several institutions. Here each institution may be described by several attributes for its name, homepage, etc. Thus, a nested data structure is required. It is evident that this requirement is best served in a syntax that explicitly allows for it, such as XML. However, when ReDIF was designed in 1997, XML was not available. While the template syntax is more humanly readable and easier to understand, the computer cannot find which attributes correspond to the same cluster unless some ordering is introduced. Therefore we proceed as follows. For each group of attributes that make up a cluster, we specify one attribute as the "key" attribute. Whenever the key attribute appears, a new cluster is supposed to begin. For example, if the cluster describes a person then the name is the key. If an "author-email" appears without an "author-name" preceding it, the parsing software aborts the processing of the template.
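The key-attribute rule can be illustrated with a short Python sketch. It is not the ReDIF parsing software: it assumes a simple "field: value" line layout, handles only one level of clustering (the nested institution clusters are left out), and derives the cluster prefix from the part of the key attribute before its first hyphen. The function name parse_template and the sample field values are hypothetical.

```python
def parse_template(lines, key_attributes=("author-name",)):
    """Split one template into plain fields and clusters opened by key attributes."""
    # Map each cluster prefix (e.g. "author-") to its key attribute.
    prefixes = {key.split("-")[0] + "-": key for key in key_attributes}
    fields = {}         # fields that belong to no cluster
    clusters = []       # one dictionary per cluster, in order of appearance
    open_clusters = {}  # prefix -> cluster currently being filled
    for line in lines:
        if not line.strip():
            continue
        # The field name ends at the first colon; the value may contain more colons.
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        prefix = next((p for p in prefixes if name.startswith(p)), None)
        if prefix is None:
            fields[name] = value
        elif name == prefixes[prefix]:
            # The key attribute appears: a new cluster begins.
            open_clusters[prefix] = {name: value}
            clusters.append(open_clusters[prefix])
        elif prefix not in open_clusters:
            # A cluster field without a preceding key attribute aborts processing.
            raise ValueError(f"{name!r} appears before {prefixes[prefix]!r}")
        else:
            open_clusters[prefix][name] = value
    return fields, clusters


# Hypothetical template lines for illustration.
template = [
    "Title: An example paper",
    "Author-Name: Jane Doe",
    "Author-Email: jane@example.org",
    "Author-Name: John Roe",
    "Author-Homepage: http://example.org/~roe",
]
fields, clusters = parse_template(template)
print(fields)
print(clusters)
```

Each occurrence of the key attribute starts a fresh dictionary, so the two authors end up in two separate clusters even though the template itself is a flat list of lines.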

Note that the designation of key attributes is not a feature of ReDIF. It is a feature of the template syntax of ReDIF. It is only the syntax that makes nesting more involved. I do not think that this is an important shortcoming. I believe that the nested structure involving the persons and organizations should not be included in the document templates. What should be done instead is to separate the personal information out of the document templates into separate person templates. This approach is discussed extensively in the main body of the paper.

ReDIF is a metadata format that comes with tools to make it easy to use in a framework where the metadata is harvested. A file that is simply harvested from a computer system could contain any type of digital content. Therefore the harvested data must be parsed by special software that filters the data. This task is accomplished by the rr.pm module written by Ivan V. Kurmanov. It parses ReDIF data and validates its syntax. For example, any date within ReDIF has to be of the ISO 8601 form yyyy-mm-dd. A date like "14 Juillet 1789" would not be recognized by the ReDIF reading software and would not be passed on to application software that a service provider would use.
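The kind of date filtering described here can be sketched as follows. This is a Python illustration, not the rr.pm module itself (rr.pm is a Perl module whose internals are not shown in this chapter). The function name is hypothetical, and the sketch checks only the full yyyy-mm-dd form mentioned above.

```python
import re
from datetime import date

# Four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def is_valid_iso_date(text):
    """Return True if text has the form yyyy-mm-dd and names a real calendar day."""
    match = ISO_DATE.fullmatch(text)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


for candidate in ("1789-07-14", "14 Juillet 1789", "1789-02-30"):
    print(candidate, is_valid_iso_date(candidate))
```

A harvesting service could apply such a test to each date field and drop, or report, the values that fail before passing records on to application software.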

The rr.pm software uses a formal syntax specification redif.spec. This formal specification is itself encoded in a purpose-built format code-named spefor. Therefore, it is possible for ReDIF-using communities to change the syntax restrictions or even design a whole new metadata vocabulary from scratch.

Notes

†.[The work discussed here has received financial support from the Joint Information Systems Committee of the UK Higher Education Funding Councils through its Electronic Library Programme. A version of this paper was presented at the PEAK conference at the University of Michigan on 24 March 2000. I am grateful to Ivan V. Kurmanov for comments on that version. In March 2001, I revised and updated the paper following suggestions by Jeffrey K. MacKie-Mason and Emily Walshe. Neither of them bears responsibility for any remaining errors. This paper is available online at http://openlib.org/home/krichel/salibury.html .]

1.[Reports of research results in research "papers" form the bulk of academic digital or digitisable data. I refer to these as documents.]

2.[I suppress the Abstract: field to conserve space.]

3.[I leave out a few fields to conserve space.]

4.[HoPEc stood initially for Home Page Papers in Economics, but this would be totally misleading now.]

5.[I list them by order of historical appearance. The "Tilburg University working papers & research memoranda" service is operated by a library-based group that has received funding from the European Union. INOMICS is operated by the Economics consultancy Berlecon Research. All the other user services are operated by junior academics. ]

