Working towards an Open Library for Economics: The RePEc project




Thomas Krichel
Department of Economics
University of Surrey
Stag Hill
Guildford GU2 7XH
United Kingdom


After arXiv.org, the RePEc Economics library offers the second-largest source of freely downloadable scientific preprints in the world. RePEc has a different business model and a different content coverage than arXiv.org. This paper addresses both differences.

As far as the business model is concerned, RePEc is an instance of a concept that I call the "Open Library". An Open Library is open in two ways. It is open for contribution (third parties can add to it), and it is open for implementation (many user services may be created). Conventional libraries--including most digital libraries--are closed in both directions.

As far as the content coverage is concerned, RePEc seeks to build a relational dataset about scholarly resources and other aspects of reality that are related to these resources. This basically means identifying all authors, all papers and all institutions that work in economics. Such an ambitious project can only be achieved if the cost to collect metadata is decentralized and low, and if the benefits to supply metadata are large. The Open Library provides a framework where these conditions are fulfilled.

This paper is available in PDF for A4 paper, and for letter size paper.

1: Introduction

Digital or digitisable data is supplied by publishers, to be consumed by readers. Reports of research results in research "papers" form the bulk of academic digital or digitisable data, and I will refer to these as documents in the following.

In this paper I am not concerned with the demand for documents, nor am I concerned with the supply of documents. Instead I focus on the supply of information about documents. For some documents holding detailed information about the document is as good as holding the document itself. This is typically the case when the document can be accessed on the Internet without any access restriction. Such a document will be called a public access document. Collecting metadata is therefore particularly relevant for public access documents.

The main idea that is brought forward in this paper is the "Open Library". Basically an Open Library is a collaborative framework for the supply and usage of document metadata. Stated in this way the idea of the Open Library is quite trivial. To fully appreciate the concept, it is useful to study one Open Library in more detail. My example is the RePEc dataset about Economics. In Section 2 I introduce RePEc as a document metadata collection. In Section 3 I push the RePEc idea further. I discuss the extension of RePEc that allows one to describe the discipline, rather than simply the documents that are produced by the members of the discipline. In Section 4 I make an attempt to define the Open Library more precisely. The example of RePEc demonstrates the relevance of the Open Library concept. I conclude the paper in Section 5.

In this article I have deliberately chosen to stay away from many technical issues that arise in the context of RePEc and that have to be addressed by any Open Library. Some more technical aspects have nevertheless been included in the appendices. Appendix 1 lists the main RePEc user services. Appendix 2 discusses technical aspects of the ReDIF metadata format that is used by RePEc.

The efforts of which RePEc is the result go back to 1992. I deliberately stayed away from a description of the history of the work to concentrate on the current status. Therefore insufficient attribution is given to the people who have earned historic merits by contributing to the RePEc effort.

2: The RePEc document dataset

2.1: Origin and motivation of RePEc

A scholarly communication system brings together producers and consumers of documents. For the majority of the documents, the producers do not receive a monetary reward. Their effort is compensated through a wide circulation of the document and a mark of peer approval for it. Dissemination and peer approval are the dual functions of the communication system.

Scholarly communication in economics has largely been journal-based. Peer review plays a crucial rôle. Thorough peer review is expensive in time. According to Trivedi (1993), it is common that a paper takes over three years from submission to publication in an academic journal, not counting rejections. From informal evidence publication delays have slowly risen, but stabilized in the past few years as journal editors have fought hard to cut down on what have been perceived to be intolerable delays.

Researchers at the cutting edge cannot rely solely on journals to keep abreast of the frontier of research. Prepublication through discussion papers or conference proceedings is now commonplace. Access to this informally disseminated research is often limited to a small number of readers. It relies on the good will of active researchers to disseminate their work. Since good will is in short supply, insider circles are common.

This time gap between informal distribution and formal publication can only fundamentally be resolved by reforming the quality control process. The inconvenience resulting from the delay can however be reduced by improving the efficiency of the informal communication system. This is the initial motivation behind the RePEc project. Therefore its traditional emphasis has been on documents that have not gone through peer review channels. These documents are often available for public access, and therefore a metadata collection is particularly useful.

2.2: Towards an Internet-based scholarly dissemination system

The Internet is a cost-effective means for scholarly dissemination. Many economics researchers and their institutions have established Web sites. However they are not alone in offering pages on the Web. The Web has grown to an extent that the standard Internet search engines only cover a fraction of the Web, and that fraction is decreasing over time (Lawrence and Giles 1999). Since much of the Economics research uses common terms such as "growth", "investment" or "money", it is likely that a subject search on the entire Web would yield an enormous amount of hits. There would be no practical way to find which pages contain economic research. Due to this low signal to noise ratio, the Web per se does not provide an efficient mechanism for scholarly dissemination. An additional classifying scheme is required to segregate references to materials of interest to the economics profession.

The most important type of material relevant to scholarly dissemination is the research paper. One way to organize this type of material has been demonstrated by the arXiv.org preprint archive, founded in 1991 by Paul Ginsparg of the Los Alamos National Laboratory with an initial subject area in high energy physics. Authors use that archive to upload papers, which remain stored there. The archive has now assembled over 100,000 papers, covering a broad subject range of mathematics, physics and computer science, but concentrating on the original subject area. An attempt has been made to emulate the system in economics with the "Economics Working Paper Archive" (EconWPA) based at Washington University in St. Louis. Its success has been limited. There are a number of potential reasons:

  • Individual economists often do not issue preprints: rather, economics departments and research organizations issue working papers.
  • Economists use a wider variety of document formatting tools than physicists. This reduces the functionality of online archiving and makes it more difficult to construct a good archive.
  • In their book of tales economists read that monopolies are bad for welfare. Accordingly, they may be suspicious of any effort to centralize the output of the discipline on one central system.
  • Economists are not very computer literate, therefore they will have problems with uploading procedures.
  • Economics starts with the assumption that people are rational and selfish. Therefore coordination on a voluntary scheme is difficult to achieve.

An ideal system should combine the decentralized nature of the Web, the centralized nature of the archive, and a zero price to end users in order to reach the maximum number of readers. The first two objectives could be accomplished if departments and research centers allow for public access to their documents on the Internet. But then we need a library to catalog these papers. The library would collect what is known as "metadata": data about documents that are available using Internet protocols. This would however fail to satisfy the third objective of zero cost to users.

The impediments to the creation of that framework are thus not technical but economic in nature. There is no incentive for any single institution to bear the cost of establishing a comprehensive metadata collection, with aspects of a public good, without external subsidy. However, since every institution will benefit from participation in such an effort, we may solve this incentive problem by creating a virtual collection via a network of linked metadata archives. Each institution need only maintain its own collection of document metadata using a set of standardized templates. This strategy is the basic business model of the RePEc system.

    2.3: RePEc as an electronic document exchange

    A scholarly dissemination system on the Internet should start by enhancing the pre-Internet practice rather than attempting to replace it. The distribution of informal research papers in the past has been based on institutions issuing working papers. These are circulated through exchange arrangements. RePEc is a way to organize this process on the Internet. There are three concepts that are crucial to that process: archive, site and service.

  • An "archive" is a space on a public access computer system which makes document metadata available. It is a place where original data enters the system. There is no need to run any software other than an ftp or http daemon that makes the files in the archive available upon request.
  • A "site" is a collection of archives on the same computer system. It usually consists of a local archive augmented by frequently updated ("mirrored") copies of remote archives.
  • A "service" is a rendering of RePEc data in a form that is available to the end user.

    I will discuss these concepts in turn.

    2.4: Archives

    RePEc allows researchers' departments and research institutes to participate in a decentralized archival scheme which makes information about the documents that they publish accessible via the Internet. Individual researchers may also open archives, but they are encouraged to use EconWPA. A contributor places metadata about its documents on a public access computer system. This is usually an anonymous ftp server or a Web server. This metadata must be in the ReDIF format. If a document is available online, a link may be provided to the place where the paper may be downloaded. Note that the document need not be the full text of an academic paper; it may also be an ancillary file, e.g. a dataset or a computer program.

    Each participating institution has total control over the contents of its archive. There is no need to transmit documents elsewhere. The archive management retains the liberty to post revisions or to withdraw a document.

    Participation does not imply that the documents are freely available. Thus, a number of journals have also permitted their contents to be listed in RePEc. If a user's institution has made the requisite arrangements with publishers (e.g. JSTOR for back issues of Econometrica or Journal of Applied Econometrics), RePEc will contain links to directly access the documents.

    2.5: Sites

    A site is a local archive plus copies of a number of remote archives that are regularly updated, i.e. mirrored. Everything contained in an archive may be mirrored. For example, if a document is in the archive, it may be mirrored. If the archive management does not wish the document to be mirrored, it can store it outside the archive. The advantage of this remote storage is that the archive maintainer will get a complete set of access logs to the file. The disadvantage is that every request for the file will have to be served from the local archive rather than from the RePEc site that the user is accessing.

    An obvious way to organize the mirroring process would be to mirror the data of all archives to a central location. This central location would in turn be mirrored to the other RePEc sites. The founders of RePEc did not adopt that solution, because it would be quite vulnerable to mistakes at the central site. Instead each site installs the mirroring software and mirrors "on its own", so to speak. Not all of them adopt the same frequency of updating. Some may update daily, while some may only update weekly. One disadvantage of this system is that it is not known how long it takes for a new item to be propagated through the system.

    2.6: Services

    Users access RePEc data through user services. Within reason, services are free to use all, or only a subset of the RePEc data. For example, one service may only show papers that are available electronically, another may restrict the choice to act as a quality filter. In this way services implement constraints on the data, whether they be availability constraints or quality constraints. Each service has its own name, i.e. there is no "official" RePEc user service. A list of services available at the time of writing may be found in Appendix 1.

    The RePEc data may not be sold or incorporated into a product that is sold. Therefore all RePEc services are free. User services compete through quality rather than price. All RePEc archives benefit from simultaneous inclusion in all services. This leads to an efficient dissemination that a proprietary system cannot afford.
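
The idea that each service applies its own constraints to one shared dataset can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RePEc code: the record layout (dictionaries with `handle`, `series` and `file_url` keys), the `example:` handles and the two service functions are all hypothetical.

```python
# Hypothetical, simplified records; real RePEc data is encoded in ReDIF.
DATASET = [
    {"handle": "example:aaa:wpaper:0001", "series": "aaa:wpaper",
     "file_url": "http://example.org/0001.pdf"},
    {"handle": "example:aaa:wpaper:0002", "series": "aaa:wpaper",
     "file_url": None},
    {"handle": "example:bbb:journl:0001", "series": "bbb:journl",
     "file_url": "http://example.org/j1.pdf"},
]


def online_only_service(dataset):
    """Availability constraint: keep only records with a download link."""
    return [record for record in dataset if record["file_url"]]


def selective_service(dataset, accepted_series):
    """Quality constraint: keep only records from series this service accepts."""
    return [record for record in dataset if record["series"] in accepted_series]


if __name__ == "__main__":
    for record in online_only_service(DATASET):
        print(record["handle"])
```

Both services read the same data and differ only in the filter they apply, which is the sense in which services compete through quality rather than through exclusive content.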

    2.7: The documents available through RePEc

    About 160 archives in 25 countries currently participate in RePEc, some of them representing several institutions. Over 80 universities contribute their working papers, including in the U.S. Berkeley, Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State, UCLA, and Virginia. The RePEc collection also contains information on all NBER Working Papers, the CEPR Discussion Papers, the contents of the Fed in Print database of the US Federal Reserve, and complete paper series from the IMF, World Bank and OECD, as well as the contributions of many other research centers and institutes worldwide. Last but not least, RePEc also includes the holdings of EconWPA. In total, at the time of writing, over 35,000 items are downloadable. My estimate is that at least 20,000 of these items have no access restrictions to full-text. That is the largest distributed set of freely downloadable research papers in the world. Up to date figures may be obtained from the RePEc Web site.

    The bibliographic templates describing each item currently provide for papers, articles, and software components. The article templates are used to fully describe published articles, and are currently in use by the Canadian Journal of Economics, Econometrica, the Federal Reserve Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and the RAND Journal of Economics, to name only a few of the participating journals. Participation does not imply that the articles are freely available.

    The RePEc collection of metadata also contains links to several hundred "software components"--functions, procedures, or code fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and RATS languages, as well as code in FORTRAN, C and Perl. The ability to catalog and describe software components affords users of these languages the ability to search for code applicable to their problem--even if it is written in a different language. Software archives that are restricted to one language, such as those maintained by individual software vendors or volunteers, do not share that breadth. Since many programs in high-level languages may be readily translated from, say, GAUSS to MATLAB, this breadth may be very welcome to the user.

    3: Relational metadata

    The term "RePEc" was initially an acronym; it stands for Research Papers in Economics. In fact the term should now be taken to be a literal, because RePEc is about more than the description of resources. It is probably best to say that RePEc is a relational database about Economics as a "discipline". By a discipline, we mean a group of scholars, past and present, and the interrelated documents that they produce.

    We have already explored the relational aspect in Subsection 3.1. In this Subsection we are interested in refining the view of the discipline.

    One possible approach to a discipline comes from Karlsson and Krichel (1999). They have come up with a model of the discipline as consisting essentially of four elements arranged in a table:

    resource    collection
    person      institution

    A few words may help to understand that table. A "resource" is essentially any output of academic activity, be it a research document, a dataset, a computer program, i.e., anything that an academic person would claim authorship for. A "collection" is a logical grouping of resources. For example the act of peer review may be represented by a resource being included in a collection. A "person" is a physical person or a corporate body who acts as a physical person in the context of RePEc.

    These data collectively form a relational database that not only describes papers, but also the authors who write them, the institutions where they work, and so on. This leads the profession towards a shared dataset where all important aspects of economists' work may be collected.

    A second approach to the concept of a "discipline" is the idea of the "authority" as described in the ReDIF documentation. An authority is a group of persons that uses ReDIF to document its activities using a common identifier space. Therefore disciplines can be viewed as naming authorities who ensure that distinct elements in the dataset have distinct identifiers.

    From the material that we have covered in the previous section, we can draw a simple model of RePEc as

    Many archives ===> One dataset ===> Many services

    In this section we seek to understand what the dataset actually contains. Some of the contents have already been disclosed in Subsection 2.7. To understand what other data is contained in RePEc we need to dig a little deeper. The data is encoded in a format called ReDIF. We introduce this format in the first Subsection.

    3.1: A closer look at the contents

    To understand the basics of ReDIF it is best to start with an example. Here is a--carefully selected--piece of ReDIF data. (I suppress the Abstract: field to conserve space.)

    Template-Type: ReDIF-Paper 1.0
    Title: Dynamic Aspect of Growth and Fiscal 
    Author-Name: Thomas Krichel 
    Author-Person: RePEc:per:1965-06-05:THOMAS_KRICHEL
    Author-Name: Paul Levine 
    Author-WorkPlace-Name: University of Surrey
    Classification-JEL: C61; E21; E23; E62; O41 
    File-Format: application/pdf
    Creation-Date: 199603 
    Revision-Date: 199711 
    Handle: RePEc:sur:surrec:9601

    When we look at this record, the ReDIF data appears like a standard bibliographical format, with authors, title etc. The only thing that appears a bit mysterious here is the Author-Person field. This field quotes a handle that is known to RePEc. This handle leads to the record below. (I leave out a few fields to conserve space.)

    Template-Type: ReDIF-Person 1.0 
    Name-Full: KRICHEL, THOMAS 
    Name-First: THOMAS 
    Name-Last: KRICHEL 
    Postal: 1 Martyr Court
     10 Martyr Road 
     Guildford GU1 4LF 
    Workplace-Institution: RePEc:edi:desuruk
    Author-Paper: RePEc:sur:surrec:9801
    Author-Paper: RePEc:sur:surrec:9702
    Author-Paper: RePEc:sur:surrec:9601
    Author-Paper: RePEc:rpc:rdfdoc:concepts
    Author-Paper: RePEc:rpc:rdfdoc:ReDIF
    Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL

    In this record, we have the handles of documents that the person has written. This record will allow user services to list all the papers by a given author. This is obviously useful when we want to find papers that one particular author has written. It is also useful to have a central record of the person's contact details. This eliminates the need to update the relevant data elements on every document record. In fact the record on the paper template may be considered as the historical record that is valid at the time when the paper was written, but the address in the person template is the one that is currently valid.
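
To make the structure of such a record concrete, here is a minimal Python sketch that reads one template into a dictionary. It is not the official ReDIF reader. It assumes, from the examples shown here only, that every field is written as `Name: value`, that a line starting with white space continues the previous value (as in the Postal field), and that a field such as Author-Paper may be repeated. The function name `parse_template` is hypothetical.

```python
from collections import defaultdict

PERSON_RECORD = """\
Template-Type: ReDIF-Person 1.0
Name-Full: KRICHEL, THOMAS
Postal: 1 Martyr Court
 10 Martyr Road
 Guildford GU1 4LF
Workplace-Institution: RePEc:edi:desuruk
Author-Paper: RePEc:sur:surrec:9801
Author-Paper: RePEc:sur:surrec:9702
Author-Paper: RePEc:sur:surrec:9601
Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL
"""


def parse_template(text):
    """Parse one template into {field name: [values]} (assumed syntax)."""
    fields = defaultdict(list)
    last_name = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore empty lines
        if line[0].isspace() and last_name is not None:
            # Continuation line: append to the most recent value.
            fields[last_name][-1] += "\n" + line.strip()
            continue
        # Split at the first colon only; handles contain colons themselves.
        name, separator, value = line.partition(":")
        if not separator:
            raise ValueError(f"not a field line: {line!r}")
        last_name = name.strip()
        fields[last_name].append(value.strip())
    return dict(fields)


person = parse_template(PERSON_RECORD)
print(person["Author-Paper"])
```

Keeping every value in a list is a design choice for the sketch: it lets repeated fields such as Author-Paper be handled the same way as fields that occur once.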

    In the person template, we find another RePEc identifier in the "Workplace-Institution" field. This points to another record that describes the institution:

    Template-Type: ReDIF-Institution 1.0 
    Primary-Name: University of Surrey
    Primary-Location: Guildford
    Secondary-Name: Department of Economics
    Secondary-Phone: (01483) 259380
    Secondary-Fax: (01483) 259548
    Secondary-Postal: Guildford, Surrey GU2 5XH
    Handle: RePEc:edi:desuruk

    It would take us too far here to discuss this record in more detail. It is probably more interesting to know where these records come from. This is what I address in Subsection 3.2 and Subsection 3.3 for the institutional and the personal data, respectively.
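
The chain of references from paper to person to institution can be followed with a simple lookup by handle. The Python sketch below assumes that the three records have already been read into dictionaries keyed by their handles. That in-memory layout and the helper `describe_authors` are illustrative assumptions, not part of RePEc; the handles and field names are the ones quoted in the records above.

```python
# Assumed in-memory form of the three records, keyed by handle.
RECORDS = {
    "RePEc:sur:surrec:9601": {
        "Template-Type": "ReDIF-Paper 1.0",
        "Author-Person": ["RePEc:per:1965-06-05:THOMAS_KRICHEL"],
    },
    "RePEc:per:1965-06-05:THOMAS_KRICHEL": {
        "Template-Type": "ReDIF-Person 1.0",
        "Name-Full": "KRICHEL, THOMAS",
        "Workplace-Institution": "RePEc:edi:desuruk",
    },
    "RePEc:edi:desuruk": {
        "Template-Type": "ReDIF-Institution 1.0",
        "Primary-Name": "University of Surrey",
        "Secondary-Name": "Department of Economics",
    },
}


def describe_authors(paper_handle, records):
    """Return (author name, current workplace) pairs for registered authors."""
    result = []
    for person_handle in records[paper_handle].get("Author-Person", []):
        person = records.get(person_handle)
        if person is None:
            continue  # dangling handle: no person record in the dataset
        institution = records.get(person.get("Workplace-Institution"), {})
        names = (institution.get("Secondary-Name"), institution.get("Primary-Name"))
        workplace = ", ".join(part for part in names if part)
        result.append((person["Name-Full"], workplace))
    return result


print(describe_authors("RePEc:sur:surrec:9601", RECORDS))
```

Because the workplace is read from the person record rather than from the paper record, the result reflects the currently valid affiliation, while the paper template keeps the historical one.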

    3.2: Institutional registration

    The registration of institutions is accomplished through the EDIRC project. The acronym stands for "Economics Departments, Institutions and Research Centers". This dataset has been compiled by Christian Zimmermann, an Associate Professor of Economics at Université du Québec à Montréal on his own account, as a public service to the economics profession. The initial intention was to compile a directory with all economics departments that have a Web presence. Since there are many departments that have a Web presence now, a large number are now registered, about 5,000 of them at the time of this writing. All these records are included in RePEc. For all institutions, data on their homepage is available, as well as postal and telephone information. For some, there is even data on their main area of work. Thus it is possible to find a list of institutions where--for example--a lot of work in labor economics is being done. At the moment, EDIRC is mainly linked to the rest of the RePEc data through the HoPEc personal registration service. Other links are possible, but are rarely used.

    3.3: Personal registration

    HoPEc has a different organization from EDIRC. It is impossible for a single academic to register all persons who are active in economics. One possible approach would be to ask archives to register people who work at their institution. This will make archive maintainers' work more complicated, but the overall maintenance effort will be smaller once all authors are registered. However, authors move between archives, and many have work that appears in different archives. To date there is no satisfactory way to deal with moving authors. Therefore the author registration is carried out using a centralized system.

    A person who is registered with HoPEc is identified by a string that is usually close to the person's name, and by a date that is significant to the registrant. HoPEc suggests the birthday but any other date will do as long as the person can remember it. When registrants work with the service, they first supply some personal information. The data that is requested is mainly the name, the URL of the registrant's homepage and the email address. Registrants are free to enter data about their academic interests--using the Journal of Economic Literature Classification Scheme--and the EDIRC handle of their primary affiliation.

    When the registrant has entered this data, the second step is to create associations between the record of the registrant and the document data that is contained in RePEc. The most common association is the authorship of a paper. However, other associations are possible, like for example the editorship of a series. The registration service then looks up the name of the registrant in the RePEc document database. The registrant can then decide which potential associations are relevant. The authentication methods are weak. HoPEc relies on honesty.
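
The lookup step described above can be sketched as a name match against the Author-Name fields of the document data. The Python below is a rough illustration, not the HoPEc implementation: the matching rule (same set of lower-cased words, in any order), the function names and the `example:` handles are assumptions. The rule as written only handles unaccented Latin letters.

```python
import re


def normalize(name):
    """Lower-case a name and reduce it to a sorted tuple of its words."""
    return tuple(sorted(re.findall(r"[a-z]+", name.lower())))


def candidate_papers(registrant_name, papers):
    """Return handles of papers with an Author-Name matching the registrant."""
    wanted = normalize(registrant_name)
    return [
        handle
        for handle, author_names in papers.items()
        if any(normalize(author) == wanted for author in author_names)
    ]


# Hypothetical extract of the document data: handle -> Author-Name values.
PAPERS = {
    "RePEc:sur:surrec:9601": ["Thomas Krichel", "Paul Levine"],
    "example:aaa:wpaper:0001": ["KRICHEL, THOMAS"],
    "example:aaa:wpaper:0002": ["Paul Levine"],
}

candidates = candidate_papers("Thomas Krichel", PAPERS)
print(candidates)
```

The function only proposes candidates. As in the text, the decision on which associations are relevant stays with the registrant, since two different people may share a name.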

    There are several significant problems that a service like HoPEc faces. First, since there is no historical precedent for such a service, it is not easy to communicate the raison d'être of the service to a potential registrant. Some people think that they need to register in order to use RePEc services. While this delivers valuable information about who is interested in using RePEc services--or more precisely who is too dumb to grasp that these services do not require registration--it clutters the database with records of limited usefulness. Last but by no means least, there are all kinds of privacy issues involved in the composition of such a dataset. For example, Sune Karlsson has informed me that setting up a database such as HoPEc would be illegal in Sweden.

    To summarize, HoPEc provides information about persons' identity, affiliation and research interests, and links these data with resource descriptions in RePEc. This makes it possible to identify persons and update their metadata in a timely and cost-efficient way. These data could also fruitfully be employed for other purposes, such as maintaining membership data for scholarly societies, or lists of conference participants. It is hoped that the HoPEc data will be used in that way in the future as a shared pool of common personal data.

    3.4: The nature of RePEc revisited

    The EDIRC and HoPEc projects have a dual rôle. They both provide data to the RePEc dataset and they provide an interface for users. Thus they are both at odds with the three-layer model I outlined in the opening of the section. In fact the HoPEc service is more at odds with the model because it proposes adding data through a user service.

    4: The Open Library

    This section of the paper is somewhat more theoretical. It sets out a body of thought that is built on the experience of RePEc. It is an attempt to find a general theory that could apply in a wide set of circumstances in which similar systems are desirable. I call this general concept "Open Library". The parallel to the "Open Source" concept is obvious and intentional. In more ways than one, the Open Library intends to be to bibliographic data--a term that should be taken here in the widest sense--what the Open Source is to computer code. I will come back to the parallel between Open Source and Open Library in the concluding Section 5.

    4.1: The definition of the Open Library

    The Open Library is based on a view of the world where any object that one is potentially interested in can be conceptually divided into its body and its data. The body is either the physical manifestation of the object or--in case where such a manifestation does not exist or is irrelevant--it is the part of the object that is non-exclusive use. By non-exclusive use I mean simultaneous use by any number of people.

    Consider as an example Thomas Krichel. Then the body is his physical body, and the data are the written works that he leaves open to be seen, the way he looks, what he says. If we look at a more famous economist, say David Ricardo, then his body is not important, because he is dead. However he lives on through his data. As a final example, take the Journal of Monetary Economics, published by Elsevier. In that example, there is no physical manifestation; rather, the body is the set of papers in the journal itself. The data are the fact that a journal exists, that it has xyz famous economist as its editor, and maybe the title of papers etc.

    The data aspect of an object has a public-domain characteristic that the body of the object does not have. The data aspect can therefore be summarized. Such a summary of the data aspect of an object is usually called metadata. Although the distinction between metadata and data is sometimes arbitrary, it remains a useful concept here.

    An Open Library can then be defined as follows. An Open Library is a collection of metadata records about the data aspect of interesting objects that has the following characteristics:

  • Every record is identified by a unique handle.
    This requirement distinguishes the library from an archive. It allows for every record to be addressed in an unambiguous way. This is important if links between records are to be established.

  • Records have a homogeneous syntax of field names and field values.
    This requirement constrains the Open Library to appear like a database. If this requirement were not present, all public access pages on the WWW would form an Open Library. Note that this requirement does not constrain the Open Library to contain a homogeneous record format.

  • The documentation of the record format is available for online public access.
    For example, a collection encoded in MARC format would not qualify as an Open Library because access to the documentation of MARC is restricted. Without this requirement the cost of acquiring the documentation would be an obstacle to participation.

  • The collection is accessible on a public access computer system.
    This is the precondition to allow for the construction of user services. Note that user services may not necessarily be open to public access.

  • Contributions to the collection may be made without monetary cost.
    There are of course non-monetary costs to contribute to the Open Library. However the general principle is that there is no need to pay for either contributing or using the library. The copyright status of data in an Open Library should be subject to further research.
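
The first two characteristics lend themselves to a mechanical check. The Python sketch below is only one possible reading of them: it assumes records are dictionaries with a `Handle` key, and it takes "homogeneous syntax" to mean that every non-empty line has the form `Field-Name: value`, ignoring continuation lines for simplicity. The function names and the regular expression are hypothetical. The remaining three characteristics concern access and cost and cannot be tested this way.

```python
import re
from collections import Counter

# Assumed field syntax: a name made of letters, digits and hyphens,
# followed by a colon, a space and a non-empty value.
FIELD_LINE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*: .+$")


def duplicate_handles(records):
    """Return the handles that are used by more than one record."""
    counts = Counter(record["Handle"] for record in records)
    return sorted(handle for handle, n in counts.items() if n > 1)


def malformed_lines(text):
    """Return the non-empty lines that do not look like 'Field-Name: value'."""
    return [
        line
        for line in text.splitlines()
        if line.strip() and not FIELD_LINE.match(line)
    ]


if __name__ == "__main__":
    sample = [{"Handle": "a"}, {"Handle": "b"}, {"Handle": "a"}]
    print(duplicate_handles(sample))
    print(malformed_lines("Title: X\nno colon here\nHandle: a"))
```

An empty result from both functions would mean that, under these assumptions, every record can be addressed unambiguously and read by the same parser.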

    4.2: The Open Library and the Open Archive

    Stimulated by work of Van de Sompel, Krichel, Nelson et al. (2000), there have been recent moves towards improving the interoperability of e-print archives such as arXiv.org, NCSTRL, RePEc etc. This work is now called the Open Archive Initiative (OAi), a term coined by Stevan Harnad. The basic business model proposed by the OAi is very close to the RePEc project. However there is no official formal definition of what an Open Archive is. There are two possible approaches.

    The Open Archive initiative grew out of e-print initiatives that provide archiving of academic documents on the Internet. However the basic idea of an Open Archive appears to be much more general. Any archive that would store digital objects and that would allow users to poll metadata for these objects could be called an Open Archive. This implies that the Open Archive is a more general concept than the Open Library. Different Open Archives may contain very heterogeneous metadata for highly diverse objects. This heterogeneity may make it impractical or undesirable to combine data for all archives in an Open Library that would require a homogeneous metadata format.

    A more restrictive view of an Open Archive may be that it is any archive that implements the Santa Fe Convention (SFC), as set out by the OAi. If an archive is open in the sense that it implements the SFC, then it will deliver a minimal metadata set for all its objects. All SFC compatible archives supply at least a minimum metadata set, called the Open Archives Metadata Set. Implementing the agreement also means identifying each object in the library. This implies that the data held on all SFC compliant archives forms an Open Library. However that Open Library built on the minimal metadata set will hopefully not be the only Open Library to be built on the objects held in SFC compatible Open Archives.

    4.3: The conceptual challenge raised by the Open Library

    The Open Library as defined in Subsection 4.1 may be a relatively obvious concept. It certainly is not an elaborate intellectual edifice. Nevertheless the Open Library idea raises some interesting conceptual challenges.

    4.3.1: Supply of information

    To me as an outsider to the Library and Information Studies (LIS) discipline, there appears to be a tradition of emphasizing the behavior of the user who demands information rather than the publisher--I use the word here in its widest sense--who supplies it. I presume this orientation comes from the tradition that almost all bibliographic data were supplied by commercial companies, just as were the documents that they describe. Libraries then see their rôle as intermediaries between the commercial supply and the general public. In that scenario libraries take the supply of documents and metadata as given.

    The Open Library proposes to build new supply chains for metadata. If libraries want to have a share in the Open Library, they will need to become more active in the supply of metadata. If all libraries contribute metadata about objects that are local to them--what that means would have to be defined--then a large Open Library can be built.

    An Open Library will only be as good as the data that contributors will give to it. It is therefore important that research be conducted on what data contributors are able to contribute; on how to provide documentation that the contributor can understand; and on understanding a contributor's motivation.

    4.3.2: Digital updatability

    For a long time the library profession has purchased material that is essentially static. It may be subject to physical decay, but the content that it carries is immutable. Digital resources made their mass appearance only a few years ago. These resources may be changed at any time. The change from static to dynamic resources is a major challenge for the LIS profession. Naturally the inclination has been to demand that digital resources be like non-digital resources in all but their physical medium. The debate on digital preservation is a result of that demand. Thus the dynamic nature of digital metadata has been seen more as a threat than as an opportunity. The Open Library is more concerned with digital updatability than with digital preservation.

    4.3.3: Metadata quality control

    In the case of a decentralized dataset, an important problem is to maintain metadata quality. Some elements of metadata quality cannot be controlled by a computer, but others can. One such element is the integrity of handles, another is the use of controlled vocabulary. A metadata system has to be designed that allows for syntactical and relational checks on the metadata. ReDIF allows for syntactical control, but not as yet for relational control. Highly controllable metadata systems are an important research concern that is related to the Open Library concept. One way to achieve quality control is through the use of relational metadata. Each record has an identifier. Records can use the identifiers of other records. It is then possible to update elements of the dataset independently of each other. It is also quite trivial to check whether the handle referenced in one record corresponds to a valid handle in the dataset.
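
    The relational check on handles described above can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout, the field name author_handles, the "demo:" handles and the function find_dangling_references are all hypothetical and do not reproduce ReDIF syntax or any RePEc software.

```python
# Hypothetical records: each one carries its own handle and may
# reference the handles of other records (illustrative layout only).
RECORDS = [
    {"handle": "demo:person:1", "name": "A. Author"},
    {"handle": "demo:paper:1", "title": "First paper",
     "author_handles": ["demo:person:1"]},
    {"handle": "demo:paper:2", "title": "Second paper",
     "author_handles": ["demo:person:1", "demo:person:9"]},
]


def find_dangling_references(records, reference_fields=("author_handles",)):
    """Return (record handle, field, target) for every referenced handle
    that does not correspond to a record in the dataset."""
    known = {record["handle"] for record in records}
    dangling = []
    for record in records:
        for field in reference_fields:
            for target in record.get(field, []):
                if target not in known:
                    dangling.append((record["handle"], field, target))
    return dangling


if __name__ == "__main__":
    for source, field, target in find_dangling_references(RECORDS):
        print(f"{source}: {field} refers to unknown handle {target}")
```

    Because every record is addressed only through its handle, a contributor can correct the record of a person without touching the papers that refer to it; the check only needs the set of handles currently present in the dataset.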

    5: Conclusions

    When Richard Stallman founded the GNU project in 1984 to write a free operating system to replace Unix, few believed that such an operating system would ever come about. The same may hold for my audience today, when I am calling for an Open Library. But remember that in the late 1990s the Open Source movement basically realized Stallman's dream.

    Building GNU took a long time. But the obstacles that faced the open source movement were much more daunting than the obstacles facing the Open Library movement:

  • The structural complexity of the operating system of a modern computer is much higher than the structural complexity of a metadata collection.
  • Computer programming is a highly profitable activity for the individual who is capable of doing it. The opportunity cost of participating in what is essentially an unpaid activity is therefore high. This cost is much lower for the academic or the academic librarian who would participate in the construction of an Open Library.
  • There is a network effect that arises when the Open Library has reached a critical mass. At some stage the cost of providing data is much smaller than the benefit--in terms of more efficient dissemination--of contributing data. When that stage is reached, the Open Library can grow without external public or private subsidy.

    It remains to be seen how much inroad the Open Library concept will make. I will dedicate a considerable part of my professional energies towards advancing the concept. I have registered the domain to serve as a repository for Open Library work. Naturally the work of the Open Library will start with academic metadata. is therefore sponsoring work on an academic metadata format (

    This paper was presented at the PEAK conference at the University of Michigan on 2000-03-24. On 2001-03-05, I made cosmetic changes to this document as suggested by Jeffrey K. MacKie-Mason. These suggestions have much improved the readability of the paper without updating its contents.