Working towards an Open Library for Economics: The RePEc project
After arXiv.org, the RePEc Economics library offers the
|second-largest source of freely downloadable scientific preprints in the
||world. RePEc has a different business model and a different content
||coverage than arXiv.org. This paper addresses both differences.
As far as the business model is concerned, RePEc is an instance of a
|concept that I call the "Open Library". An Open Library is open in
||two ways. It is open for contribution (third parties can add to it),
||and it is open for implementation (many user services may be
||created). Conventional libraries--including most digital
||libraries--are closed in both directions.
As far as the content coverage is concerned, RePEc seeks to build a
|relational dataset about scholarly resources and other aspects of
||reality that are related to these resources. This basically means
||identifying all authors, all papers and all institutions that work in
||economics. Such an ambitious project can only be achieved if the cost
||to collect metadata is decentralized and low, and if the benefits to
||supply metadata are large. The Open Library provides a framework where
||these conditions are fulfilled.
This paper is
|available in PDF for A4
||paper, and for letter paper.
Digital or digitisable data is supplied by publishers, to be consumed
|by readers. Reports of research results in research "papers" form
||the bulk of academic digital or digitisable data, and I will refer to
||these as documents in the following.
In this paper I am not concerned with the demand for documents, nor am
|I concerned with the supply of documents. Instead I focus on the
||supply of information about documents. For some documents holding
||detailed information about the document is as good as holding the
||document itself. This is typically the case when the document can be
||accessed on the Internet without any access restriction. Such a
||document will be called a public access document. Collecting metadata
||is therefore particularly relevant for public access documents.
The main idea that is brought forward in this paper is the "Open
|Library". Basically an Open Library is a collaborative framework for the
||supply and usage of document metadata. Stated in this way the idea of
||the Open Library is quite trivial. To fully appreciate the concept, it
||is useful to study one Open Library in more detail. My example is the
||RePEc dataset about Economics. In
||Section 2 I introduce RePEc as a document
||metadata collection. In Section 3 I push the
||RePEc idea further. I discuss the extension of RePEc that allows one to
||describe the discipline, rather than simply the documents that are
||produced by the members of the discipline. In Section
||4 I make an attempt to define the Open Library more
||precisely. The example of RePEc demonstrates
||the relevance of the Open Library concept. I conclude the
||paper in Section 5.
In this article I have deliberately chosen to stay away from many
|technical issues that arise in the context of RePEc and that have to
||be addressed by any Open Library. Some more technical aspects have
||nevertheless been included in the appendices. Appendix
||1 lists the main RePEc user
||services. Appendix 2 discusses technical
||aspects of the ReDIF metadata format that is used by RePEc.
The efforts of which RePEc is the result go back to 1992. I
|deliberately stayed away from a description of the history of the work
||to concentrate on the current status. Therefore insufficient
||attribution is given to the people who have earned historic merits by
||contributing to the RePEc effort.
2: The RePEc document dataset
2.1: Origin and motivation of RePEc
A scholarly communication system brings together producers and consumers of
|documents. For the majority of the documents, the producers do not receive a
||monetary reward. Their effort is compensated through a wide circulation of the
||document and a mark of peer approval for it. Dissemination and peer approval
||are the dual functions of the communication system.
Scholarly communication in economics has largely been
|journal-based. Peer review plays a crucial rôle. Thorough peer
||review is expensive in time. According to
||(1993), it is common that a paper takes over
||three years from submission to publication in an academic journal, not
||counting rejections. From informal evidence publication delays have
||slowly risen, but stabilized in the past few years as journal editors
||have fought hard to cut down on what have been perceived to be
Researchers at the cutting edge cannot rely solely on journals to
|keep abreast of the frontier of research. Prepublication through
||discussion papers or conference proceedings is now commonplace. Access to
||this informally disseminated research is often limited to a small number of
||readers. It relies on the good will of active researchers to disseminate
||their work. Since good will is in short supply, insider circles are common.
This time gap between informal distribution and formal publication can
|only fundamentally be resolved by reforming the quality control
||process. The inconvenience resulting from the delay can however be
||reduced by improving the efficiency of the informal communication
||system. This is the initial motivation behind the RePEc
||project. Therefore its traditional emphasis has been on documents that
||have not gone through peer review channels. These documents are often
||available for public access, and therefore a metadata collection is
2.2: Towards an Internet-based scholarly dissemination system
The Internet is a cost-effective means for scholarly dissemination. Many
|economics researchers and their institutions have established Web sites.
||However they are not alone in offering pages on the Web. The Web has grown
||to an extent that the standard Internet search engines only cover a
||fraction of the Web, and that fraction is decreasing over time
||(Lawrence and Giles
||1999). Since much of the Economics
||research uses common terms such as "growth", "investment" or "money",
||it is likely that a subject search on the entire Web would yield an
||enormous amount of hits. There would be no practical way to find which
||pages contain economic research. Due to this low signal to noise ratio, the
||Web per se does not provide an efficient mechanism for scholarly
||dissemination. An additional classifying scheme is required to segregate
||references to materials of interest to the economics profession.
The most important type of material relevant to scholarly
|dissemination is the research paper. One way to organize this type of
||material has been demonstrated by the
||arXiv.org preprint archive, founded in 1991
||by Paul Ginsparg of the Los Alamos National Laboratory with an initial
||subject area in high energy physics. Authors use that archive to
||upload papers, which remain stored there. ArXiv.org has now assembled
||over 100,000 papers, covering a broad subject range of mathematics,
||physics and computer science, but concentrating on the original
||subject area. An attempt has been made to emulate the arXiv.org
||system in economics with the "Economics Working Paper
||Archive" (EconWPA) based at Washington
||University St. Louis. Its success has been limited. There are a
||number of potential reasons:
| Individual economists often do not issue preprints: rather, economics
||departments and research organizations issue working papers.
| Economists use a wider variety of document formatting tools than
||physicists. This reduces the functionality of online archiving and makes
||it more difficult to construct a good archive.
| In their book of tales economists read that monopolies are bad for
||welfare. Accordingly, they may be suspicious of any effort to centralize
||the output of the discipline on one central system.
| Economists are not very computer literate, therefore they will have
||problems with uploading procedures.
| Economics starts with the assumption that people are rational and
||selfish. Therefore coordination on a voluntary scheme is difficult to
An ideal system should combine the decentralized nature of
|the Web, the centralized nature of the arXiv.org archive, and a zero
||price to end users in order to reach the maximum number of
||readers. The first two objectives could be accomplished if departments
||and research centers allow for public access to their documents on the
||Internet. But then we need a library to catalog these papers. The
||library would collect what is known as "metadata": data about
||documents that are available using Internet protocols. This would
||however fail to satisfy the third objective of zero cost to users. The
||impediments to creation of that framework are thus not technical but
||economic in nature. There is no incentive for any single institution
||to bear the cost of establishing a comprehensive metadata collection,
||with aspects of a public good, without external subsidy. However,
||since every institution will benefit from participation in such an
||effort, we may solve this incentive problem by creating a virtual
||collection via a network of linked metadata archives. Each institution
||need only maintain its own collection of document metadata using a
||set of standardized templates. This strategy is the basic business
||model of the RePEc system.
2.3: RePEc as an electronic document exchange
A scholarly dissemination system on the Internet
|should start by enhancing the pre-Internet practice
||rather than attempting to replace it. The distribution of
||informal research papers in the past has been based on institutions issuing
||working papers. These are circulated through exchange arrangements. RePEc
||is a way to organize this process on the Internet. There are three concepts
||that are crucial to that process: archive, site and service.
An "archive" is a space on a public access computer system
|which makes document metadata available. It is a place where original data enters the
||system. There is no need to run any software other than an ftp or http
||daemon that makes the files in the archive available upon request. A
||"site" is a collection of archives on the same computer system. It
||usually consists of a local archive augmented by frequently updated
||("mirrored") copies of remote archives. A "service" is a rendering
||of RePEc data in a form that is available to the end user. I will
||discuss these concepts in turn.
RePEc allows researchers' departments and research institutes to
|participate in a decentralized archival scheme which makes information
||about the documents that they publish accessible via the Internet.
||Individual researchers may also open archives, but they are encouraged
||to use EconWPA. A contributor places metadata about its documents on
||a public access computer system. This is usually an anonymous ftp
||server or a Web server. This metadata must be in the ReDIF format.
||If a document is available online, a link may be provided to the place
||where the paper may be downloaded. Note that the document may not only
||be the full text of an academic paper, but it may also be an ancillary
||file, e.g. a dataset or a computer program.
Each participating institution has total control over the contents of
|its archive. There is no need to transmit documents elsewhere. The
||archive management retains the liberty to post revisions or to
||withdraw a document.
Participation does not imply that the documents are freely
|available. Thus, a number of journals have also permitted their
||contents to be listed in RePEc. If the user's institution has made the
||requisite arrangements with publishers (e.g. JSTOR for back issues of
||Econometrica or Journal of Applied Econometrics), RePEc will contain links
||to directly access the documents.
A site is a local archive plus copies of a number of remote archives
|that are regularly updated, i.e. mirrored. Everything contained in an
||archive may be mirrored. For example, if a document is in the archive,
||it may be mirrored. If the archive management does not wish the
||document to be mirrored, it can store it outside the archive. The
||advantage of this remote storage is that the archive maintainer will
||get a complete set of access logs to the file. The disadvantage is
||that every request for the file will have to be served from the local
||archive rather than from the RePEc site that the user is accessing.
An obvious way to organize the mirroring process would be to mirror
|the data of all archives to a central location. This central location
||would in turn be mirrored to the other RePEc sites. The founders of
||RePEc did not adopt that solution, because it would be quite
||vulnerable to mistakes at the central site. Instead each site installs
||the mirroring software and mirrors "on its own", so to speak. Not
||all of them adopt the same frequency of updating. Some may update
||daily, while some may only update weekly. One disadvantage of this
||system is that it is not known how long it takes for a new item to be
||propagated through the system.
Users access RePEc data through user services. Within reason,
|services are free to use all, or only a subset of the RePEc data. For
||example, one service may only show papers that are available
||electronically, another may restrict the choice to act
||as a quality filter. In this way services implement constraints on the
||data, whether they be availability constraints or quality
||constraints. Each service has its own name, i.e. there is no
||"official" RePEc user service. A list of services available at the
||time of writing may be found in Appendix 1.
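The idea that services differ only in the constraints they place on the same shared data can be sketched as follows. This is a minimal illustration, not the code of any actual RePEc service: the records and handles are hypothetical, and the use of a `File-URL` field to signal an online copy and of the template type to stand in for a quality filter are assumptions made for the example.

```python
# Hypothetical records standing in for the shared RePEc dataset.
RECORDS = [
    {"Handle": "RePEc:xxx:wpaper:0001", "Template-Type": "ReDIF-Paper 1.0",
     "Title": "Paper A", "File-URL": "http://example.org/a.pdf"},
    {"Handle": "RePEc:xxx:wpaper:0002", "Template-Type": "ReDIF-Paper 1.0",
     "Title": "Paper B"},
    {"Handle": "RePEc:xxx:journl:0003", "Template-Type": "ReDIF-Article 1.0",
     "Title": "Article C"},
]


def online_only(records):
    """Availability constraint: keep records that link to a downloadable file."""
    return [r for r in records if r.get("File-URL")]


def articles_only(records):
    """Illustrative quality constraint: keep only published articles."""
    return [r for r in records
            if r.get("Template-Type", "").startswith("ReDIF-Article")]


# Two "services" built on the same data, each applying its own constraint.
online_handles = [r["Handle"] for r in online_only(RECORDS)]
article_handles = [r["Handle"] for r in articles_only(RECORDS)]
```

Both functions read the same list and neither modifies it, which mirrors the point that every archive is included in every service simultaneously.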
The RePEc data may not be sold or incorporated into a product that is
|sold. Therefore all RePEc services are free. User services compete
||through quality rather than price. All RePEc archives benefit from
||simultaneous inclusion in all services. This leads to an efficient
||dissemination that a proprietary system cannot afford.
2.7: The documents available through RePEc
About 160 archives in 25 countries currently participate in RePEc,
|some of them representing several institutions. Over 80 universities
||contribute their working papers, including in the U.S. Berkeley,
||Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State,
||UCLA, and Virginia. The RePEc collection also contains information on
||all NBER Working Papers, the CEPR Discussion Papers, the contents of
||the Fed in Print database of the US Federal Reserve, and complete
||paper series from the IMF, World Bank and OECD, as well as the
||contributions of many other research centers and institutes worldwide.
||Last but not least, RePEc also includes the holdings of EconWPA. In
||total, at the time of writing, over 35,000 items are
||downloadable. My estimate is that at least 20,000 of these
||items have no access restrictions to full-text.
||That is the largest distributed set of freely
||downloadable research papers in the world. Up to date
||figures may be obtained from the
||RePEc Web site.
The bibliographic templates describing each item currently provide for
|papers, articles, and software components. The article templates are
||used to fully describe published articles, and are currently in use by
||the Canadian Journal of Economics, Econometrica, the Federal Reserve
||Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and
||the RAND Journal of Economics, to name only a few of the participating
||journals. Participation does not imply that the articles are freely
The RePEc collection of metadata also contains links to several
|hundred "software components"--functions, procedures, or code
||fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and
||RATS languages, as well as code in FORTRAN, C and Perl. The ability to
||catalog and describe software components affords users of these
||languages the ability to search for code applicable to their
||problem--even if it is written in a different language.
||Software archives that are restricted to one language, such as those
||maintained by individual software vendors or volunteers, do not share
||that breadth. Since many programs in high-level languages may be
||readily translated from, say, GAUSS to MATLAB, this breadth may be
||very welcome to the user.
3: Relational metadata
The term "RePEc" was initially an acronym; it stands for Research Papers
|in Economics. In fact the term should now be taken to be a literal,
||because RePEc is about more than the
||description of resources. It is probably best to say that RePEc is a
||relational database about Economics as a "discipline". By
||a discipline, we mean a group of scholars, past and present,
||and the interrelated documents that they produce.
We have already
|explored the relational aspect in Subsection 3.1.
||In this Subsection we are interested in refining the view of the
One possible approach to a discipline comes from
|Karlsson and Krichel
||(1999). They have come
||up with a model of the discipline, as consisting essentially of four
||elements arranged in a table
A few words may help to understand that table. A "resource" is
|essentially any output of academic activity, be it a research document, a
||dataset, a computer program, i.e., anything that an academic person would
||claim authorship for. A "collection" is a logical grouping of
||resources. For example the act of peer review may be represented by a
||resource being included in a collection. A "person" is a physical person
||or a corporate body who acts as a physical person in the context of RePEc.
These data collectively form a relational database that not only describes
|papers, but also the authors who write them, the institutions where they
||work, and so on. This leads the profession towards a shared dataset where
||all important aspects of economists' work may be collected.
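A minimal sketch of such a relational dataset is given below. It is only an illustration under stated assumptions: the class and attribute names are hypothetical, the handles are invented, and the institution is included as an element because the surrounding text mentions institutions, not because the table itself is reproduced here. Records refer to one another by handle rather than by embedding, which is what makes the dataset relational.

```python
from dataclasses import dataclass, field


@dataclass
class Institution:
    handle: str
    name: str


@dataclass
class Person:
    handle: str
    name: str
    workplace: str = ""  # handle of an Institution


@dataclass
class Resource:
    handle: str
    title: str
    authors: list = field(default_factory=list)  # handles of Persons


@dataclass
class Collection:
    handle: str
    name: str
    members: set = field(default_factory=set)  # handles of Resources

    def includes(self, resource):
        """True if the resource has been admitted to this collection."""
        return resource.handle in self.members


# Peer review represented as inclusion of a resource in a collection.
paper = Resource("res:1", "A refereed paper", authors=["per:1"])
draft = Resource("res:2", "A working paper", authors=["per:1"])
journal = Collection("col:1", "A hypothetical journal")
journal.members.add(paper.handle)
```

Here `journal.includes(paper)` holds while `journal.includes(draft)` does not, although both resources point to the same person handle.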
A second approach to the concept of a "discipline" is the idea of the
|"authority" as described in the ReDIF documentation. An authority is a
||group of persons who use ReDIF to document their activities using a common
||identifier space. Therefore disciplines can be viewed as naming
||authorities who ensure that any two elements in the dataset that are
||different have different identifiers.
From the material that we have covered in the previous section,
|we can draw a simple model of RePEc as
||Many archives ===> One dataset ===> Many services
In this section we seek to understand what the dataset actually
|contains. Some of the contents have already been disclosed in
||Subsection 2.7. To understand what other data is
||contained in RePEc we need to dig a little deeper. The data is encoded in
||a format called ReDIF. We introduce this format in the first Subsection.
3.1: A closer look at the contents
To understand the basics of ReDIF it is best to start with an
|example. Here is a--carefully selected--piece
||of ReDIF data at ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.rdf. (I suppress
||the Abstract: field to conserve space.)
|Template-Type: ReDIF-Paper 1.0
||Title: Dynamic Aspect of Growth and Fiscal
||Author-Name: Thomas Krichel
||Author-Name: Paul Levine
||Author-WorkPlace-Name: University of Surrey
||Classification-JEL: C61; E21; E23; E62; O41
|| When we look at this record, the ReDIF data appears like a
||standard bibliographical format, with authors, title etc. The only thing
||that appears a bit mysterious here is the Author-Person field. This field
||quotes a handle that is known to RePEc. This handle leads to a
||record maintained at
||ftp://netec.mcc.ac.uk/pub/RePEc/per/pers/RePEc_per_1965-06-05_THOMAS_KRICHEL.rdf (I leave out a few
||fields to conserve space.)
|Template-Type: ReDIF-Person 1.0
||Name-Full: KRICHEL, THOMAS
||Postal: 1 Martyr Court
|| 10 Martyr Road
|| Guildford GU1 4LF
In this record, we have the handles of documents that the person has
|written. This record will allow user services to list all papers
||by a given author. This is obviously useful when we want to find papers
||that one particular author has written. It is also useful to have a central
||record of the person's contact details. This eliminates the need to update
||the relevant data elements on every document record. In fact the record on the
||paper template may be considered as the historical record that is valid at
||the time when the paper was written, but the address in the
||person template is the one that is currently valid.
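The way a user service might follow a handle from a paper record to a person record can be sketched as follows. This is a simplified reader written for illustration, not the official ReDIF software: it assumes one `Field: value` pair per line, continuation lines that start with whitespace, and a `Template-Type` line opening each record. The handle values are hypothetical, and the `Author-Person` and `Handle` lines are added to the sample because the text refers to them.

```python
def parse_redif(text):
    """Split ReDIF-like text into templates, each a dict of field -> list of values."""
    templates, current, last_field = [], None, None
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines carry no data
        if line[0].isspace():
            # Continuation line: extend the most recent value of the last field.
            if current is not None and last_field is not None:
                current[last_field][-1] += " " + line.strip()
            continue
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        name, value = name.strip(), value.strip()
        if name == "Template-Type":
            current, last_field = {}, None
            templates.append(current)
        if current is None:
            continue  # ignore anything before the first template
        current.setdefault(name, []).append(value)
        last_field = name
    return templates


SAMPLE = """\
Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal
Author-Name: Thomas Krichel
Author-Person: RePEc:per:HYPOTHETICAL-1
Author-Name: Paul Levine
Handle: RePEc:xxx:wpaper:0001

Template-Type: ReDIF-Person 1.0
Name-Full: KRICHEL, THOMAS
Postal: 1 Martyr Court
 10 Martyr Road
Handle: RePEc:per:HYPOTHETICAL-1
"""

templates = parse_redif(SAMPLE)
by_handle = {t["Handle"][0]: t for t in templates if "Handle" in t}
paper = templates[0]
# Follow each Author-Person handle to the person record it names.
authors = [by_handle[h]["Name-Full"][0]
           for h in paper.get("Author-Person", []) if h in by_handle]
```

Repeated fields such as `Author-Name` are kept as lists, so the order of authors in the record is preserved.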
In the person template, we find another RePEc identifier in the
|"Workplace-Institution" field. This points to another record
||that describes the institution
|Template-Type: ReDIF-Institution 1.0
||Primary-Name: University of Surrey
||Secondary-Name: Department of Economics
||Secondary-Phone: (01483) 259380
||Secondary-Fax: (01483) 259548
||Secondary-Postal: Guildford, Surrey GU2 5XH
It would take us too far here to discuss this record in more
|detail. It is probably more interesting to know where these records come from.
||This is what I address in Subsection 3.2 and
||Subsection 3.3 for the institutional and the personal
3.2: Institutional registration
The registration of institutions is accomplished through the EDIRC project. The
|acronym stands for "Economics Departments, Institutions and Research
||Centers". This dataset has been compiled by Christian
||Zimmermann, an Associate
||Professor of Economics at Université du Québec à
||Montréal on his own account, as a public service to
||the economics profession. The initial intention was to compile a directory
||with all economics departments that have a Web presence. Since there are
||many departments that have a Web presence now, a large number are now
||registered, about 5,000 of them at the time of this
||writing. All these records are
||included in RePEc. For all institutions, data on their homepage is
||available, as well as postal and telephone information. For some, there is
||even data on their main area of work. Thus it is possible to find a list of
||institutions where--for example--a lot of work in labor
||economics is being done. At the moment, EDIRC is mainly linked to the rest
||of the RePEc data through the HoPEc personal registration service. Other
||links are possible, but are rarely used.
3.3: Personal registration
HoPEc has a different
|organization from EDIRC. It is impossible for a single academic to register
||all persons who are active in economics. One possible approach would be to
||ask archives to register people who work at their institution. This will
||make archive maintainers' work more complicated, but the overall
||maintenance effort will be smaller once all authors are registered.
||However, authors move between archives, and many have work that appears in
||different archives. To date there is no satisfactory way to
||deal with moving authors. Therefore the author registration
||is carried out using a centralized system.
A person who is registered with HoPEc is identified by a string that
|is usually close to the person's name, and by a date that is
||significant to the registrant. HoPEc suggests the birthday but any
||other date will do as long as the person can remember it. When
||registrants work with the service, they first supply some personal
||information. The data that is requested is mainly the name, the URL of
||the registrant's homepage and the email address. Registrants are free
||to enter data about their academic interests--using the
||Journal of Economic Literature Classification--and the
||EDIRC handle of their primary affiliation.
When the registrant has entered this data, the second step is to
|create associations between the record of the registrant and the
||document data that is contained in RePEc. The most common association
||is the authorship of a paper. However, other associations are
||possible, like for example the editorship of a series. The
||registration service then looks up the name of the registrant in the
||RePEc document database. The registrant can then decide which
||potential associations are relevant. The authentication methods are
||weak. HoPEc relies on honesty.
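The lookup-and-confirm step can be sketched as follows. The matching rule used here (ignore case, commas and word order) is an assumption made for the example; how HoPEc actually matches names is not described in this paper. The document records and handles are hypothetical.

```python
def normalize(name):
    """Reduce a name to a case- and order-insensitive key,
    so that 'KRICHEL, THOMAS' and 'Thomas Krichel' compare equal."""
    return tuple(sorted(name.replace(",", " ").lower().split()))


def candidate_papers(registrant_name, documents):
    """Handles of documents whose author list contains a matching name."""
    key = normalize(registrant_name)
    return [d["handle"] for d in documents
            if any(normalize(a) == key for a in d["authors"])]


def confirm(candidates, accepted):
    """Keep only the associations the registrant has accepted."""
    return [h for h in candidates if h in accepted]


DOCUMENTS = [
    {"handle": "RePEc:xxx:wpaper:0001", "authors": ["Thomas Krichel", "Paul Levine"]},
    {"handle": "RePEc:xxx:wpaper:0002", "authors": ["Paul Levine"]},
    {"handle": "RePEc:xxx:wpaper:0003", "authors": ["Krichel, Thomas"]},
]

candidates = candidate_papers("KRICHEL, THOMAS", DOCUMENTS)
associations = confirm(candidates, accepted={"RePEc:xxx:wpaper:0001"})
```

The service only proposes candidates; the final list depends entirely on what the registrant accepts, which is why the scheme relies on honesty.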
There are several significant problems that a service like HoPEc
|faces. First since there is no historical precedent for such a service, it
||is not easy to communicate the raison d'être of the service to a
||potential registrant. Some people think that they need to register in
||order to use RePEc services. While this delivers valuable information about
||who is interested in using RePEc services--or more precisely who is
||too dumb to grasp that these services do not require
||registration--it clutters the database with records of limited
||usefulness. Last but by no means least, there are all kinds of privacy
||issues involved in the composition of such a dataset. For example, Sune
||Karlsson has informed me that setting up
||a database such as HoPEc would be illegal in Sweden.
To summarize, HoPEc provides information about persons' identity,
|affiliation and research interests, and links these data with resource
||descriptions in RePEc. This makes it possible to identify persons and update
||their metadata in a timely and cost efficient way. These data could also
||fruitfully be employed for other purposes, such as maintaining membership
||data for scholarly societies, or lists of conference participants. It is
||hoped that the HoPEc data will be used in that way in the future as a
||shared pool of common personal data.
3.4: The nature of RePEc revisited
The EDIRC and HoPEc projects have a dual rôle. They both provide
|data to the RePEc dataset and they provide an interface for
||users. Thus they are both at odds with the three-layer model I
||outlined in the opening of the section. In fact the HoPEc service is
||more at odds with the model because it proposes adding data through a
4: The Open Library
This section of the paper is somewhat more theoretical. It sets out a body
|of thought that is built on the experience of RePEc. It is an attempt to
||find a general theory that could apply in a wide set of circumstances in
||which similar systems are desirable. I call this general concept "Open
||Library". The parallel to the "Open Source" concept is obvious and
||intentional. In more ways than one, the Open Library intends to be to
||bibliographic data--a term that should be taken here in the widest
||sense--what the Open Source is to computer code. I will
||come back to the parallel between Open Source and Open Library in the
||concluding Section 5.
4.1: The definition of the Open Library
The Open Library is based on a view of the world where any object that one
|is potentially interested in can be conceptually divided into its body and
||its data. The body is either the physical manifestation of the object
||or--in cases where such a manifestation does not exist or is
||irrelevant--it is the part of the object that is non-exclusive use.
||By non-exclusive use I mean simultaneous use by any number of people.
Consider as an example Thomas Krichel. Then the body is his physical body,
|and the data are the written works that he leaves open to be seen, the way
||he looks, what he says. If we look at a more famous economist say David
||Ricardo, then his body is not important, because he is dead. However he
||lives on through his data. As a final example, take the Journal of Monetary
||Economics, published by Elsevier. In that example, there is no physical
||manifestation but rather, the body are the papers in the journal
||itself. The data are the fact that a journal exists, that it has xyz famous
||economist as its editor, and maybe the title of papers etc.
The data aspect of an object has a public-domain characteristic that the
|body of the object does not have. The data aspect can therefore be
||summarized. Such a summary of the data aspect of an object is usually
||called metadata. Although the distinction between metadata and data is
||sometimes arbitrary, it remains a useful concept here.
An Open Library can then be defined as follows. An Open Library is a
|collection of metadata records about the data aspect of interesting
||objects that has the following characteristics:
| Every record is identified by a unique handle.
||This requirement distinguishes the library from an archive. It
||allows for every record to be addressed in an unambiguous way. This is
||important if links between records are to be established.
| Records have a homogeneous syntax of field names and field values.
||This requirement constrains the Open Library to appear like a
||database. If this requirement were not present, all public access
||pages on the WWW would form an Open Library. Note that this
||requirement does not constrain the Open Library to contain a
||homogeneous record format.
| The documentation of the record format is available for
||online public access.
||For example, a collection encoded in MARC format
||would not qualify as an Open Library because access to the documentation of
||MARC is restricted.
||Without this requirement the cost of acquiring the documentation would be an
||obstacle to participation.
| The collection is accessible on a public access computer system.
||This is the precondition to allow for the construction of user services.
||Note that user services may not necessarily be open to public access.
| Contributions to the collection may be made without monetary cost.
||There are of course non-monetary costs to contribute to the Open
||Library. However the general principle is that there is no need to pay
||for either contributing or using the library. The copyright status of
||data in an Open Library should be subject to further research.
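The first two characteristics lend themselves to a mechanical check, sketched below. This is an illustration only: the function names are hypothetical, and the line pattern assumes a simple `Field-Name: value` syntax with whitespace-indented continuation lines, which is a simplification and not a statement of any particular format's full grammar.

```python
import re
from collections import Counter

# Assumed syntax: a field name of letters, digits and hyphens, then ": ", then a value.
FIELD_LINE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*: .*$")


def duplicate_handles(records):
    """Handles used by more than one record (violates unique identification)."""
    counts = Counter(r["Handle"] for r in records)
    return sorted(h for h, n in counts.items() if n > 1)


def malformed_lines(raw_text):
    """Lines that are neither 'Field-Name: value', a continuation, nor blank."""
    return [ln for ln in raw_text.splitlines()
            if ln.strip() and not ln[0].isspace() and not FIELD_LINE.match(ln)]


dupes = duplicate_handles([{"Handle": "a"}, {"Handle": "b"}, {"Handle": "a"}])
bad = malformed_lines("Title: X\n continuation\nnot a field\n\nHandle: h:1")
```

The remaining characteristics (public documentation, public access, and free contribution) concern how the collection is run and cannot be verified from the records alone.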
4.2: The Open Library and the Open Archive
Stimulated by work of
|Van de Sompel, Krichel,
||Nelson et al. (2000), there have been recent
||moves towards improving the interoperability of e-print archives such
||as arXiv.org, NCSTRL, RePEc etc. This
||work is now called the
||Open Archive Initiative (OAi), a term coined by Stevan
||Harnad. The basic business
||model proposed by the OAi is very close to the RePEc project. However
||there is no official formal definition of what an Open Archive
||is. There are two possible approaches.
The Open Archive initiative grew out of e-print initiatives that
|provide archiving of academic documents on the Internet. However the
||basic idea of an Open Archive appears to be much more general. Any archive
||that would store digital objects and that would allow users to poll metadata
||for these objects could be called an Open Archive. This implies that
||the Open Archive is a more general concept than the Open
||Library. Different Open Archives may contain very heterogeneous
||metadata for highly diverse objects. This heterogeneity may make it
||impractical or undesirable to combine data for all archives in an Open
||Library that would require a homogeneous metadata format.
A more restrictive view of an Open Archive may be that it is any
|archive that implements the Santa Fe
||Convention (SFC), as
||set out by the OAi. If an archive is open in the sense that it
||implements the SFC, then it will deliver a minimal metadata set for
||all its objects. All SFC compatible archives supply at least a minimum
||metadata, called the Open Archives Metadata
||Set. Implementing the
||agreement also means identifying each object in the library. This
||implies that the data held on all SFC compliant archives forms an Open
||Library. However that Open Library built on the minimal metadata set
||will hopefully not be the only Open Library to be built on the
||objects held in SFC compatible Open Archives.
4.3: The conceptual challenge raised by the Open Library
The Open Library as defined in Subsection 4.1
|may be a relatively obvious concept. It certainly is not an elaborate
||intellectual edifice. Nevertheless the Open Library idea raises
||some interesting conceptual challenges.
|4.3.1: Supply of information|
To me as an outsider to the Library and Information Studies (LIS) discipline,
|there appears to be a tradition of emphasizing the behavior of the user who
||demands information rather than the publisher--I use the word here in
||its widest sense--who supplies it. I presume this orientation stems from
||the fact that, traditionally, almost all bibliographic data were supplied by
||commercial companies, just like the documents that they describe. Libraries then
||see their rôle as that of intermediaries between the commercial supply and the general public.
||In that scenario libraries take the supply of documents and metadata as given.
The Open Library proposes to build new supply chains for metadata. If
|libraries want to have a share in the Open Library, they will need to
||become more active in the supply of metadata. If all libraries contribute
||metadata about objects that are local to them--what that means would
||have to be defined--then a large Open Library can be built.
An Open Library will only be as good as the data that contributors give to
|it. It is therefore important that research be conducted on what data
||contributors are able to contribute; on how to provide documentation that the
||contributor can understand; and on understanding a contributor's motivation.
|4.3.2: Digital updatability|
For a long time the library profession has purchased material that is
|essentially static. It may be subject to physical decay, but the
||content that it carries is immutable. Digital resources have appeared
||en masse only in the last few years. These resources may be changed
||at any time. The change from static to dynamic resources is a major
||challenge for the LIS profession. Naturally the inclination has been
||to demand that digital resources be like non-digital resources
||in all but their physical medium. The debate on digital preservation is
||a result of that demand. Thus the dynamic nature of digital metadata
||has been seen more as a threat than as an opportunity. The Open
||Library is more concerned with digital updatability than
|4.3.3: Metadata quality control|
In the case of a decentralized dataset, an important problem is to
|maintain metadata quality. Some elements of metadata quality cannot
||be controlled by a computer, but others can. One such element is the
||integrity of handles, another is the use of a controlled vocabulary. A
||metadata system has to be designed that allows for
||syntactical and relational checks on the metadata. ReDIF allows for
||syntactical control, but not as yet for relational control. Highly
||controllable metadata systems are an important research concern that
||is related to the Open Library concept. One way to achieve quality
||control is through the use of relational metadata. Each record has an
||identifier. Records can use the identifiers of other records. It is
||then possible to update elements of the dataset in an independent way.
||It is also quite trivial to check if the handle referenced in one
||record corresponds to a valid handle in the dataset.
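The relational check described above can be illustrated with a minimal Python sketch. It assumes a simplified record layout (a handle, a record kind and a list of referenced handles). The `Record` class, the function names and the handle strings used with them are hypothetical and do not reproduce ReDIF or the RePEc handle syntax.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Record:
    handle: str                     # identifier of this record
    kind: str                       # e.g. "paper", "person", "institution"
    references: List[str] = field(default_factory=list)  # handles of other records


def find_duplicate_handles(records: List[Record]) -> List[str]:
    """Return the handles that are used by more than one record, sorted."""
    seen, duplicates = set(), set()
    for record in records:
        if record.handle in seen:
            duplicates.add(record.handle)
        seen.add(record.handle)
    return sorted(duplicates)


def find_dangling_references(records: List[Record]) -> List[Tuple[str, str]]:
    """Return (referring handle, missing handle) pairs for every
    reference that does not match a handle in the dataset."""
    known = {record.handle for record in records}
    return [
        (record.handle, ref)
        for record in records
        for ref in record.references
        if ref not in known
    ]
```

Because records refer to each other only through handles, a person record can be updated without touching the paper records that cite it, and both checks can be run over the whole dataset after any contribution.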
When Richard Stallman founded the
|GNU project in 1984 to
||write a free operating system to replace Unix, few believed that such an
||operating system would ever come about. My audience today may feel the same
||about my call for an Open Library. But remember
||that by the late 90s the Open
||Source movement had basically realized Stallman's dream.
Building GNU took a long time. But the obstacles
|that the open source movement has faced are much more daunting than
||the obstacles facing the Open Library movement. The
||structural complexity of the operating system of a modern computer is much
||higher than the structural complexity of a metadata
||collection. Computer programming is a highly profitable activity for
||the individual who is capable of doing it, therefore the opportunity cost
||of participating in what is essentially an unpaid activity is much
||higher. These costs are much lower for the academic or the academic
||librarian who would participate in an Open Library construction.
||There is a network effect that arises when the Open Library has reached a
||critical mass. At some stage the cost of providing data is much smaller
||than the benefit--in terms of more efficient
||dissemination--of contributing data. When that stage is reached, the
||Open Library can grow without external public or private subsidy. It
||remains to be seen how many inroads the Open Library concept
||will make. I will dedicate a considerable part of my professional energies
||to advancing the concept. I have registered the domain
||http://openlib.org to serve as a repository for
||Open Library work. Naturally the work of the Open Library will start with
||academic metadata. Openlib.org is therefore sponsoring work on an academic
||metadata format (http://openlib.org/amf).
This paper was presented
||at the University of Michigan
||on 2000-03-24. On 2001-03-05,
||I made cosmetic changes to this document
||as suggested by
||Jeffrey K. MacKie-Mason.
||These suggestions have much improved the readability of the paper
||without updating its contents.