| Working towards an Open Library for Economics: The RePEc project
|
| 2000-03-24
|
| Abstract
|
| After arXiv.org, the RePEc Economics library offers the
| second-largest source of freely downloadable scientific preprints in the
| world. RePEc has a different business model and a different content
| coverage than arXiv.org. This paper addresses both differences.
| As far as the business model is concerned, RePEc is an instance of a
| concept that I call the "Open Library". An Open Library is open in
| two ways. It is open for contribution (third parties can add to it),
| and it is open for implementation (many user services may be
| created). Conventional libraries--including most digital
| libraries--are closed in both directions.
| As far as the content coverage is concerned, RePEc seeks to build a
| relational dataset about scholarly resources and other aspects of
| reality that are related to these resources. This basically means
| identifying all authors, all papers and all institutions that work in
| economics. Such an ambitious project can only be achieved if the cost
| to collect metadata is decentralized and low, and if the benefits to
| supply metadata are large. The Open Library provides a framework where
| these conditions are fulfilled.
|
|
|
| This paper is
| available in PDF for A4
| paper, and for letter
| size paper.
|
|
|
| 1: Introduction
| Digital or digitisable data is supplied by publishers, to be consumed
| by readers. Reports of research results in research "papers" form
| the bulk of academic digital or digitisable data, and I will refer to
| these as documents in what follows.
| In this paper I am concerned neither with the demand for documents nor
| with the supply of documents. Instead I focus on the
| supply of information about documents. For some documents holding
| detailed information about the document is as good as holding the
| document itself. This is typically the case when the document can be
| accessed on the Internet without any access restriction. Such a
| document will be called a public access document. Collecting metadata
| is therefore particularly relevant for public access documents.
| The main idea that is brought forward in this paper is the "Open
| Library". Basically an Open Library is a collaborative framework for the
| supply and usage of document metadata. Stated in this way the idea of
| the Open Library is quite trivial. To fully appreciate the concept, it
| is useful to study one Open Library in more detail. My example is the
| RePEc dataset about Economics. In
| Section 2 I introduce RePEc as a document
| metadata collection. In Section 3 I push the
| RePEc idea further. I discuss the extension of RePEc that allows one to
| describe the discipline, rather than simply the documents that are
| produced by the members of the discipline. In Section
| 4 I make an attempt to define the Open Library more
| precisely. The example of RePEc demonstrates
| the relevance of the Open Library concept. I conclude the
| paper in Section 5.
| In this article I have deliberately chosen to stay away from many
| technical issues that arise in the context of RePEc and that have to
| be addressed by any Open Library. Some more technical aspects have
| nevertheless been included in the appendices. Appendix
| 1 lists the main RePEc user
| services. Appendix 2 discusses technical
| aspects of the ReDIF metadata format that is used by RePEc.
| The efforts of which RePEc is the result go back to 1992. I
| deliberately stayed away from a description of the history of the work
| to concentrate on the current status. Therefore insufficient
| attribution is given to the people who have earned historic merits by
| contributing to the RePEc effort.
|
| 2: The RePEc document dataset
|
| 2.1: Origin and motivation of RePEc
| A scholarly communication system brings together producers and consumers of
| documents. For the majority of the documents, the producers do not receive a
| monetary reward. Their effort is compensated through a wide circulation of the
| document and a mark of peer approval for it. Dissemination and peer approval
| are the dual functions of the communication system.
| Scholarly communication in economics has largely been
| journal-based. Peer review plays a crucial rôle. Thorough peer
| review is expensive in time. According to
| Trivedi
| (1993), it is common that a paper takes over
| three years from submission to publication in an academic journal, not
| counting rejections. From informal evidence publication delays have
| slowly risen, but stabilized in the past few years as journal editors
| have fought hard to cut down on what have been perceived to be
| intolerable delays.
| Researchers at the cutting edge cannot rely solely on journals to
| keep abreast of the frontier of research. Prepublication through
| discussion papers or conference proceedings is now commonplace. Access to
| this informally disseminated research is often limited to a small number of
| readers. It relies on the good will of active researchers to disseminate
| their work. Since good will is in short supply, insider circles are common.
| This time gap between informal distribution and formal publication can
| only fundamentally be resolved by reforming the quality control
| process. The inconvenience resulting from the delay can however be
| reduced by improving the efficiency of the informal communication
| system. This is the initial motivation behind the RePEc
| project. Therefore its traditional emphasis has been on documents that
| have not gone through peer review channels. These documents are often
| available for public access, and therefore a metadata collection is
| particularly useful.
|
| 2.2: Towards an Internet-based scholarly dissemination system
| The Internet is a cost-effective means for scholarly dissemination. Many
| economics researchers and their institutions have established Web sites.
| However they are not alone in offering pages on the Web. The Web has grown
| to an extent that the standard Internet search engines only cover a
| fraction of the Web, and that fraction is decreasing over time
| (Lawrence and Giles
| 1999). Since much of the Economics
| research uses common terms such as "growth", "investment" or "money",
| it is likely that a subject search on the entire Web would yield an
| enormous amount of hits. There would be no practical way to find which
| pages contain economic research. Due to this low signal to noise ratio, the
| Web per se does not provide an efficient mechanism for scholarly
| dissemination. An additional classifying scheme is required to segregate
| references to materials of interest to the economics profession.
| The most important type of material relevant to scholarly
| dissemination is the research paper. One way to organize this type of
| material has been demonstrated by the
| arXiv.org preprint archive, founded in 1991
| by Paul Ginsparg of the Los Alamos National Laboratory with an initial
| subject area in high energy physics. Authors use that archive to
| upload papers, which remain stored there. ArXiv.org has now assembled
| over 100,000 papers, covering a broad subject range of mathematics,
| physics and computer science, but concentrating on the original
| subject area. An attempt has been made to emulate the arXiv.org
| system in economics with the "Economics Working Paper
| Archive" (EconWPA) based at Washington
| University St. Louis. Its success has been limited. There are a
| number of potential reasons:
|
| Individual economists often do not issue preprints: rather, economics
| departments and research organizations issue working papers.
|
| Economists use a wider variety of document formatting tools than
| physicists. This reduces the functionality of online archiving and makes
| it more difficult to construct a good archive.
|
| In their book of tales economists read that monopolies are bad for
| welfare. Accordingly, they may be suspicious of any effort to centralize
| the output of the discipline on one central system.
|
| Economists are not very computer literate, therefore they will have
| problems with uploading procedures.
|
| Economics starts with the assumption that people are rational and
| selfish. Therefore coordination on a voluntary scheme is difficult to
| achieve.
|
| An ideal system should combine the decentralized nature of
| the Web, the centralized nature of the arXiv.org archive, and a zero
| price to end users in order to reach the maximum number of
| readers. The first two objectives could be accomplished if departments
| and research centers allow for public access to their documents on the
| Internet. But then we need a library to catalog these papers. The
| library would collect what is known as "metadata": data about
| documents that are available using Internet protocols. This would
| however fail to satisfy the third objective of zero cost to users. The
| impediments to creation of that framework are thus not technical but
| economic in nature. There is no incentive for any single institution
| to bear the cost of establishing a comprehensive metadata collection,
| with aspects of a public good, without external subsidy. However,
| since every institution will benefit from participation in such an
| effort, we may solve this incentive problem by creating a virtual
| collection via a network of linked metadata archives. Each institution
| need only maintain its own collection of document metadata using a
| set of standardized templates. This strategy is the basic business
| model of the RePEc system.
|
| 2.3: RePEc as an electronic document exchange
| A scholarly dissemination system on the Internet
| should start by enhancing the pre-Internet practice
| rather than attempting to replace it. The distribution of
| informal research papers in the past has been based on institutions issuing
| working papers. These are circulated through exchange arrangements. RePEc
| is a way to organize this process on the Internet. There are three concepts
| that are crucial to that process: archive, site and service.
| An "archive" is a space on a public access computer system
| which makes document metadata available. It is a place where original data enters the
| system. There is no need to run any software other than an ftp or http
| daemon that makes the files in the archive available upon request. A
| "site" is a collection of archives on the same computer system. It
| usually consists of a local archive augmented by frequently updated
| ("mirrored") copies of remote archives. A "service" is a rendering
| of RePEc data in a form that is available to the end user. I will
| discuss these concepts in turn.
|
| 2.4: Archives
| RePEc allows researchers' departments and research institutes to
| participate in a decentralized archival scheme which makes information
| about the documents that they publish accessible via the Internet.
| Individual researchers may also open archives, but they are encouraged
| to use EconWPA. A contributor places metadata about its documents on
| a public access computer system. This is usually an anonymous ftp
| server or a Web server. This metadata must be in the ReDIF format.
| If a document is available online, a link may be provided to the place
| where the paper may be downloaded. Note that the document may be not
| only the full text of an academic paper, but also an ancillary
| file, e.g. a dataset or a computer program.
| Each participating institution has total control over the contents of
| its archive. There is no need to transmit documents elsewhere. The
| archive management retains the liberty to post revisions or to
| withdraw a document.
| Participation does not imply that the documents are freely
| available. Thus, a number of journals have also permitted their
| contents to be listed in RePEc. If a user's institution has made the
| requisite arrangements with publishers (e.g. JSTOR for back issues of
| Econometrica or Journal of Applied Econometrics), RePEc will contain links
| to directly access the documents.
|
| 2.5: Sites
| A site is a local archive plus copies of a number of remote archives
| that are regularly updated, i.e. mirrored. Everything contained in an
| archive may be mirrored. For example, if a document is in the archive,
| it may be mirrored. If the archive management does not wish the
| document to be mirrored, it can store it outside the archive. The
| advantage of this remote storage is that the archive maintainer will
| get a complete set of access logs to the file. The disadvantage is
| that every request for the file will have to be served from the local
| archive rather than from the RePEc site that the user is accessing.
| An obvious way to organize the mirroring process would be to mirror
| the data of all archives to a central location. This central location
| would in turn be mirrored to the other RePEc sites. The founders of
| RePEc did not adopt that solution, because it would be quite
| vulnerable to mistakes at the central site. Instead each site installs
| the mirroring software and mirrors "on its own", so to speak. Not
| all of them adopt the same frequency of updating. Some may update
| daily, while some may only update weekly. One disadvantage of this
| system is that it is not known how long it takes for a new item to be
| propagated through the system.
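
Although the actual delay is not known, it can be bounded under simple assumptions. The following is a minimal sketch, not a description of the RePEc mirroring software. It assumes that every site mirrors each archive directly at a fixed interval, and that a new item becomes visible at a site at that site's next mirror run. The site names, intervals and start days are hypothetical.

```python
import math


def next_mirror_time(published_at: float, interval_days: float, offset: float = 0.0) -> float:
    """Return the first mirror run at or after `published_at`.

    Assumption: a site mirrors on days offset, offset + interval,
    offset + 2 * interval, and so on.
    """
    if published_at <= offset:
        return offset
    runs_elapsed = math.ceil((published_at - offset) / interval_days)
    return offset + runs_elapsed * interval_days


# Hypothetical sites: name -> (update interval in days, day of first run)
sites = {
    "site_daily": (1.0, 0.0),
    "site_weekly": (7.0, 3.0),
}


def propagation_delays(published_at: float) -> dict:
    """Delay in days before each site holds an item published at `published_at`."""
    return {
        name: next_mirror_time(published_at, interval, offset) - published_at
        for name, (interval, offset) in sites.items()
    }


delays = propagation_delays(published_at=3.5)
for name, delay in delays.items():
    print(f"{name}: available after {delay} days")
```

Under these assumptions the delay at a site never exceeds that site's own update interval, because no site depends on a central copy. The bound can only be stated for the system as a whole if the update intervals of all sites are known.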
|
| 2.6: Services
| Users access RePEc data through user services. Within reason,
| services are free to use all, or only a subset of the RePEc data. For
| example, one service may only show papers that are available
| electronically, another may restrict the choice to act
| as a quality filter. In this way services implement constraints on the
| data, whether they be availability constraints or quality
| constraints. Each service has its own name, i.e. there is no
| "official" RePEc user service. A list of services available at the
| time of writing may be found in Appendix 1.
| The RePEc data may not be sold or incorporated into a product that is
| sold. Therefore all RePEc services are free. User services compete
| through quality rather than price. All RePEc archives benefit from
| simultaneous inclusion in all services. This leads to an efficient
| dissemination that a proprietary system can not afford.
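
The idea that a service is a constraint on a shared dataset can be sketched in a few lines. This is an illustration only: the record structure, the `peer_reviewed` flag and the handles and URLs in the sample data are hypothetical and are not taken from RePEc.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    handle: str
    title: str
    file_url: Optional[str] = None  # None if no online copy is listed
    peer_reviewed: bool = False     # hypothetical quality flag


# Hypothetical shared dataset, as it would be available to every service
dataset = [
    Record("RePEc:aaa:wpaper:0001", "Paper with online full text", "ftp://example.org/0001.pdf"),
    Record("RePEc:aaa:wpaper:0002", "Paper without online full text"),
    Record("RePEc:bbb:journl:0001", "Journal article", "http://example.org/a1.pdf", True),
]


def online_only(records: List[Record]) -> List[Record]:
    """Availability constraint: keep items that list an online copy."""
    return [r for r in records if r.file_url is not None]


def reviewed_only(records: List[Record]) -> List[Record]:
    """Quality constraint: keep items flagged as peer reviewed."""
    return [r for r in records if r.peer_reviewed]


for service in (online_only, reviewed_only):
    print(service.__name__, [r.handle for r in service(dataset)])
```

Both services read the same data and differ only in the subset that they choose to show, which is why an archive that contributes once can appear in all of them.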
|
| 2.7: The documents available through RePEc
| About 160 archives in 25 countries currently participate in RePEc,
| some of them representing several institutions. Over 80 universities
| contribute their working papers, including in the U.S. Berkeley,
| Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State,
| UCLA, and Virginia. The RePEc collection also contains information on
| all NBER Working Papers, the CEPR Discussion Papers, the contents of
| the Fed in Print database of the US Federal Reserve, and complete
| paper series from the IMF, World Bank and OECD, as well as the
| contributions of many other research centers and institutes worldwide.
| Last but not least, RePEc also includes the holdings of EconWPA. In
| total, at the time of writing, over 35,000 items are
| downloadable. My estimate is that at least 20,000 of these
| items have no access restrictions to full-text.
| That is the largest distributed set of freely
| downloadable research papers in the world. Up-to-date
| figures may be obtained from the RePEc Web site.
| The bibliographic templates describing each item currently provide for
| papers, articles, and software components. The article templates are
| used to fully describe published articles, and are currently in use by
| the Canadian Journal of Economics, Econometrica, the Federal Reserve
| Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and
| the RAND Journal of Economics, to name only a few of the participating
| journals. Participation does not imply that the articles are freely
| available.
| The RePEc collection of metadata also contains links to several
| hundred "software components"--functions, procedures, or code
| fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and
| RATS languages, as well as code in FORTRAN, C and Perl. The ability to
| catalog and describe software components affords users of these
| languages the ability to search for code applicable to their
| problem--even if it is written in a different language.
| Software archives that are restricted to one language, such as those
| maintained by individual software vendors or volunteers, do not share
| that breadth. Since many programs in high-level languages may be
| readily translated from, say, GAUSS to MATLAB, this breadth may be
| very welcome to the user.
|
| 3: Relational metadata
| The term "RePEc" is initially an acronym; it stands for Research Papers
| in Economics. In fact the term should now be taken to be a literal,
| because RePEc is about more than the
| description of resources. It is probably best to say that RePEc is a
| relational database about Economics as a "discipline". By
| a discipline, we mean a group of scholars, past and present,
| and the interrelated documents that they produce.
| We have already
| explored the relational aspect in Subsection 3.1.
| In this Subsection we are interested in refining the view of the
| discipline.
| One possible approach to a discipline comes from
| Karlsson and Krichel
| (1999). They have come
| up with a model of the discipline, as consisting essentially of four
| elements arranged in a table:
|
| resource | collection
| person   | institution
|
| A few words may help to understand that table. A "resource" is
| essentially any output of academic activity, be it a research document, a
| dataset, a computer program, i.e., anything that an academic person would
| claim authorship for. A "collection" is a logical grouping of
| resources. For example the act of peer review may be represented by a
| resource being included in a collection. A "person" is a physical person
| or a corporate body who acts as a physical person in the context of RePEc.
| These data collectively form a relational database that not only describes
| papers, but also the authors who write them, the institutions where they
| work, and so on. This leads the profession towards a shared dataset where
| all important aspects of economists' work may be collected.
| A second approach to the concept of a "discipline" is the idea of the
| "authority" as described in the ReDIF documentation. An authority is a
| group of persons that uses ReDIF to document its activities using a common
| identifier space. Therefore disciplines can be viewed as naming
| authorities who ensure that any two distinct elements in the dataset
| have different identifiers.
| From the material that we have covered in the previous section,
| we can draw a simple model of RePEc as
|
| Many archives ===> One dataset ===> Many services
|
| In this section we seek to understand what the dataset actually
| contains. Some of the contents have already been disclosed in
| Subsection 2.7. To understand what other data is
| contained in RePEc we need to dig a little deeper. The data is encoded in
| a format called ReDIF. We introduce this format in the first Subsection.
|
| 3.1: A closer look at the contents
| To understand the basics of ReDIF it is best to start with an
| example. Here is a--carefully selected--piece
| of ReDIF data at ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.rdf. (I suppress
| the Abstract: field to conserve space.)
|
|
| Template-Type: ReDIF-Paper 1.0
| Title: Dynamic Aspect of Growth and Fiscal Policy
| Author-Name: Thomas Krichel
| Author-Person: RePEc:per:1965-06-05:thomas_krichel
| Author-Email: T.Krichel@surrey.ac.uk
| Author-Name: Paul Levine
| Author-Email: P.Levine@surrey.ac.uk
| Author-WorkPlace-Name: University of Surrey
| Classification-JEL: C61; E21; E23; E62; O41
| File-URL: ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.pdf
| File-Format: application/pdf
| Creation-Date: 199603
| Revision-Date: 199711
| Handle: RePEc:sur:surrec:9601
|
|
| When we look at this record, the ReDIF data appears like a
| standard bibliographical format, with authors, title etc. The only thing
| that appears a bit mysterious here is the Author-Person field. This field
| quotes a handle that is known to RePEc. This handle leads to a
| record maintained at
| ftp://netec.mcc.ac.uk/pub/RePEc/per/pers/RePEc_per_1965-06-05_THOMAS_KRICHEL.rdf (I leave out a few
| fields to conserve space.)
|
| Template-Type: ReDIF-Person 1.0
| Name-Full: KRICHEL, THOMAS
| Name-First: THOMAS
| Name-Last: KRICHEL
| Postal: 1 Martyr Court
| 10 Martyr Road
| Guildford GU1 4LF
| England
| Email: t.krichel@surrey.ac.uk
| Homepage: http://openlib.org/home/krichel
| Workplace-Institution: RePEc:edi:desuruk
| Author-Paper: RePEc:sur:surrec:9801
| Author-Paper: RePEc:sur:surrec:9702
| Author-Paper: RePEc:sur:surrec:9601
| Author-Paper: RePEc:rpc:rdfdoc:concepts
| Author-Paper: RePEc:rpc:rdfdoc:ReDIF
| Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL
|
|
| In this record, we have the handles of documents that the person has
| written. This record will allow user services to list all the papers
| by a given author. This is obviously useful when we want to find papers
| that one particular author has written. It is also useful to have a central
| record of the person's contact details. This eliminates the need to update
| the relevant data elements on every document record. In fact the record on the
| paper template may be considered as the historical record that is valid at
| the time when the paper was written, but the address in the
| person template is the one that is currently valid.
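
The link between the paper record and the person record can be followed mechanically. The sketch below is not the official ReDIF parser. It assumes that each field starts at the beginning of a line as "Field-Name: value", and that an indented line continues the previous value. Handles are compared without regard to case, which is also an assumption: it is made here only because the two sample records above spell the person handle with different capitalization. The records are abbreviated versions of the ones shown above.

```python
import re

# A field name followed by a colon at the start of a line (assumed syntax)
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):\s*(.*)$")


def parse_redif(text):
    """Parse one template into an ordered list of (field, value) pairs.

    Repeated fields such as Author-Name are kept as separate pairs.
    """
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = None if line[0].isspace() else FIELD_RE.match(line)
        if match:
            fields.append((match.group(1), match.group(2).strip()))
        elif fields:
            # Continuation line: append it to the previous value
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + line.strip()).strip())
    return fields


def values(fields, name):
    """Return all values of a field, in order of appearance."""
    return [v for n, v in fields if n.lower() == name.lower()]


paper = parse_redif("""\
Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal Policy
Author-Name: Thomas Krichel
Author-Person: RePEc:per:1965-06-05:thomas_krichel
Author-Name: Paul Levine
Handle: RePEc:sur:surrec:9601
""")

person = parse_redif("""\
Template-Type: ReDIF-Person 1.0
Name-Full: KRICHEL, THOMAS
Postal: 1 Martyr Court
 10 Martyr Road
Author-Paper: RePEc:sur:surrec:9801
Author-Paper: RePEc:sur:surrec:9601
Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL
""")

# Index all records by handle (lower-cased, see the assumption above)
index = {values(rec, "Handle")[0].lower(): rec for rec in (paper, person)}

author_handle = values(paper, "Author-Person")[0]
author_record = index[author_handle.lower()]
papers_by_author = values(author_record, "Author-Paper")
print(values(author_record, "Name-Full"), papers_by_author)
```

A user service could use the same index in the other direction, starting from a person record and looking up each Author-Paper handle to build a list of that person's papers.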
| In the person template, we find another RePEc identifier in the
| "Workplace-Institution" field. This points to another record
| at ftp://crefe.dse.uqam.ca/pub/RePEc/edi/inst/desuruk.rdf
| that describes the institution
|
| Template-Type: ReDIF-Institution 1.0
| Primary-Name: University of Surrey
| Primary-Location: Guildford
| Secondary-Name: Department of Economics
| Secondary-Phone: (01483) 259380
| Secondary-Email: economics@surrey.ac.uk
| Secondary-Fax: (01483) 259548
| Secondary-Postal: Guildford, Surrey GU2 5XH
| Secondary-Homepage: http://www.econ.surrey.ac.uk/
| Handle: RePEc:edi:desuruk
|
|
|
| It would take us too far here to discuss this record in more
| detail. It is probably more interesting to know where these records come from.
| This is what I address in Subsection 3.2 and
| Subsection 3.3 for the institutional and the personal
| data, respectively.
|
| 3.2: Institutional registration
| The registration of institutions is accomplished through the EDIRC project. The
| acronym stands for "Economics Departments, Institutions and Research
| Centers". This dataset has been compiled by Christian
| Zimmermann, an Associate
| Professor of Economics at Université du Québec à
| Montréal on his own account, as a public service to
| the economics profession. The initial intention was to compile a directory
| with all economics departments that have a Web presence. Since there are
| many departments that have a Web presence now, a large number are now
| registered, about 5,000 of them at the time of this
| writing. All these records are
| included in RePEc. For all institutions, data on their homepage is
| available, as well as postal and telephone information. For some, there is
| even data on their main area of work. Thus it is possible to find a list of
| institutions where--for example--a lot of work in labor
| economics is being done. At the moment, EDIRC is mainly linked to the rest
| of the RePEc data through the HoPEc personal registration service. Other
| links are possible, but are rarely used.
|
| 3.3: Personal registration
| HoPEc has a different
| organization from EDIRC. It is impossible for a single academic to register
| all persons who are active in economics. One possible approach would be to
| ask archives to register people who work at their institution. This will
| make archive maintainers' work more complicated, but the overall
| maintenance effort will be smaller once all authors are registered.
| However, authors move between archives, and many have work that appears in
| different archives. To date there is no satisfactory way to
| deal with moving authors. Therefore the author registration
| is carried out using a centralized system.
| A person who is registered with HoPEc is identified by a string that
| is usually close to the person's name, and by a date that is
| significant to the registrant. HoPEc suggests the birthday but any
| other date will do as long as the person can remember it. When
| registrants work with the service, they first supply some personal
| information. The data that is requested is mainly the name, the URL of
| the registrant's homepage and the email address. Registrants are free
| to enter data about their academic interests--using the
| Journal of Economic Literature Classification
| Scheme--and the
| EDIRC handle of their primary affiliation.
| When the registrant has entered this data, the second step is to
| create associations between the record of the registrant and the
| document data that is contained in RePEc. The most common association
| is the authorship of a paper. However, other associations are
| possible, like for example the editorship of a series. The
| registration service then looks up the name of the registrant in the
| RePEc document database. The registrant can then decide which
| potential associations are relevant. The authentication methods are
| weak. HoPEc relies on honesty.
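
The lookup step can be illustrated with a small sketch. How HoPEc actually matches names is not described here, so the matching rule below (equal sets of lower-cased name tokens) is purely an assumption, and two of the three document records are hypothetical.

```python
import re


def name_key(name: str) -> frozenset:
    """Reduce a name to a case-insensitive set of word tokens.

    "KRICHEL, THOMAS" and "Thomas Krichel" yield the same key.
    Only unaccented letters are kept, which is a simplification.
    """
    return frozenset(re.findall(r"[a-z]+", name.lower()))


documents = [
    {"handle": "RePEc:sur:surrec:9601", "authors": ["Thomas Krichel", "Paul Levine"]},
    {"handle": "RePEc:xxx:wpaper:0001", "authors": ["T. Krichel"]},   # hypothetical
    {"handle": "RePEc:xxx:wpaper:0002", "authors": ["Paul Levine"]},  # hypothetical
]


def candidate_associations(registrant_name: str, docs: list) -> list:
    """Return handles of documents with an author name matching the registrant."""
    key = name_key(registrant_name)
    return [
        doc["handle"]
        for doc in docs
        if any(name_key(author) == key for author in doc["authors"])
    ]


candidates = candidate_associations("KRICHEL, THOMAS", documents)
print(candidates)
```

A strict rule like this one misses the record that gives only an initial, while a looser rule would propose papers by other people with similar names. Either way the result is only a list of candidates, and the registrant decides which associations are recorded.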
| There are several significant problems that a service like HoPEc
| faces. First, since there is no historical precedent for such a service, it
| is not easy to communicate the raison d'être of the service to a
| potential registrant. Second, some people think that they need to register in
| order to use RePEc services. While this delivers valuable information about
| who is interested in using RePEc services--or more precisely who is
| too dumb to grasp that these services do not require
| registration--it clutters the database with records of limited
| usefulness. Last but by no means least, there are all kinds of privacy
| issues involved in the composition of such a dataset. For example, Sune
| Karlsson has informed me that setting up
| a database such as HoPEc would be illegal in Sweden.
| To summarize, HoPEc provides information about persons' identity,
| affiliation and research interests, and links these data with resource
| descriptions in RePEc. This makes it possible to identify persons and update
| their metadata in a timely and cost efficient way. These data could also
| fruitfully be employed for other purposes, such as maintaining membership
| data for scholarly societies, or lists of conference participants. It is
| hoped that the HoPEc data will be used in that way in the future as a
| shared pool of common personal data.
|
| 3.4: The nature of RePEc revisited
| The EDIRC and HoPEc projects have a dual rôle. They both provide
| data to the RePEc dataset and they provide an interface for
| users. Thus they are both at odds with the three-layer model I
| outlined in the opening of the section. In fact the HoPEc service is
| more at odds with the model because it proposes adding data through a
| user service.
|
| 4: The Open Library
| This section of the paper is somewhat more theoretical. It sets out a body
| of thought that is built on the experience of RePEc. It is an attempt to
| find a general theory that could apply in a wide set of circumstances in
| which
| similar systems are desirable. I call this general concept "Open
| Library". The parallel to the "Open Source" concept is obvious and
| intentional. In more ways than one, the Open Library intends to be to
| bibliographic data--a term that should be taken here in the widest
| sense--what the Open Source is to computer code. I will
| come back to the parallel between Open Source and Open Library in the
| concluding Section 5.
|
| 4.1: The definition of the Open Library
| The Open Library is based on a view of the world where any object that one
| is potentially interested in can be conceptually divided into its body and
| its data. The body is either the physical manifestation of the object
| or--in cases where such a manifestation does not exist or is
| irrelevant--it is the part of the object that is not open to non-exclusive use.
| By non-exclusive use I mean simultaneous use by any number of people.
| Consider as an example Thomas Krichel. Then the body is his physical body,
| and the data are the written works that he leaves open to be seen, the way
| he looks, what he says. If we look at a more famous economist say David
| Ricardo, then his body is not important, because he is dead. However he
| lives on through his data. As a final example, take the Journal of Monetary
| Economics, published by Elsevier. In that example, there is no physical
| manifestation but rather, the body consists of the papers in the journal
| itself. The data are the fact that a journal exists, that it has some famous
| economist as its editor, and maybe the titles of papers etc.
| The data aspect of an object has a public-domain characteristic that the
| body of the object does not have. The data aspect can therefore be
| summarized. Such a summary of the data aspect of an object is usually
| called metadata. Although the distinction between metadata and data is
| sometimes arbitrary, it remains a useful concept here.
| An Open Library can then be defined as follows. An Open Library is a
| collection of metadata records about the data aspect of interesting
| objects, that has the following characteristics
|
| Every record is identified by a unique handle.
|
| This requirement distinguishes the library from an archive. It
| allows for every record to be addressed in an unambiguous way. This is
| important if links between records are to be established.
|
| Records have a homogeneous syntax of field names and field values.
|
| This requirement constrains the Open Library to appear like a
| database. If this requirement were not present, all public access
| pages on the WWW would form an Open Library. Note that this
| requirement does not constrain the Open Library to contain a
| homogeneous record format.
|
| The documentation of the record format is available for
| online public access.
|
| For example, a collection encoded in MARC format
| would not qualify as an Open Library because access to the documentation of
| MARC is restricted.
| Without this requirement the cost of acquiring the documentation would be an
| obstacle to participation.
|
| The collection is accessible on a public access computer system.
|
| This is the precondition to allow for the construction of user services.
| Note that user services may not necessarily be open to public access.
|
| Contributions to the collection may be made without monetary cost.
|
| There are of course non-monetary costs to contribute to the Open
| Library. However the general principle is that there is no need to pay
| for either contributing or using the library. The copyright status of
| data in an Open Library should be subject to further research.
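
The first two characteristics are the ones that can be checked mechanically. The following sketch is one possible check and makes several assumptions: a record is given as a list of "Field-Name: value" lines, the handle is stored in a field called Handle as in the ReDIF examples above, and handles are compared as exact strings. The third record in the sample data is hypothetical.

```python
import re
from collections import Counter

# Assumed homogeneous syntax: a field name, a colon, then a non-empty value
FIELD_LINE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*:\s*\S")


def check_collection(records):
    """Return a list of problems found; an empty list means both checks pass."""
    problems = []
    handles = []
    for i, lines in enumerate(records):
        malformed = [ln for ln in lines if not FIELD_LINE.match(ln)]
        if malformed:
            problems.append(f"record {i}: malformed line {malformed[0]!r}")
        own = [ln.split(":", 1)[1].strip() for ln in lines if ln.lower().startswith("handle:")]
        if len(own) != 1:
            problems.append(f"record {i}: expected one handle, found {len(own)}")
        handles.extend(own)
    for handle, count in Counter(handles).items():
        if count > 1:
            problems.append(f"handle {handle} used by {count} records")
    return problems


good = [
    ["Template-Type: ReDIF-Paper 1.0", "Title: A paper", "Handle: RePEc:sur:surrec:9601"],
    ["Template-Type: ReDIF-Institution 1.0", "Primary-Name: University of Surrey",
     "Handle: RePEc:edi:desuruk"],
]
# Hypothetical faulty record: one line without a field name, and a reused handle
bad = good + [["Template-Type: ReDIF-Paper 1.0", "no field name here",
               "Handle: RePEc:sur:surrec:9601"]]

print(check_collection(good))
print(check_collection(bad))
```

The first collection mixes a paper record and an institution record and still passes, which matches the remark above that a homogeneous syntax does not imply a homogeneous record format.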
|
| 4.2: The Open Library and the Open Archive
| Stimulated by work of
| Van de Sompel, Krichel,
| Nelson et al. (2000), there have been recent
| moves towards improving the interoperability of e-print archives such
| as arXiv.org, NCSTRL, RePEc etc. This
| work is now called the
| Open Archive Initiative (OAi), a term coined by Stevan
| Harnad. The basic business
| model proposed by the OAi is very close to the RePEc project. However
| there is no official formal definition of what an Open Archive
| is. There are two possible approaches.
| The Open Archive initiative grew out of e-print initiatives that
| provide archiving of academic documents on the Internet. However, the
| basic idea of an Open Archive appears to be much more general. Any archive
| that would store digital objects and that would allow users to poll metadata
| for these objects could be called an Open Archive. This implies that
| the Open Archive is a more general concept than the Open
| Library. Different Open Archives may contain very heterogeneous
| metadata for highly diverse objects. This heterogeneity may make it
| impractical or undesirable to combine data from all archives in an Open
| Library that would require a homogeneous metadata format.
| A more restrictive view of an Open Archive may be that it is any
| archive that implements the Santa Fe
| Convention (SFC), as
| set out by the OAi. If an archive is open in the sense that it
| implements the SFC, then it will deliver a minimal metadata set for
| all its objects, called the Open Archives Metadata
| Set. Implementing the
| agreement also means identifying each object in the library. This
| implies that the data held on all SFC-compliant archives form an Open
| Library. However, that Open Library built on the minimal metadata set
| will hopefully not be the only Open Library to be built on the
| objects held in SFC-compatible Open Archives.
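|
| The step from heterogeneous archives to one Open Library can be
| sketched as a projection of each archive's records onto a shared
| minimal field set. The sketch below rests on assumptions: the three
| field names in MINIMAL_FIELDS are placeholders and not the actual
| elements of the Open Archives Metadata Set, and the two sample archives
| and their local field names are invented for illustration.
|

```python
# Placeholder minimal field set; NOT the real Open Archives Metadata Set.
MINIMAL_FIELDS = ("identifier", "title", "creator")


def to_minimal(record, mapping):
    """Project an archive-specific record onto the shared minimal field set.

    `mapping` names the local field that supplies each minimal field.
    """
    minimal = {}
    for name in MINIMAL_FIELDS:
        local_name = mapping[name]
        if local_name not in record:
            raise KeyError(f"record lacks {local_name!r}, needed for {name!r}")
        minimal[name] = record[local_name]
    return minimal


def build_union(archives):
    """Combine the minimal records of several archives, keyed by identifier."""
    union = {}
    for records, mapping in archives:
        for record in records:
            minimal = to_minimal(record, mapping)
            union[minimal["identifier"]] = minimal
    return union


# Two hypothetical archives with different local record formats.
archive_a = (
    [{"id": "archiveA:0001", "title": "First sample",
      "authors": "A. Author", "subject-class": "x"}],
    {"identifier": "id", "title": "title", "creator": "authors"},
)
archive_b = (
    [{"handle": "archiveB:0001", "name": "Second sample",
      "maker": "B. Author", "duration": "12 min"}],
    {"identifier": "handle", "title": "name", "creator": "maker"},
)

union = build_union([archive_a, archive_b])
```

| Fields outside the minimal set, such as the subject class or the
| duration above, are dropped by the projection. That loss is why a
| richer Open Library built on the same objects remains desirable.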
|
| 4.3: The conceptual challenge raised by the Open Library
| The Open Library as defined in Subsection 4.1
| may be a relatively obvious concept. It certainly is not an elaborate
| intellectual edifice. Nevertheless the Open Library idea raises
| some interesting conceptual challenges.
|
| 4.3.1: Supply of information
| To me as an outsider to the Library and Information Studies (LIS) discipline,
| there appears to be a tradition of emphasizing the behavior of the user who
| demands information rather than the publisher--I use the word here in
| its widest sense--who supplies it. I presume this orientation comes from
| the tradition that almost all bibliographic data were supplied by commercial
| companies, as were the documents that they describe. Libraries then see their
| rôle as intermediaries between the commercial supply and the general public.
| In that scenario libraries take the supply of documents and metadata as given.
| The Open Library proposes to build new supply chains for metadata. If
| libraries want to have a share in the Open Library, they will need to
| become more active in the supply of metadata. If all libraries contribute
| metadata about objects that are local to them--what that means would
| have to be defined--then a large Open Library can be built.
| An Open Library will only be as good as the data that contributors give to
| it. It is therefore important that research be conducted on what data
| contributors are able to contribute; on how to provide documentation that the
| contributor can understand; and on understanding a contributor's motivation.
|
| 4.3.2: Digital updatability
| For a long time the library profession has purchased material that is
| essentially static. It may be subject to physical decay but the
| information that it contains is immutable. Digital resources made their
| mass appearance only a few years ago. These resources may be changed
| at any time. The change from static to dynamic resources is a major
| challenge for the LIS profession. Naturally the inclination has been
| to demand that digital resources be like non-digital resources
| in all but their physical medium. The debate on digital preservation is
| a result of that demand. Thus the dynamic nature of digital metadata
| has been seen more as a threat than as an opportunity. The Open
| Library is more concerned with digital updatability than with
| digital preservation.
|
|
| 4.3.3: Metadata quality control
| In the case of a decentralized dataset, an important problem is to
| maintain metadata quality. Some elements of metadata quality cannot
| be controlled by a computer, but others can. One such element is the
| integrity of handles, another is the use of controlled vocabulary. A
| metadata system has to be designed that allows for
| syntactical and relational checks on the metadata. ReDIF allows for
| syntactical control, but not as yet for relational control. Highly
| controllable metadata systems are an important research concern that
| is related to the Open Library concept. One way to achieve quality
| control is through the use of relational metadata. Each record has an
| identifier. Records can use the identifiers of other records. It is
| then possible to update elements of the dataset in an independent way.
| It is also straightforward to check whether the handle referenced in one
| record corresponds to a valid handle in the dataset.
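|
| The handle check described above can be sketched in a few lines. This
| is an illustration under stated assumptions, not ReDIF tooling: the
| dataset is assumed to be a mapping from handle to fields, and the
| caller is assumed to know which field names (here the hypothetical
| "author" and "workplace") hold handles of other records.
|

```python
def dangling_references(records, reference_fields):
    """Return (handle, field, target) triples whose target handle is unknown.

    `records` maps each handle to a dict of field name -> list of values.
    `reference_fields` names the fields whose values are handles.
    """
    known = set(records)
    problems = []
    for handle, fields in records.items():
        for name in reference_fields:
            for target in fields.get(name, []):
                if target not in known:
                    problems.append((handle, name, target))
    return problems


# Hypothetical dataset: the person record points at an institution
# record that nobody has contributed yet.
records = {
    "example:paper:0001": {
        "title": ["A sample paper"],
        "author": ["example:person:0001"],
    },
    "example:person:0001": {
        "name": ["A. N. Author"],
        "workplace": ["example:inst:0001"],
    },
}

problems = dangling_references(records, reference_fields=("author", "workplace"))

# A third party later contributes the institution record on its own;
# neither the paper nor the person record has to change.
records["example:inst:0001"] = {"name": ["Sample Institute"]}
resolved = dangling_references(records, reference_fields=("author", "workplace"))
```

| In this sketch the first check reports the missing institution handle
| and the second check reports nothing. Adding the institution record is
| enough to repair the dataset, which illustrates the independent
| updating of elements that relational metadata makes possible.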
|
| 5: Conclusions
| When Richard Stallman founded the
| GNU project in 1984 to
| write a free operating system to replace Unix, few believed that such an
| operating system would ever come about. The same may hold for my audience
| today, when I am calling for an Open Library. But remember
| that in the late 1990s the Open
| Source movement basically realized Stallman's dream.
| Building GNU took a long time. But the obstacles
| that the open source movement has faced are much more daunting than
| the obstacles facing the Open Library movement. First, the
| structural complexity of the operating system of a modern computer is much
| higher than the structural complexity of a metadata
| collection. Second, computer programming is a highly profitable activity for
| the individual who is capable of doing it, so the opportunity cost
| of participating in what is essentially an unpaid activity is
| high. These costs are much lower for the academic or the academic
| librarian who would participate in the construction of an Open Library.
| There is a network effect that arises when the Open Library has reached a
| critical mass. At some stage the cost of providing data becomes much smaller
| than the benefit--in terms of more efficient
| dissemination--of contributing data. When that stage is reached, the
| Open Library can grow without external public or private subsidy. It
| remains to be seen how much headway the Open Library concept
| will make. I will dedicate a considerable part of my professional energies
| towards advancing the concept. I have registered the domain
| http://openlib.org to serve as a repository for
| Open Library work. Naturally the work of the Open Library will start with
| academic metadata. Openlib.org is therefore sponsoring work on an academic
| metadata format (http://openlib.org/amf).
|
|
| This paper was presented at the
| PEAK conference
| at the University of Michigan
| on 2000-03-24. On 2001-03-05,
| I made cosmetic changes to this document
| as suggested by
| Jeffrey K. MacKie-Mason.
| These suggestions have much improved the readability of the paper
| without updating its contents.
|
|
| Appendix
|
|
|
|
|
|