Academic self-documentation: which way
forward for computing, library and information
science?
5 November 2001
Abstract
There has been a lot of discussion about the potential for free access
to scholarly documents on the Internet. At the turn of the century,
there are two major initiatives. These are arXiv, which covers
Physics, Mathematics and Computer Science, and RePEc, which covers
Economics. These initiatives work in very different ways. This paper
is the fruit of collaboration between authors working for both
initiatives. It therefore reflects the perspective of people working
to achieve change, rather than an academic perspective of pure
observation.
We first introduce both arXiv and RePEc, to highlight the basic
dichotomy between centralized and decentralized archives. We then
discuss at some length the history of self-archiving in Computer
Science. We introduce recent developments, the Open Archives
Initiative and the Academic Metadata Format, focusing on social rather
than technical issues. We then discuss how, building on these
developments, the self-archiving situation in Computing and Library
and Information Science can be improved.
1: Introduction
Electronic commerce over the Internet is now commonplace. In
the majority of cases, delivery of the merchandise occurs off-line.
However, purely informational commodities--such as statistical
data or pornographic pictures--can both be contracted upon and
delivered over the Internet. That affords the holders of such
commodities the opportunity to contract directly with customers in a
way that was not possible off-line. The new medium thus provides an
opportunity for disintermediation.
In the academic world, the debate about the possible extent of
disintermediation has concentrated on the rôle of academic
publishing. A large part of academic writing earns its authors no
payment upon publication. For the sake of simplicity, this paper deals exclusively
with scholarly works for which the author receives no payment. These
will be referred to as "research papers" or "papers" for short.
It is further assumed that the advent of the Internet will not change
the reward structure in the academic world. We assume that authors
will still want to write without payment. Their aims will be to have a
wide dissemination of the output and to achieve a mark of peer
recognition.
It has been proposed, most vociferously by
Harnad
(1995) and in many papers since,
that the Internet will lead to free access to academic papers. In
1995 two large-scale initiatives were already well under way to work
on realizing this vision. These were the arXiv archive for Physics,
and the RePEc dataset for Economics. Asian countries are
particular beneficiaries of these initiatives. With budgets that are
often too small to cover the cost of many scholarly journals, many
Asian universities find that access to papers on the Internet is
an invaluable resource. But the corpus of papers is dispersed on the
Internet. This makes free online papers hard to find. Many may be
modified or withdrawn at any time. This makes it hard to cite and
evaluate them. ArXiv and RePEc are the two largest initiatives
combating these problems. This paper is the fruit of collaboration
between authors from both initiatives. We review arXiv in
Section 2 and we examine RePEc in
Section 3. In a brief Section
4 we discuss the distinction between centralized
and decentralized archives. This serves as a background to
Section 5 where we present the history of
self-archiving systems in computer science. The following two sections deal
with recent developments. We examine aspects of the Open
Archives Initiative (OAI) that are
relevant to the author self-archiving problem in Section
6. In Section 7 we
introduce the Academic Metadata Format that is used for documenting
academia in the OAI. In Section 8 we will
make suggestions for further work. Section
9 concludes.
2: The arXiv archive
The arXiv e-print archive is the largest and best-known archive of
author self-archived scholarly literature. It is discipline-based. It
is centralized in the sense
that all submissions and the master database are at
one site. Here we briefly review the history of arXiv and then
discuss its place in the scholarly publishing environment.
The arXiv e-print archive, formerly known as xxx, was started in
August 1991 by Paul Ginsparg at Los Alamos National
Laboratory. The first subject area was
theoretical high-energy physics, `hep-th' and it was initially little
more than an email reflector for paper exchange. In 1992 an ftp
interface was added and the archive expanded to other areas of high
energy physics. Development since then has seen steady expansion of
the subject areas to cover all of physics, mathematics, computer
science and non-linear systems; and the addition of new interfaces and
facilities as new technologies have become available. Significant
developments have included: web interface (December 1993); automatic
PostScript generation from TeX source (June 1995); PDF generation
(April 1996); and web upload (June 1996). Recently, arXiv has been at
the forefront of the Open Archives Initiative (OAI), which we will be
discussing later. During the summer of 2001, arXiv moved from the
Los Alamos National Laboratory to Cornell University.
arXiv serves an estimated 70,000 users in over 100 countries. There
were about 13,000,000 downloads of papers in 2000. There were over
30,000 submissions in 2000. The number of new submissions is growing
approximately linearly, with about 3,500 additional new submissions
each year. The growth in submissions has had little impact on the
administration of the arXiv although there is an increasing reliance
on moderators to assure appropriateness. More than 98% of the
submissions are entirely automated: 68% of them via the web, 27% via
email and 5% via ftp. It is interesting to note that if just 15
minutes were required for each submission then a full-time staff of 7
people would be required to deal with new submissions. Instead, arXiv
uses less than one full-time equivalent to deal with day-to-day
operations.
There are two important factors behind the growth of arXiv. The
high-energy physics community uses the TeX text formatting system
almost exclusively, and this has been very convenient for arXiv. arXiv
does not accept preprocessed TeX submissions. Authors must submit
the source. This allows generation of various types of output
including DVI, PostScript in several flavors, and PDF. Expansion into
other areas of physics means that there are now an increasing number
of non-TeX submissions and this trend is sure to
continue. Unfortunately, many common word-processing packages produce
very inefficient and sometimes low-quality output unless used
expertly. Experience shows that PostScript or PDF submissions require
greater screening effort than TeX submissions. This is an example of
how the physics and mathematics communities differ from other
communities in a way that has favored author self-archiving.
A second factor behind the growth of arXiv is long-term funding. The
Los Alamos site is funded by the US National Science Foundation and
the US Department of Energy. Its 15 mirror sites around the world are
funded independently, the cost is just a few thousand dollars for a
new machine every few years and a small amount of system
administration effort.
arXiv has not been an academic exercise; it was started from inside
the community it aimed to serve. At all stages of expansion to cover a
wider subject area, arXiv has been guided and promoted by members of
the new fields. Nowadays, some conventional publishers tacitly
acknowledge the legitimacy of arXiv by accepting submissions where the
author simply quotes an arXiv identifier. Policies vary on whether
the publisher permits the author to update the arXiv version to
reflect changes made during the refereeing process. However, authors
often ignore any prohibitions. In the longer run, there may emerge a
scenario where authors and journals rely on arXiv to provide a
digital archive. Journals will then simply contain reviews of papers
that are deposited at arXiv.
Geometry and
Topology is an example of such an
"overlay journal". The presence of a central archive and a range of
decentralized overlays will realize free access to fully peer-reviewed
literature.
3: The RePEc database
RePEc is much less well known than arXiv, and it is also less well
understood. There are two reasons for this. First, it is limited to
the economics discipline. Second, its business model is more abstract.
Historically, RePEc grew out of the NetEc project. It was started by
Thomas Krichel in February 1993. In April 1993 he opened a tiny
collection of electronic papers on an ftp server with a gopher
interface operating at Manchester
Computing. However, it was never the business
plan of NetEc to become an electronic working paper archive for the
discipline. Instead the idea was to collect data about printed and
electronic papers as published by various sites so that they could be
searched together on a single site. It was a library of bibliographic
data rather than an archive. The project developed mirror sites in
Japan (July 1996) and in the US (January 1997). In 1997, the RePEc
dataset was created by the NetEc project, and two other projects that
were active in the area, DEGREE
and S-WoPEc. These projects agreed to
exchange metadata in a common, purpose-built format called
ReDIF. The metadata are stored on a file system following a simple
protocol called the Guildford protocol. Harvesting
software is used to collect the data. Shortly after the
implementation of the protocol, several user services appeared that
were built on the data. In 1998, data about Economics institutions was
integrated into the database. In 2000, a personal registration service
was opened. It allows persons to register associations between
themselves and the document and institution data in the database. At the time of
writing, it has over 200 archives that contribute metadata, and ten
different user services operating in seven countries. There are about
55,000 downloadable papers cataloged in RePEc. The NetEc project
received £129,000 in funding from the Joint Information Systems
Committee (JISC) of the
UK Higher Education Funding Councils. It now runs without any
external funding. Running such a large-scale operation with volunteer
power only is a remarkable technical and organizational achievement.
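To give a concrete flavor of the decentralized workflow described above, the sketch below shows how a harvester might read a ReDIF-style template that an archive has placed on its ftp or http server. It is a minimal illustration only: the attribute names, handle and URL in the sample are approximations made up for this sketch rather than verbatim examples from the ReDIF documentation, and real harvesting software handles many more cases, such as multi-line fields and clusters of related attributes.

# Minimal sketch: parse a ReDIF-style attribute/value template into a
# dictionary. The sample text below is an illustrative approximation of
# the format; the handle and URL are made up for this sketch.
sample_template = """\
Template-Type: ReDIF-Paper 1.0
Title: Academic self-documentation
Author-Name: Thomas Krichel
Author-Name: Simeon M. Warner
File-URL: http://archive.example.org/papers/self-doc.pdf
Handle: RePEc:xxx:wpaper:0001
"""

def parse_template(text):
    """Collect attribute/value pairs; repeated attributes become lists."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        attribute, value = line.split(":", 1)
        record.setdefault(attribute.strip(), []).append(value.strip())
    return record

if __name__ == "__main__":
    for attribute, values in parse_template(sample_template).items():
        print(attribute, "=", "; ".join(values))

A simple format of this kind is one reason why a RePEc archive can be maintained without special server software: the metadata sit as plain files that any harvester can fetch and parse.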
RePEc is not only a database of papers in economics, but it also
contains data about economics institutions and academic economists.
The registration of institutions is accomplished through the
EDIRC
project. The acronym stands for "Economics Departments, Institutions
and Research Centers". This dataset has been compiled by Christian
Zimmermann, an
Associate Professor of Economics at Université du Québec à
Montréal, on his own account, as a public
service to the economics profession. The initial intention was to
compile a directory with all economics departments that have a web
presence. Since there are many departments that have a web presence
now, a large number are now registered, about 6,000 of them at the
time of this writing. All these records are included in RePEc. For all
institutions, data on their homepage is available, as well as postal
and telephone information.
EDIRC is mainly linked to the rest of the RePEc data through the
HoPEc personal
registration service. This service can be used by economists to
register themselves as authors of the documents that are contained in
RePEc. To date, 10% of all papers have at least one author who
is a registered person. The HoPEc registrations will in the future be
used to build a collection of papers held on the homepages of
these authors. Already, the registration data are used to link from the
papers of authors to their homepage and for the provision of current
contact information. Clearly, current contact information may be quite
different from the contact information in the bibliographic data.
Thus RePEc is probably the most ambitious project in Internet
scholarly communication to date. The final aim is that every author,
institution and document in economics will be registered in a
database. The project will therefore need constant attention and never be
finished. It has to rely on the volunteer efforts of academics to
supply data. Recent efforts to improve the collection of access and
download statistics across user services have resulted in the
LogEc project. It aims to provide academics
with direct evidence of how well the system disseminates papers. Work
is currently under way to build a citation linking system. This will
allow extensions to HoPEc to collect citation evidence as well,
enabling registered authors to directly access information on which of
their papers have been cited, by whom they have been cited, etc.
From experience, we know that authors are very interested in that type
of information.
4: Centralized versus decentralized archives
For a long time arXiv had virtually a monopoly position in the author
self-archiving world. Its centralized discipline-specific
model--where all papers that are relevant to a certain
discipline are stored on one server--became the only recognized
business model for author self-archiving. Two important points were
completely overlooked at that time.
First, the centralization of arXiv was a gradual process. Before
1994, archives for some new subject areas were started at other
sites. These sites used the same software as arXiv. In November 1994
the data from the remote sites were moved to the central site, and the
remote sites became mirrors. The reason for this reorganization was
the need for stability of organization and access.
arXiv is by far the largest--in terms of freely available
scientific papers--author self-archiving project. While it is
possible that the arXiv model could be successfully applied to all
other disciplines, there are good reasons to question this. There
have been attempts to emulate the success of arXiv by building
discipline-based archives for other disciplines. Two working examples
are CogPrints at the University
of Southampton, since 1996,
and the Economics Working
Paper Archive at Washington University
in St. Louis, since 1993. Neither has grown beyond 1,500
documents. Both initiatives have been run by members of the community
they serve. Therefore they cannot be dismissed as failing because
they are foreign to the discipline and not sensitive to its particular
needs. There is no satisfactory answer as to why these
archives have failed. While there are specific answers that one can
find for each individual attempt, there is not enough commonality for
a simple theory of the failure of centralization that would hold
across many projects.
Our theory of the emergence of centralized or decentralized archive
systems is that it depends on the communication pattern that prevailed prior to the
ubiquitous deployment of, and access to, computer networks. Computer
Science, for example, has a tradition of tech reports. These are
reports of recent research findings that are issued by departments
prior to their formal publication in a peer-reviewed journal. This is
exactly the same channel as the working papers in Economics. By
contrast, much of Physics and Mathematics have a preprint
tradition. In a preprints system,
the tradition is for the author to circulate a
recent paper privately, without the intervention of the
department. Thus it was more natural for them to submit to a
cross-departmental, central, discipline-focused archive. What will
happen in the disciplines that neither have a working paper nor a
preprint tradition is not clear. We advance the hypothesis that
neither the centralized nor the decentralized discipline-specific
systems will find much acceptance. Instead, a cross-disciplinary
institutional archiving strategy may be more appropriate.
5: The situation in Computing, Library and Information Sciences
The dichotomy between centralized and decentralized archives
surfaced in the history of self-archiving in Computing, Library and
Information Science. The computer science (CS) community has
naturally been one of the first to be aware of the availability of the
Internet for scholarly dissemination
purposes. Maly
et al. (1993) note that "Shortly
before Snowbird'92 ... it appears that there are at least thirty
different efforts going on, and at least as many departments would
like to have electronic distribution but as yet do not."
We note that, as predicted from the pre-Internet days, departments,
rather than individual authors, were the movers. Maly
et al. (1993), in the next paragraph, note
that "We quickly realized that having thirty or so different
systems is not necessarily better than having no solution at all."
Clearly the community felt a need for an organized distribution.
The solution that was adopted was based on the WAIS (Wide Area
Information Server) software. A central installation of this software
could build a central index of bibliographic data. Users would use
software implementing standard Internet protocols--note
that we are talking about pre-web days here--to communicate
with the central server. At the departmental level, software called
"techrep" would be used to maintain a bibliographic file that would
be periodically fetched by the central index, and to maintain a
collection of full-text documents. This was the setup used in the
WATERS project. The project was funded by the NSF; the final report
is Maly et al. (1994).
Another project running at roughly the same time was the DARPA-funded
CSTR project. The pioneering contribution of this project was the
development of the Dienst protocol, as initially described by
Davis and Lagoze (1994).
Dienst is a purpose-built protocol for the management of a distributed
digital library. It provides for an open interface for a digital
library. The interface uses the http protocol to communicate with Dienst
servers. Thus any web client can issue Dienst requests to a
server. Dienst servers respond to "verbs" that request a certain
function to be performed. The http request sent to the server
contains the verb and additional attributes to the verb. One important
additional attribute is the version number of the verb. Since Dienst
version 4, each verb has an explicit version number. Individual verbs
can be updated without breaking the protocol as a whole.
For convenience of exposition and housekeeping Dienst verbs are
grouped by "services", i.e. groups of verbs that perform
related functions. Quoting from the protocol document for
version 4, Davis et al. (1998),
Dienst version 4 has the following services:
Repository Service "allows a given document to be stored in many different formats, and provides messages to obtain the document or pieces of the document in any of the stored formats."
Meta Service lists publishers, repositories, index servers, etc.
Index Service "searches a set of descriptions of documents and return handles for those that match."
UI Service
"presents information from the other services in a human readable form."
LibMgt Service allows for the management of documents in the
repository.
Registry Service "provides a persistent database of Dienst users and is required by the subscription service." However the subscription service is
not defined in the protocol document.
Info Messages allow clients to find out information about Dienst servers
and services.
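To illustrate the request pattern just described--a verb, its version number and additional attributes carried in an ordinary http request--the following sketch issues a single request to a Dienst server. The host name, service, verb and attribute names used here are hypothetical placeholders chosen for the illustration only; the authoritative request syntax is specified in Davis et al. (1998).

# Illustrative sketch only: package a Dienst-style verb, its version and
# its attributes into an http GET request. Host, service, verb and
# attribute names are hypothetical placeholders, not the exact Dienst
# syntax (see Davis et al. 1998 for the real specification).
import urllib.parse
import urllib.request

SERVER = "http://dienst.example.edu"  # hypothetical Dienst server

def dienst_request(service, verb, version, **attributes):
    """Send one verb, with its version number and attributes, as an http GET."""
    query = urllib.parse.urlencode(dict(version=version, **attributes))
    url = "%s/Dienst/%s/%s?%s" % (SERVER, service, verb, query)
    with urllib.request.urlopen(url) as response:
        return response.read()

# Example call: ask a hypothetical Index service to search for a term.
# print(dienst_request("Index", "Search", "4.0", terms="metadata"))

The point of the sketch is simply that any http client can play the role of a Dienst client; the sophistication lies in the server, which is precisely where the staffing problems discussed below arose.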
There
is an unversioned rewrite of Dienst in XML available on the NCSTRL
site, but as far as we are aware, there is no software implementing it.
Early Dienst versions used distributed searching, but later versions
introduced index servers for better performance. From a
technical point of view, Dienst is a powerful
tool in that it allows for an open, object-oriented interface that
can be extended locally. For example, a site may provide its own
verbs in the UI service to allow for additional functionality.
However, from an organizational point of view, running a
Dienst server required fairly computer-literate personnel.
Thus although the CSTR project's methods were more elegant and
complete, they required more input from departments than WATERS
did.
In 1995, the CSTR and WATERS projects merged to become the
Networked Computer Science
Technical Report Library, NCSTRL,
pronounced "ancestral". NCSTRL came in two flavors, "lite" and
"standard". Lite sites essentially continued to operate as they
had done in the days of WATERS, with a central site running a Dienst
server on their behalf. Standard sites ran a Dienst server directly. About half the
papers in NCSTRL were in lite sites.
In May 1997 the Association for Computing
Machinery formed a committee to deal with
computing research papers. They decided in favor of a hybrid between
arXiv and NCSTRL called the Computing Research Repository
(CoRR).
The ACM Working
Group on Publication Repositories (1998)
elegantly summarizes the design
choices made for CoRR by the committee; therefore we do not repeat
them here.
The original announcement email for CoRR, from September 1998, said
Researchers have made their papers available by putting them on personal
web pages, departmental pages, and on various ad hoc sites known only to
cognoscenti. Until now, there has been no single repository to which
researchers from the whole field of computing can submit reports.
That is about to change. Through a partnership of ACM, the Los Alamos
e-Print archive, and NCSTRL (Networked Computer Science Technical
Reference Library), an online Computing Research Repository (CoRR) has
been established. The Repository has been integrated into the collection
of over 20,000 computer science research reports and other material
available through NCSTRL and will be linked with the ACM Digital Library.
The Repository is available to all members of the community at no charge.
The statement "this is about to change" has been somewhat premature, if
it is understood to apply to the whole of the paragraph. The promise of
a link with the ACM digital library was of course one of they key promises
of the new archive, associating it with the most important scholarly
society in the field. But that "link" never happened. The same passage
on the home page of CoRR now reads
Computing research relies heavily on the rapid dissemination of
results. As a result, the formal process of submitting papers to
journals has been augmented by other, more rapid, dissemination
methods. Originally these involved printed documents, such as
technical reports and conference papers. Then researchers started
taking advantage of the Internet, putting papers on ftp sites and
later on various web sites. But these resources were fragmented. There
was no single repository to which researchers from the whole field of
computing could submit reports, no single place to search for research
results, and no guarantee that information would be archived at the
end of a research project.
This changed in September 1998. Through a partnership of ACM, the Los
Alamos e-Print archive, and NCSTRL (Networked Computer Science
Technical Reference Library), an online Computing Research Repository
(CoRR) has been established. The CoRR is available to all members of
the community at no charge.
In the meantime, CoRR has not been doing well. In 1998, despite the
short time that it was available, 223 papers were submitted. In 1999
277 papers made it to CoRR, in 2000 there were 444 and for 2001 a
figure of about 500 is expected.
To make matters worse, the user interface
sprinkles these papers across 34 subject categories. Thus each category
gets a rather poor endowment of papers, which adds to the general
impression that the acceptance of the service is low. None of the
plans that the web site describes as "in discussion" have yet been
realized.
In the meantime, the existence of CoRR may have created an expectation
in the Computer Science community that NCSTRL was running out of
steam. Unfortunately, such predictions can be self-fulfilling. At the
time of writing this paper, the NCSTRL homepage has a note
NOTICE: This site is being maintained on a legacy basis and support
for it is minimal. As a result, some of the functionality is
limited. We will be posting a message in the near future about the
future status of this site. Posted 2000-03-05.
The message that was supposed to be forthcoming is not found on the
site, and despite watching the relevant mailing lists closely, the authors
have not found any such announcement elsewhere. It seems that NCSTRL, if not
terminated officially, is in serious crisis.
The failure of NCSTRL is worrying for two reasons. First it is a
failure of work for the Computer Science area. Many digital library
efforts find it very hard to access appropriate computing expertise at
times when such expertise is in high demand by other sectors that have
more financial resources. But such expertise should have been
plentiful in Computer Science departments. Second, the failure comes
after years of development and on a collection of considerable
scale. NCSTRL started as early as 1993 and has about 10,000 papers
available in electronic format. After arXiv and RePEc, this is the
third-largest collection of freely downloadable documents.
We suspect that the failure is not of a technical but of a managerial
nature. As outsiders, we can only speculate about what
the problems are.
One possible explanation is the double nature of NCSTRL. It was
conceived as both a digital document collection and a digital library test
bed. In those circumstances, there is constant experimentation.
To allow several versions of Dienst to run on servers, each
verb in Dienst had a version number, and more recent versions of the
Dienst software could still execute the old verbs. Nevertheless, in
order to install more recent versions of the Dienst software, archives had
to do extra work. NCSTRL could not simply be a service without
research components without compromising the funding position of the
collection. arXiv, by contrast, has continued funding as a service; RePEc has no
funding and is vulnerable because of that.
Another possible reason for failure may have been the NCSTRL
decentralized architecture and sophisticated protocol. People
operating the servers needed to be highly computer literate. To
obtain such staff at reasonable cost, participating CS departments
hired graduate students. The problem with this strategy is that it
took the students some time to learn the job; by the time they were trained,
the next person took over. This implies discontinuities in archive
management. This is also a problem for RePEc, but the technical
architecture of RePEc is much simpler, and therefore the problems are
much smaller. In addition, no special server is needed to
run a RePEc archive; everything can be done "manually" on a file
system that is accessible by ftp and/or http.
Let us summarize. There are more Computer Science
papers available on the Internet
than ever before. There have been some attempts to organize them in
formal digital libraries. These have failed. They were
pieces of technical wizardry but social misfits, i.e. they
were not accepted by the community. Ten years after the
Snowbird'92 conference, there is still no formal author self-archiving
system in computing that has widespread acceptance.
The closest thing that the computer scientists
have to an author self-archiving system is the CiteSeer system. It
collects papers autonomously, applies a few heuristics to see whether a
document looks like a scientific paper, extracts the references, in which it finds
links to further papers, and so on. It is essentially an automated system, a
very powerful one for resource discovery but of limited value as an
alternative scholarly communication outlet.
In Library and Information Sciences the situation is different. Since
2000, there has been a project called ReLIS. It is a RePEc clone. It has a
user interface called
DoIS. It is a
welcome step in the right direction, but the content is still quite
thin. If this project expands to cover Computer Science,
it could become an important
player.
6: The Open Archives Initiative
In the summer of 1999,
Van de
Sompel, Krichel, Nelson et al. (2000)
conducted an experimental study to set up a common search interface to
e-print archives, known as the Universal Preprint Service (UPS)
prototype. They obtained permission from a number of archives to use
their metadata in the prototype. Since then, there have been further
moves towards improving the interoperability of e-print archives.
A call for a
UPS meeting was issued by Paul Ginsparg, Rick Luce and Herbert Van de
Sompel. The motivation was to improve the interoperability of e-print
initiatives. At that time, these were, in order of size, arXiv, RePEc,
NCSTRL, and CogPrints. In addition, the electronic dissertation
network NDLTD, the digital Highwire
Press, the Physics reference service
SLAC-SPIRES, and a few
others were represented at the meeting. An initial proposal was on the
table for a "Santa Fe Convention" that would outline usage and
access conditions to metadata of archives. This proposal was rejected
as being too legalistic. Instead it was decided that the permission to
access should be given indirectly, through the construction of a
technical interface that is designed to provide access to metadata.
This means that a data provider wishing to give access to its metadata to
a service provider would do so by implementing that interface. Thus the
RePEc business model, with its separation
of data providers and service providers, was taken as an implicit
reference model.
The meeting did reach a rough consensus on the Santa Fe
Convention, later published as
Van de Sompel
and Lagoze (2000). The Santa Fe
Convention included a harvesting protocol for metadata, using a subset
of Dienst. It also included a mandatory common metadata set to be
exchanged among compliant e-print
archives. This Santa Fe metadata set was only intended as a minimum;
the convention explicitly welcomed the deployment of other metadata sets.
The UPS initiative was soon renamed the
Open Archives Initiative. While the OAI started as an e-print
interoperability framework, other communities soon showed keen
interest in the protocol. At a meeting of the OAI
technical committee in September 2000, a new protocol
document was outlined that can be used for the interoperability of any
type of digital library. The compatibility of the metadata harvesting
protocol with Dienst was broken. The e-print-specific Santa Fe
metadata set was deprecated. A working party for a new e-print-specific
metadata format was formed. The Academic Metadata Format, as discussed in the
next section, is an output of that working group.
The key deliverable from the OAI is the
Open Archives Protocol for Metadata Harvesting.
It is designed to provide a low barrier to interoperability.
Key features include:
support for multiple metadata formats,
requirement for Dublin Core (DC) metadata as a means of
global interoperability,
use of Hypertext Transfer Protocol (HTTP) transport, and
use of Extensible Markup Language (XML) encoding.
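As a small illustration of how little is needed on the service-provider side, the sketch below harvests Dublin Core records from an OAI data provider by issuing a ListRecords request over http and reading the XML response. The base URL is a hypothetical placeholder, the sketch assumes the conventional dc: prefix for Dublin Core elements, and it ignores error handling and the resumption tokens that the protocol uses to page through large result sets.

# Minimal OAI harvesting sketch: ask a data provider for records in the
# mandatory Dublin Core format and print their titles. The base URL is
# hypothetical; flow control (resumption tokens) and errors are ignored.
import urllib.request
import xml.dom.minidom

BASE_URL = "http://archive.example.org/oai"  # hypothetical data provider

def harvest_titles(base_url=BASE_URL):
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        doc = xml.dom.minidom.parseString(response.read())
    # Assumes the response uses the conventional "dc:" prefix.
    for title in doc.getElementsByTagName("dc:title"):
        if title.firstChild is not None:
            print(title.firstChild.data)

if __name__ == "__main__":
    harvest_titles()

The low barrier is visible here: a few lines of generic http and XML handling are enough to read metadata from any compliant archive, which is what makes the separation of data providers and service providers workable.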
It has been suggested that the OAI framework be used to support
institution-based digital archives. These archives will contain
research results produced in an institution--from all
disciplines--and archived in the library. The ARNO
project (Academic Research
in the Netherlands Online) is a small-scale but pioneering effort to
do just this. It remains to be seen how successful it will be. For
libraries to take on the rôle of archiving material produced within
their own institutions, rather than making available material
produced by intermediaries outside
academia, is a big step. It implies a change in their perceived
rôle.
If this change in the rôle of libraries is achieved, there can be
little doubt that the establishment of institution-based archives is a
better and faster way to free scholarly literature than waiting for
each discipline to embrace author self-archiving in its own
proprietary way. We expect that, whatever the final outcome, there
will be increasing numbers of both institution- and discipline-based
archives in the near term. We think that the degree of
interoperability and cooperation between these archives will strongly
influence their success. In our opinion, the most remarkable
contribution of the OAI to date is not the harvesting protocol
itself but the spirit of collaboration that it has created
between different communities, different archives, and between data
and service
providers. It is clear that this will be beneficial to the providers
of such collections and their users.
7: The Academic Metadata Format
The Academic Metadata Format is described in
Brody et al. (2001).
The primary aim of AMF is to support the free dissemination of
scholarly literature on the Internet.
One approach to increasing the amount of freely available scholarly
literature is to convince more authors to submit to
formal archives, such as arXiv or the RePEc archives. This will only
be successful if the benefits of submitting a paper outweigh the
costs of submission. The key element to maximizing the benefits of
submission is to raise the visibility of the submitted documents.
This can be achieved if the submitted documents appear in many
high-quality user services. The Open Archives Initiative facilitates
inclusion in many user services. However the quality of these services
will crucially depend on the quality of the metadata that is supplied
to them. As soon as a user service does more than full-text indexing,
it will need a large stock of high-quality metadata. AMF supports
this approach by providing a format which can facilitate the
exchange of high-quality metadata.
A second approach is to help improve the dissemination infrastructure
of the homepage publication movement. This necessitates improvement in
the quality of the metadata available through such pages by supporting
simple, author-supplied metadata. These data can then be fed into the
Open Archives system using an intermediate provider. For this or a
similar approach to work, there must be a basic level of metadata that
is simple and intuitive for a non-specialist.
Simplicity is a key design goal for AMF. AMF does not ignore
existing efforts. It adopts appropriate elements of
standard vocabularies that have been proposed, mainly by the
resource discovery metadata community. These are the Dublin Core
Metadata Element Set, DCMI (1999), the Dublin Core
Qualifiers, DCMI
(2000b), the DCMI Type Vocabulary,
DCMI
(2000a), the OpenURL syntax,
Van de Sompel,
Hochstenbach and Beit-Arie (2000),
and vCard, Howes, Smith
and Dawson (1998).
As far as the syntax is concerned, the AMF design is constrained by
the OAI adopting XML as the base format. AMF makes no use of the Resource
Description Framework (RDF) promoted by the World Wide Web
Consortium as set out in
Lassila and Swick
(1999). It is not clear if this is a good
design choice. Implementing RDF would be a bit harder on providers of
metadata, but it would allow the use of software tools that manipulate RDF.
AMF aims to describe the following entity classes:
resources
groups of resources
people
organizations
AMF allows for specification of properties for instances of the entity
classes, and the specification of relations between them. As far as
the general syntax is concerned, AMF borrows from natural
language. The XML elements that represent the fundamental entity
classes are called "nouns". XML elements that give properties to
nouns are called "adjectives". Some of them admit other adjectives
as children, but most of them admit no children. Just as in natural
language, adjectives are used to qualify a noun. To make a
relationship between two nouns, AMF uses "verbs". A verb must have
at least one noun as child element. Just as in natural language, verbs
are used to relate two nouns. An
example is useful here.
<text>
  <title>Academic self-documentation</title>
  <hasauthor>
    <person>
      <name>Thomas Krichel</name>
    </person>
  </hasauthor>
  <hasauthor>
    <person>
      <name>Simeon M. Warner</name>
    </person>
  </hasauthor>
</text>
This example uses the text and person nouns, the hasauthor
verb, and the adjectives title and name. The use of relational
features is further enhanced through the use of identifiers. Each noun
may be given an identifier. It may also refer to an identified noun
as an alternative description. Thus the previous
example--leaving out Thomas for brevity--may be written
as
<text id="mitaka">
<title>Academic self-documentation
</title>
<hasauthor>
<person ref="simeon_warner"/>
</hasauthor>
</text>
<person id="simeon_warner">
<name>Simeon M. Warner</name>
<isauthorof>
<text ref="mitaka"/>
</isauthorof>
</person>
The interesting feature of this example is that the two records may be
maintained by different persons, in different files. The record about
the person may be maintained by the person herself. RePEc already uses
personal records just like those in the example, that are created and
maintained by registrants, see
Barrueco Cruz, Klink
and Krichel (2000). The id and ref
constructs may be used in many other circumstances. For
example, they can be used to add records that classify identified
documents in subject classification schemes.
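The following sketch indicates how a user service might resolve such id/ref links across records that are maintained by different parties. The AMF fragments are taken from the example above; the enclosing wrapper element and the resolution logic are our own illustration and are not prescribed by AMF.

# Sketch: resolve AMF-style id/ref links between separately maintained
# records. The records are those of the example above; the <records>
# wrapper element is an assumption made only for this sketch.
import xml.etree.ElementTree as ET

records = ET.fromstring("""
<records>
  <text id="mitaka">
    <title>Academic self-documentation</title>
    <hasauthor><person ref="simeon_warner"/></hasauthor>
  </text>
  <person id="simeon_warner">
    <name>Simeon M. Warner</name>
    <isauthorof><text ref="mitaka"/></isauthorof>
  </person>
</records>
""")

# Index every noun that carries an identifier.
by_id = {node.get("id"): node for node in records if node.get("id")}

# For each text, follow the hasauthor references to the person records.
for text in records.findall("text"):
    for stub in text.findall("hasauthor/person"):
        person = by_id.get(stub.get("ref"))
        if person is not None:
            print(text.findtext("title"), "is authored by",
                  person.findtext("name"))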
The most important innovation of AMF is not of a technical
nature. Rather, it is the vision behind the descriptive
model. It emphasizes the academic world as a process. Its description
relates resources to non-resources. This is in contrast with
the view of traditional library metadata efforts, which focus exclusively
on resources.
8: Steps forward
Even if nothing more is done, there will be many papers freely
available online. Most of them will be made available through the
homepages of their authors. They may be withdrawn at any time. There
is no bibliographic organization of these papers other than that
provided by machines. Papers in the free layer can be found through
general Web search engines, or through a specialized engine like
ResearchIndex. But, since these
papers are in places where they can be modified by authors, it does
not appear to be possible to base a certification system on these
papers. Therefore, it is likely that a toll-gated layer of quality
controlled, final publication will persist.
This default scenario has been defended by
Arms
(2000). He envisages the co-existence
of an expensive layer of a research library that is powered by humans,
with the extensive quality control of the data, and a free layer that
is essentially computer generated. Author pressure, he speculates,
will make a lot of research papers openly available. But the
bibliographic layer, since this is costly to produce, is not likely to
be free. Some elements of the construction of the bibliographic
interface cannot be fully automated. This concerns, for example, the removal of
duplicates, dealing with moved, changed or merged collections,
maintaining a correct set of author data, etc.
On the other hand, it appears unlikely that Computer Science will go
along with organization in a centralized archive like CoRR or a
successor to it. Such an archive would have several advantages. The quality
control of the papers can be done in overlay services. These overlay
services may contain links to papers on the CoRR site that have been found
fit for inclusion in the overlay service. They may also contain
reviews of the free material. Another important advantage would be
that digital preservation would be much easier. But despite these
advantages, if the community does not bother to use such an archive,
then these advantages cannot be reaped.
All we can really hope
for is an intermediate scenario, with a set of open
archives that overlap and/or cooperate. There are already some
initiatives that could benefit from further collaboration. These
include the
ResearchIndex service
that we have already mentioned, the
DBLP bibliographic index, and the
Collection of Computer Science
Bibliographies. The
holdings of these collections could be combined and duplicates
removed. Exporting such a combined collection through an open archive would
open the possibilities for extended services, such as personal
registration, access logging, and peer-review.
9: Conclusions
Computer geeks have managed to make whole computer operating systems
available on the WWW. At the same time academics have not
been able to make a coherent index to their works available, let alone
organize the free availability of the full text of these works.
Building such a bibliography is--from a
technical point of view--a much simpler task than the distributed
maintenance of a computer operating system, yet it seems to be more
difficult to do for those who are supposed to do it.
For Computing, Library and Information Sciences, this pathetic state of
affairs is particularly embarrassing. These disciplines wish to find
out--among other things--how knowledge is efficiently
managed and disseminated. If their communities do not manage to build
an efficient system for themselves, the methods that they propose will
have little credibility with other communities.
While there is more and more freely accessible academic content on the
Internet, the organization of that content is much less useful than
the organization of content in formal archives and libraries. The
Open Archives Initiative (OAI) has developed protocols that improve on
this state of affairs by permitting interoperability between archives.
AMF is able to encode academic output as a process, rather than a
set of resources. Large collections of AMF data will open the door to
applications in the area of quantitative assessment of
academic work. All this has very exciting potential.
At the same time, recent negative developments like the demise of
NCSTRL and the stagnation of CoRR suggest that new entrants should
come to fill the gap.
To stimulate the provision of content, the collection must be
conceived as an advertisement of the work of its contributors. To do
that, the contributing person or institution must be at the center of
the collection effort. We need a break with the tradition of
libraries. In a traditional library setting, the work is at the center
of the descriptive effort. In an academic self-publishing setting, the
contributor, rather than the works created by the contributor, must be
at the center of the collection effort.
The authors are grateful
for comments on this paper by an anonymous referee. Thomas Krichel is especially grateful for
the hospitality of the Department of Economics
at the University of Surrey, where much of
his contribution to this paper was made.