Academic self-documentation: which way
forward for computing, library and information
science?
5 November 2001
Abstract
There has been a lot of discussion about the potential for free access
to scholarly documents on the Internet. At the turn of the century,
there are two major initiatives. These are arXiv, which covers
Physics, Mathematics and Computer Science, and RePEc, which covers
Economics. These initiatives work in very different ways. This paper
is the fruit of collaboration between authors working for both
initiatives. It therefore reflects the perspective of people working
to achieve change, rather than an academic perspective of pure
observation.
We first introduce both arXiv and RePEc, to highlight the basic
dichotomy between centralized and decentralized archives. We then
discuss at some length the history of self-archiving in Computer
Science. We introduce recent developments, the Open Archives
Initiative and the Academic Metadata Format, focusing on social rather
than technical issues. We then discuss how, building on these
developments, the self-archiving situation in Computing and Library
and Information Science can be improved.
1: Introduction
Electronic commerce over the Internet is now commonplace. In
the majority of cases, delivery of the merchandise occurs off-line.
However, purely informational commodities--such as statistical
data or pornographic pictures--can both be contracted upon and
delivered over the Internet. That affords the holders of such
commodities the opportunity to contract directly with customers in a
way that was not possible off-line. The new medium thus provides an
opportunity for disintermediation.
In the academic world, the debate about the possible extent of
disintermediation has concentrated on the rôle of academic
publishing. A large part of academic writing earns its authors no
payment upon publication. For the sake of simplicity, this paper deals exclusively
with scholarly works for which the author receives no payment. These
will be referred to as "research papers" or "papers" for short.
It is further assumed that the advent of the Internet will not change
the reward structure in the academic world. We assume that authors
will still want to write without payment. Their aims will be to have a
wide dissemination of the output and to achieve a mark of peer
recognition.
It has been proposed, most vociferously by
Harnad
(1995) and in many papers since,
that the Internet will lead to free access to academic papers. In
1995 two large-scale initiatives were already well under way to work
on realizing this vision. These were the arXiv archive for Physics,
and the RePEc dataset for Economics. Asian countries are
particular beneficiaries of these initiatives. With budgets that are
often too small to cover the cost of many scholarly journals, many
Asian universities find that access to papers on the Internet is
an invaluable resource. But the corpus of papers is dispersed on the
Internet. This makes free online papers hard to find. Many may be
modified or withdrawn at any time. This makes it hard to cite and
evaluate them. ArXiv and RePEc are the two largest initiatives
combating these problems. This paper is the fruit of collaboration
between authors from both initiatives. We review arXiv in
Section 2 and we examine RePEc in
Section 3. In a brief Section
4 we discuss the distinction between centralized
and decentralized archives. This serves as a background to
Section 5 where we present the history of
self-archiving systems in computer science. The following two sections deal
with recent developments. We examine aspects of the Open
Archives Initiative (OAI) that are
relevant to the author self-archiving problem in Section
6. In Section 7 we
introduce the Academic Metadata Format that is used for documenting
academia in the OAI. In Section 8 we will
make suggestions for further work. Section
9 concludes.
2: The arXiv archive
The arXiv e-print archive is the largest and best-known archive of
author self-archived scholarly literature. It is discipline-based. It
is centralized in the sense
that all submissions and the master database are at
one site. Here we briefly review the history of arXiv and then
discuss its place in the scholarly publishing environment.
The arXiv e-print archive, formerly known as xxx, was started in
August 1991 by Paul Ginsparg at Los Alamos National
Laboratory. The first subject area was
theoretical high-energy physics, `hep-th' and it was initially little
more than an email reflector for paper exchange. In 1992 an ftp
interface was added and the archive expanded to other areas of high
energy physics. Development since then has seen steady expansion of
the subject areas to cover all of physics, mathematics, computer
science and non-linear systems; and the addition of new interfaces and
facilities as new technologies have become available. Significant
developments have included: web interface (December 1993); automatic
PostScript generation from TeX source (June 1995); PDF generation
(April 1996); and web upload (June 1996). Recently, arXiv has been at
the forefront of the Open Archives Initiative (OAI), which we will be
discussing later. During the summer of 2001, arXiv moved from the
Los Alamos National Laboratory to Cornell University.
arXiv serves an estimated 70,000 users in over 100 countries. There
were about 13,000,000 downloads of papers in 2000. There were over
30,000 submissions in 2000. The number of new submissions is growing
approximately linearly, with about 3,500 additional new submissions
each year. The growth in submissions has had little impact on the
administration of the arXiv although there is an increasing reliance
on moderators to assure appropriateness. More than 98% of the
submissions are entirely automated: 68% of them via the web, 27% via
email and 5% via ftp. It is interesting to note that if just 15
minutes were required for each submission then a full-time staff of 7
people would be required to deal with new submissions. Instead, arXiv
uses less than one full-time equivalent to deal with day-to-day
operations.
There are two important factors behind the growth of arXiv. The
high-energy physics community uses the TeX text formatting system
almost exclusively, and this has been very convenient for arXiv. arXiv
does not accept preprocessed TeX submissions. Authors must submit
the source. This allows generation of various types of output
including DVI, PostScript in several flavors, and PDF. Expansion into
other areas of physics means that there are now an increasing number
of non-TeX submissions and this trend is sure to
continue. Unfortunately, many common word-processing packages produce
very inefficient and sometimes low-quality output unless used
expertly. Experience shows that PostScript or PDF submissions require
greater screening effort than TeX submissions. This is an example of
how the physics and mathematics communities differ from other
communities in a way that has favored author self-archiving.
A second factor behind the growth of arXiv is long-term funding. The
Los Alamos site is funded by the US National Science Foundation and
the US Department of Energy. Its 15 mirror sites around the world are
funded independently, the cost is just a few thousand dollars for a
new machine every few years and a small amount of system
administration effort.
arXiv has not been an academic exercise; it was started from inside
the community it aimed to serve. At all stages of expansion to cover a
wider subject area, arXiv has been guided and promoted by members of
the new fields. Nowadays, some conventional publishers tacitly
acknowledge the legitimacy of arXiv by accepting submissions where the
author simply quotes an arXiv identifier. Policies vary on whether
the publisher permits the author to update the arXiv version to
reflect changes made during the refereeing process. However, authors
often ignore any prohibitions. In the longer run, there may emerge a
scenario where authors and journals rely on arXiv to provide a
digital archive. Journals will then simply contain reviews of papers
that are deposited at arXiv.
Geometry and
Topology is an example of such an
"overlay journal". The presence of a central archive and a range of
decentralized overlays will realize free access to fully peer-reviewed
literature.
3: The RePEc database
RePEc is much less well known than arXiv, and it is also less well
understood. There are two reasons for this. First, it is limited to
the economics discipline. Second, its business model is more abstract.
Historically, RePEc grew out of the NetEc project. It was started by
Thomas Krichel in February 1993. In April 1993 he opened a tiny
collection of electronic papers on an ftp server with a gopher
interface operating at Manchester
Computing. However, it was never the business
plan of NetEc to become an electronic working paper archive for the
discipline. Instead the idea was to collect data about printed and
electronic papers as published by various sites so that they could be
searched together on a single site. It was a library of bibliographic
data rather than an archive. The project developed mirror sites in
Japan (July 1996) and in the US (January 1997). In 1997, the RePEc
dataset was created by the NetEc project, and two other projects that
were active in the area, DEGREE
and S-WoPEc. These projects agreed to
exchange metadata in a common, purpose-built format called
ReDIF. The metadata are stored on a file system following a simple
protocol called the Guildford protocol. Harvesting
software is used to collect the data. Shortly after the
implementation of the protocol, several user services appeared that
were built on the data. In 1998, data about Economics institutions was
integrated into the database. In 2000, a personal registration service
was opened. It allows persons to register associations between
themselves and the document and institution data in the database. At the time of
writing, it has over 200 archives that contribute metadata, and ten
different user services operating in seven countries. There are about
55,000 downloadable papers cataloged in RePEc. The NetEc project
received £129,000 in funding from the Joint Information Systems
Committee (JISC) of the
UK Higher Education Funding Councils. It now runs without any
external funding. Running such a large-scale operation with volunteer
power only is a remarkable technical and organizational achievement.
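To give a concrete flavor of the decentralized workflow described above, the sketch below shows how a harvester might read a ReDIF-style template that an archive has placed on its ftp or http server. It is a minimal illustration only: the attribute names, handle and URL in the sample are approximations made up for this sketch rather than verbatim examples from the ReDIF documentation, and real harvesting software handles many more cases, such as multi-line fields and clusters of related attributes.

# Minimal sketch: parse a ReDIF-style attribute/value template into a
# dictionary. The sample text below is an illustrative approximation of
# the format; the handle and URL are made up for this sketch.
sample_template = """\
Template-Type: ReDIF-Paper 1.0
Title: Academic self-documentation
Author-Name: Thomas Krichel
Author-Name: Simeon M. Warner
File-URL: http://archive.example.org/papers/self-doc.pdf
Handle: RePEc:xxx:wpaper:0001
"""

def parse_template(text):
    """Collect attribute/value pairs; repeated attributes become lists."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        attribute, value = line.split(":", 1)
        record.setdefault(attribute.strip(), []).append(value.strip())
    return record

if __name__ == "__main__":
    for attribute, values in parse_template(sample_template).items():
        print(attribute, "=", "; ".join(values))

A simple format of this kind is one reason why a RePEc archive can be maintained without special server software: the metadata sit as plain files that any harvester can fetch and parse.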
RePEc is not only a database of papers in economics, but it also
contains data about economics institutions and academic economists.
The registration of institutions is accomplished through the
EDIRC
project. The acronym stands for "Economics Departments, Institutions
and Research Centers". This dataset has been compiled by Christian
Zimmermann, an
Associate Professor of Economics at Université du Québec à
Montréal, on his own account, as a public
service to the economics profession. The initial intention was to
compile a directory with all economics departments that have a web
presence. Since there are many departments that have a web presence
now, a large number are now registered, about 6,000 of them at the
time of this writing. All these records are included in RePEc. For all
institutions, data on their homepage is available, as well as postal
and telephone information.
EDIRC is mainly linked to the rest of the RePEc data through the
HoPEc personal
registration service. This service can be used by economists to
register themselves as authors of the documents that are contained in
RePEc. To date, 10% of all papers have at least one author who
is a registered person. The HoPEc registrations will in the future be
used to build a collection of papers held on the homepages of
these authors. Already, the registration data are used to link from the
papers of authors to their homepage and for the provision of current
contact information. Clearly, current contact information may be quite
different from the contact information in the bibliographic data.
Thus RePEc is probably the most ambitious project in Internet
scholarly communication to date. The final aim is that every author,
institution and document in economics will be registered in a
database. The project will therefore need constant attention and never be
finished. It has to rely on the volunteer efforts of academics to
supply data. Recent efforts to improve the collection of access and
download statistics across user services have resulted in the
LogEc project. It aims to provide academics
with direct evidence of how well the system disseminates papers. Work
is currently under way to build a citation linking system. This will
allow extensions to HoPEc to collect citation evidence as well,
enabling registered authors to directly access information on which of
their papers have been cited, by whom they have been cited, etc.
From experience, we know that authors are very interested in that type
of information.
4: Centralized versus decentralized archives
For a long time arXiv had virtually a monopoly position in the author
self-archiving world. Its centralized discipline-specific
model--where all papers that are relevant to a certain
discipline are stored on one server--became the only recognized
business model for author self-archiving. Two important points were
completely overlooked at that time.
First, the centralization of arXiv was a gradual process. Before
1994, archives for some new subject areas were started at other
sites. These sites used the same software as arXiv. In November 1994
the data from the remote sites were moved to the central site, and the
remote sites became mirrors. The reason for this reorganization was
the need for stability of organization and access.
arXiv is by far the largest--in terms of freely available
scientific papers--author self-archiving project. While it is
possible that the arXiv model could be successfully applied to all
other disciplines, there are good reasons to question this. There
have been attempts to emulate the success of arXiv by building
discipline-based archives for other disciplines. Two working examples
are CogPrints at the University
of Southampton, since 1996,
and the Economics Working
Paper Archive at Washington University
in St. Louis, since 1993. Neither has grown beyond 1,500
documents. Both initiatives have been run by members of the community
they serve. Therefore they cannot be dismissed as failing because
they are foreign to the discipline and not sensitive to its particular
needs. There is no satisfactory answer as to why these
archives have failed. While there are specific answers that one can
find for each individual attempt, there is not enough commonality for
a simple theory of the failure of centralization that would hold
across many projects.
Our theory of the emergence of centralized or decentralized archive
systems is that it depends on the communication pattern that prevailed prior to the
ubiquitous deployment of, and access to, computer networks. Computer
Science, for example, has a tradition of tech reports. These are
reports of recent research findings that are issued by departments
prior to their formal publication in a peer-reviewed journal. This is
exactly the same channel as the working papers in Economics. By
contrast, much of Physics and Mathematics have a preprint
tradition. In a preprints system,
the tradition is for the author to circulate a
recent paper privately, without the intervention of the
department. Thus it was more natural for them to submit to a
cross-departmental, central, discipline-focused archive. What will
happen in the disciplines that neither have a working paper nor a
preprint tradition is not clear. We advance the hypothesis that
neither the centralized nor the decentralized discipline-specific
systems will find much acceptance. Instead, a cross-disciplinary
institutional archiving strategy may be more appropriate.
5: The situation in Computing, Library and Information Sciences
The dichotomy between centralized and decentralized archives
surfaced in the history of self-archiving in Computing, Library and
Information Science. The computer science (CS) community has
naturally been one of the first to be aware of the availability of the
Internet for scholarly dissemination
purposes. Maly
et al. (1993) note that "Shortly
before Snowbird'92 ... it appears that there are at least thirty
different efforts going on, and at least as many departments would
like to have electronic distribution but as yet do not."
We note that, as predicted from the pre-Internet days, departments,
rather than individual authors, were the movers. Maly
et al. (1993), in the next paragraph, note
that "We quickly realized that having thirty or so different
systems is not necessarily better than having no solution at all."
Clearly the community felt a need for an organized distribution.
The solution that was adopted was based on the WAIS (Wide Area
Information Server) software. A central installation of this software
could build a central index of bibliographic data. Users would use
software implementing standard Internet protocols--note
that we are talking about pre-web days here--to communicate
with the central server. At the departmental level, software called
"techrep" would be used to maintain a bibliographic file that would
be periodically fetched by the central index, and to maintain a
collection of full-text documents. This was the setup used in the
WATERS project. The project was funded by the NSF; the final report
is Maly et al. (1994).
Another project running at roughly the same time was the DARPA-funded
CSTR project. The pioneering contribution of this project was the
development of the Dienst protocol, as initially described by
Davis and Lagoze (1994).
Dienst is a purpose-built protocol for the management of a distributed
digital library. It provides for an open interface for a digital
library. The interface uses the http protocol to communicate with Dienst
servers. Thus any web client can issue Dienst requests to a
server. Dienst servers respond to "verbs" that request a certain
function to be performed. The http request sent to the server
contains the verb and additional attributes to the verb. One important
additional attribute is the version number of the verb. Since Dienst
version 4, each verb has an explicit version number. Individual verbs
can be updated without breaking the protocol as a whole.
For convenience of exposition and housekeeping Dienst verbs are
grouped by "services", i.e. groups of verbs that perform
related functions. Quoting from the protocol document for
version 4, Davis et al. (1998),
Dienst version 4 has the following services:
Repository Service "allows a given document to be stored in many different formats, and provides messages to obtain the document or pieces of the document in any of the stored formats."
Meta Service lists publishers, repositories, index servers, etc.
Index Service "searches a set of descriptions of documents and return handles for those that match."
UI Service
"presents information from the other services in a human readable form."
LibMgt Service allows for the management of documents in the
repository.
Registry Service "provides a persistent database of Dienst users and is required by the subscription service." However the subscription service is
not defined in the protocol document.
Info Messages allow clients to find out information about Dienst servers
and services.
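To illustrate the request pattern just described--a verb, its version number and additional attributes carried in an ordinary http request--the following sketch issues a single request to a Dienst server. The host name, service, verb and attribute names used here are hypothetical placeholders chosen for the illustration only; the authoritative request syntax is specified in Davis et al. (1998).

# Illustrative sketch only: package a Dienst-style verb, its version and
# its attributes into an http GET request. Host, service, verb and
# attribute names are hypothetical placeholders, not the exact Dienst
# syntax (see Davis et al. 1998 for the real specification).
import urllib.parse
import urllib.request

SERVER = "http://dienst.example.edu"  # hypothetical Dienst server

def dienst_request(service, verb, version, **attributes):
    """Send one verb, with its version number and attributes, as an http GET."""
    query = urllib.parse.urlencode(dict(version=version, **attributes))
    url = "%s/Dienst/%s/%s?%s" % (SERVER, service, verb, query)
    with urllib.request.urlopen(url) as response:
        return response.read()

# Example call: ask a hypothetical Index service to search for a term.
# print(dienst_request("Index", "Search", "4.0", terms="metadata"))

The point of the sketch is simply that any http client can play the role of a Dienst client; the sophistication lies in the server, which is precisely where the staffing problems discussed below arose.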
There
is an unversioned rewrite of Dienst in XML available on the NCSTRL
site, but as far as we are aware, there is no software implementing it.
Early Dienst versions used distributed searching, but later versions
introduced index servers for better performance. From a
technical point of view, Dienst is a powerful
tool in that it allows for an open, object-oriented interface that
can be extended locally. For example, a site may provide its own
verbs in the UI service to allow for additional functionality.
However, from an organizational point of view, running a
Dienst server required fairly computer-literate personnel.
Thus although the CSTR project's methods were more elegant and
complete, they required more input from departments than WATERS
did.
In 1995, the CSTR and WATERS projects merged to become the
Networked Computer Science
Technical Report Library, NCSTRL,
pronounced "ancestral". NCSTRL came in two flavors, "lite" and
"standard". Lite sites essentially continued to operate as they
had done in the days of WATERS, with a central site running a Dienst
server on their behalf. Standard sites ran a Dienst server directly. About half the
papers in NCSTRL were in lite sites.
In May 1997 the Association for Computing
Machinery formed a committee to deal with
computing research papers. They decided in favor of a hybrid between
arXiv and NCSTRL called the Computing Research Repository
(CoRR).
The ACM Working
Group on Publication Repositories (1998)
elegantly summarizes the design
choices made for CoRR by the committee; therefore we do not repeat
them here.
The original announcement email for CoRR, from September 1998, said
Researchers have made their papers available by putting them on personal
web pages, departmental pages, and on various ad hoc sites known only to
cognoscenti. Until now, there has been no single repository to which
researchers from the whole field of computing can submit reports.
That is about to change. Through a partnership of ACM, the Los Alamos
e-Print archive, and NCSTRL (Networked Computer Science Technical
Reference Library), an online Computing Research Repository (CoRR) has
been established. The Repository has been integrated into the collection
of over 20,000 computer science research reports and other material
available through NCSTRL and will be linked with the ACM Digital Library.
The Repository is available to all members of the community at no charge.
The statement "this is about to change" has been somewhat premature, if
it is understood to apply to the whole of the paragraph. The promise of
a link with the ACM digital library was of course one of they key promises
of the new archive, associating it with the most important scholarly
society in the field. But that "link" never happened. The same passage
on the home page of CoRR now reads
Computing research relies heavily on the rapid dissemination of
results. As a result, the formal process of submitting papers to
journals has been augmented by other, more rapid, dissemination
methods. Originally these involved printed documents, such as
technical reports and conference papers. Then researchers started
taking advantage of the Internet, putting papers on ftp sites and
later on various web sites. But these resources were fragmented. There
was no single repository to which researchers from the whole field of
computing could submit reports, no single place to search for research
results, and no guarantee that information would be archived at the
end of a research project.
This changed in September 1998. Through a partnership of ACM, the Los
Alamos e-Print archive, and NCSTRL (Networked Computer Science
Technical Reference Library), an online Computing Research Repository
(CoRR) has been established. The CoRR is available to all members of
the community at no charge.
In the meantime, CoRR has not been doing well. In 1998, despite the
short time that it was available, 223 papers were submitted. In 1999
277 papers made it to CoRR, in 2000 there were 444 and for 2001 a
figure of about 500 is expected.
To make matters worse, the user interface
sprinkles these papers across 34 subject categories. Thus each category
gets a rather poor endowment of papers, which adds to the general
impression that the acceptance of the service is low. None of the
plans that the web site describes as "in discussion" have yet been
realized.
In the meantime, the existence of CoRR may have created an expectation
in the Computer Science community that NCSTRL was running out of
steam. Unfortunately, such predictions can be self-fulfilling. At the
time of writing this paper, the NCSTRL homepage has a note
NOTICE: This site is being maintained on a legacy basis and support
for it is minimal. As a result, some of the functionality is
limited. We will be posting a message in the near future about the
future status of this site. Posted 2000-03-05.
The message that was supposed to be forthcoming is not found on the
site, and despite watching the relevant mailing lists closely, the authors
have not found any such announcement elsewhere. It seems that NCSTRL, if not
terminated officially, is in serious crisis.
The failure of NCSTRL is worrying for two reasons. First it is a
failure of work for the Computer Science area. Many digital library
efforts find it very hard to access appropriate computing expertise at
times when such expertise is in high demand by other sectors that have
more financial resources. But such expertise should have been
plentiful in Computer Science departments. Second, the failure comes
after years of development and on a collection of considerable
scale. NCSTRL started as early as 1993 and has about 10,000 papers
available in electronic format. After arXiv and RePEc, this is the
third-largest collection of freely downloadable documents.
We suspect that the failure is not of a technical but of a managerial
nature. As outsiders, we can only speculate about what
the problems are.
One possible explanation is the double nature of NCSTRL. It was
conceived as both a digital document collection and a digital library test
bed. In those circumstances, there is constant experimentation.
To allow several versions of Dienst to run on servers, each
verb in Dienst had a version number, and more recent versions of the
Dienst software could still execute the old verbs. Nevertheless, in
order to install more recent versions of the Dienst software, archives had
to do extra work. NCSTRL could not simply be a service without
research components without compromising the funding position of the
collection. arXiv, by contrast, has continued funding as a service; RePEc has no
funding and is vulnerable because of that.
Another possible reason for failure may have been the NCSTRL
decentralized architecture and sophisticated protocol. People
operating the servers needed to be highly computer literate. To
obtain such staff at reasonable cost, participating CS departments
hired graduate students. The problem with this strategy is that it
took the students some time to learn the job; by the time they were trained,
the next person took over. This implies discontinuities in archive
management. This is also a problem for RePEc, but the technical
architecture of RePEc is much simpler, and therefore the problems are
much smaller. In addition, no special server is needed to
run a RePEc archive; everything can be done "manually" on a file
system that is accessible by ftp and/or http.
Let us summarize. There are more Computer Science
papers available on the Internet
than ever before. There have been some attempts to organize them in
formal digital libraries. These have failed. They were
pieces of technical wizardry but social misfits, i.e. they
were not accepted by the community. Ten years after the
Snowbird'92 conference, there is still no formal author self-archiving
system in computing that has widespread acceptance.
The closest thing that the computer scientists
have to an author self-archiving system is the CiteSeer system. It
collects papers autonomously, applies a few heuristics to see whether a
document looks like a scientific paper, extracts the references, in which it finds
links to further papers, and so on. It is essentially an automated system, a
very powerful one for resource discovery but of limited value as an
alternative scholarly communication outlet.
In Library and Information Sciences the situation is different. Since
2000, there has been a project called ReLIS. It is a RePEc clone. It has a
user interface called
DoIS. It is a
welcome step in the right direction, but the content is still quite
thin. If this project expands to cover Computer Science,
it could become an important
player.
6: The Open Archives Initiative
In the summer of 1999,
Van de
Sompel, Krichel, Nelson et al. (2000)
conducted an experimental study to set up a common search interface to
e-print archives, known as the Universal Preprint Service (UPS)
prototype. They obtained permission from a number of archives to use
their metadata in the prototype. Since then, there have been further
moves towards improving the interoperability of e-print archives.
A call for a
UPS meeting was issued by Paul Ginsparg, Rick Luce and Herbert Van de
Sompel. The motivation was to improve the interoperability of e-print
initiatives. At that time, these were, in order of size, arXiv, RePEc,
NCSTRL, and CogPrints. In addition, the electronic dissertation
network NDLTD, the digital Highwire
Press, the Physics reference service
SLAC-SPIRES, and a few
others were represented at the meeting. An initial proposal was on the
table for a "Santa Fe Convention" that would outline usage and
access conditions to metadata of archives. This proposal was rejected
as being too legalistic. Instead it was decided that the permission to
access should be given indirectly, through the construction of a
technical interface that is designed to provide access to metadata.
This means that a data provider wishing to give access to its metadata to
a service provider would do so by implementing that interface. Thus the
RePEc business model, with its separation
of data providers and service providers, was taken as an implicit
reference model.
The meeting did reach a rough consensus on the Santa Fe
Convention, later published as
Van de Sompel
and Lagoze (2000). The Santa Fe
Convention included a harvesting protocol for metadata, using a subset
of Dienst. It also included a mandatory common metadata set to be
exchanged among compliant e-print
archives. This Santa Fe metadata set was only intended as a minimum;
the convention explicitly welcomed the deployment of other metadata sets.
The UPS initiative was soon renamed the
Open Archives Initiative. While the OAI started as an e-print
interoperability framework, other communities soon showed keen
interest in the protocol. At a meeting of the OAI
technical committee in September 2000, a new protocol
document was outlined that can be used for the interoperability of any
type of digital library. The compatibility of the metadata harvesting
protocol with Dienst was broken. The e-print-specific Santa Fe
metadata set was deprecated. A working party for a new e-print-specific
metadata format was formed. The Academic Metadata Format, as discussed in the
next section, is an output of that working group.
The key deliverable from the OAI is the
Open Archives Protocol for Metadata Harvesting.
It is designed to provide a low barrier to interoperability.
Key features include:
support for multiple metadata formats,
requirement for Dublin Core (DC) metadata as a means of
global interoperability,
use of Hypertext Transfer Protocol (HTTP) transport, and
use of Extensible Markup Language (XML) encoding.
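As a small illustration of how little is needed on the service-provider side, the sketch below harvests Dublin Core records from an OAI data provider by issuing a ListRecords request over http and reading the XML response. The base URL is a hypothetical placeholder, the sketch assumes the conventional dc: prefix for Dublin Core elements, and it ignores error handling and the resumption tokens that the protocol uses to page through large result sets.

# Minimal OAI harvesting sketch: ask a data provider for records in the
# mandatory Dublin Core format and print their titles. The base URL is
# hypothetical; flow control (resumption tokens) and errors are ignored.
import urllib.request
import xml.dom.minidom

BASE_URL = "http://archive.example.org/oai"  # hypothetical data provider

def harvest_titles(base_url=BASE_URL):
    url = base_url + "?verb=ListRecords&metadataPrefix=oai_dc"
    with urllib.request.urlopen(url) as response:
        doc = xml.dom.minidom.parseString(response.read())
    # Assumes the response uses the conventional "dc:" prefix.
    for title in doc.getElementsByTagName("dc:title"):
        if title.firstChild is not None:
            print(title.firstChild.data)

if __name__ == "__main__":
    harvest_titles()

The low barrier is visible here: a few lines of generic http and XML handling are enough to read metadata from any compliant archive, which is what makes the separation of data providers and service providers workable.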
It has been suggested that the OAI framework be used to support
institution-based digital archives. These archives will contain
research results produced in an institution--from all
disciplines--and archived in the library. The ARNO
project (Academic Research
in the Netherlands Online) is a small-scale but pioneering effort to
do just this. It remains to be seen how successful it will be. For
libraries to take on the rôle of archiving material produced within
their own institutions, rather than making available material
produced by intermediaries outside
academia, is a big step. It implies a change in their perceived
rôle.
If this change in the rôle of libraries is achieved, there can be
little doubt that the establishment of institution-based archives is a
better and faster way to free scholarly literature than waiting for
each discipline to embrace author self-archiving in its own
proprietary way. We expect that, whatever the final outcome, there
will be increasing numbers of both institution- and discipline-based
archives in the near term. We think that the degree of
interoperability and cooperation between these archives will strongly
influence their success. In our opinion, the most remarkable
contribution of the OAI to date is not the harvesting protocol
itself but the spirit of collaboration that it has created
between different communities, different archives, and between data
and service
providers. It is clear that this will be beneficial to the providers
of such collections and their users.
7: The Academic Metadata Format
The Academic Metadata Format is described in
Brody et al. (2001).
The primary aim of AMF is to support the free dissemination of
scholarly literature on the Internet.
One approach to increasing the amount of freely available scholarly
literature is to convince more authors to submit to
formal archives, such as arXiv or the RePEc archives. This will only
be successful if the benefits of submitting a paper outweigh the
costs of submission. The key element to maximizing the benefits of
submission is to raise the visibility of the submitted documents.
This can be achieved if the submitted documents appear in many
high-quality user services. The Open Archives Initiative facilitates
inclusion in many user services. However the quality of these services
will crucially depend on the quality of the metadata that is supplied
to them. As soon as a user service does more than full-text indexing,
it will need a large stock of high-quality metadata. AMF supports
this approach by providing a format which can facilitate the
exchange of high-quality metadata.
A second approach is to help improve the dissemination infrastructure
of the homepage publication movement. This necessitates improvement in
the quality of the metadata available through such pages by supporting
simple, author-supplied metadata. These data can then be fed into the
Open Archives system using an intermediate provider. For this or a
similar approach to work, there must be a basic level of metadata that
is simple and intuitive for a non-specialist.
Simplicity is a key design goal for AMF. AMF does not ignore
existing efforts. It adopts appropriate elements of
standard vocabularies that have been proposed, mainly by the
resource discovery metadata community. These are the Dublin Core
Metadata Element Set, DCMI (1999), the Dublin Core
Qualifiers, DCMI
(2000b), the DCMI Type Vocabulary,
DCMI
(2000a), the OpenURL syntax,
Van de Sompel,
Hochstenbach and Beit-Arie (2000),
and vCard, Howes, Smith
and Dawson (1998).
As far as the syntax is concerned, the AMF design is constrained by
the OAI adopting XML as the base format. AMF makes no use of the Resource
Description Framework (RDF) promoted by the World Wide Web
Consortium as set out in
Lassila and Swick
(1999). It is not clear if this is a good
design choice. Implementing RDF would be a bit harder on providers of
metadata, but it would allow the use of software tools that manipulate RDF.
AMF aims to describe the following entity classes:
resources
groups of resources
people
organizations
AMF allows for specification of properties for instances of the entity
classes, and the specification of relations between them. As far as
the general syntax is concerned, AMF borrows from natural
language. The XML elements that represent the fundamental entity
classes are called "nouns". XML elements that give properties to
nouns are called "adjectives". Some of them admit other adjectives
as children, but most of them admit no children. Just as in natural
language, adjectives are used to qualify a noun. To make a
relationship between two nouns, AMF uses "verbs". A verb must have
at least one noun as child element. Just as in natural language, verbs
are used to relate two nouns. An
example is useful here.
<text>
  <title>Academic self-documentation</title>
  <hasauthor>
    <person>
      <name>Thomas Krichel</name>
    </person>
  </hasauthor>
  <hasauthor>
    <person>
      <name>Simeon M. Warner</name>
    </person>
  </hasauthor>
</text>
This example uses the text and person nouns, the hasauthor
verb, and the adjectives title and name. The use of relational
features is further enhanced through the use of identifiers. Each noun
may be given an identifier. It may also refer to an identified noun
as an alternative description. Thus the previous
example--leaving out Thomas for brevity--may be written
as
<text id="mitaka">
<title>Academic self-documentation
</title>
<hasauthor>
<person ref="simeon_warner"/>
</hasauthor>
</text>
<person id="simeon_warner">
<name>Simeon M. Warner</name>
<isauthorof>
<text ref="mitaka"/>
</isauthorof>
</person>
The interesting feature of this example is that the two records may be
maintained by different persons, in different files. The record about
the person may be maintained by the person herself. RePEc already uses
personal records just like those in the example, that are created and
maintained by registrants, see
Barrueco Cruz, Klink
and Krichel (2000). The id and ref
constructs may be used in many other circumstances. For
example, they can be used to add records that classify identified
documents in subject classification schemes.
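The following sketch indicates how a user service might resolve such id/ref links across records that are maintained by different parties. The AMF fragments are taken from the example above; the enclosing wrapper element and the resolution logic are our own illustration and are not prescribed by AMF.

# Sketch: resolve AMF-style id/ref links between separately maintained
# records. The records are those of the example above; the <records>
# wrapper element is an assumption made only for this sketch.
import xml.etree.ElementTree as ET

records = ET.fromstring("""
<records>
  <text id="mitaka">
    <title>Academic self-documentation</title>
    <hasauthor><person ref="simeon_warner"/></hasauthor>
  </text>
  <person id="simeon_warner">
    <name>Simeon M. Warner</name>
    <isauthorof><text ref="mitaka"/></isauthorof>
  </person>
</records>
""")

# Index every noun that carries an identifier.
by_id = {node.get("id"): node for node in records if node.get("id")}

# For each text, follow the hasauthor references to the person records.
for text in records.findall("text"):
    for stub in text.findall("hasauthor/person"):
        person = by_id.get(stub.get("ref"))
        if person is not None:
            print(text.findtext("title"), "is authored by",
                  person.findtext("name"))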
The most important innovation of AMF is not of a technical
nature. Rather, it is the vision behind the descriptive
model. It emphasizes the academic world as a process. Its description
relates resources to non-resources. This is in contrast with
the view of traditional library metadata efforts, which focus exclusively
on resources.
8: Steps forward
Even if nothing more is done, there will be many papers freely
available online. Most of them will be made available through the
homepages of their authors. They may be withdrawn at any time. There
is no bibliographic organization of these papers other than that
provided by machines. Papers in the free layer can be found through
general Web search engines, or through a specialized engine like
ResearchIndex. But, since these
papers are in places where they can be modified by authors, it does
not appear to be possible to base a certification system on these
papers. Therefore, it is likely that a toll-gated layer of quality
controlled, final publication will persist.
This default scenario has been defended by
Arms
(2000). He envisages the co-existence
of an expensive layer of a research library that is powered by humans,
with the extensive quality control of the data, and a free layer that
is essentially computer generated. Author pressure, he speculates,
will make a lot of research papers openly available. But the
bibliographic layer, since this is costly to produce, is not likely to
be free. Some elements of the construction of the bibliographic
interface cannot be fully automated. This concerns, for example, the removal of
duplicates, dealing with moved, changed or merged collections,
maintaining a correct set of author data, etc.
On the other hand, it appears unlikely that Computer Science will go
along with organization in a centralized archive like CoRR or a
successor to it. Such an archive would have several advantages. The quality
control of the papers can be done in overlay services. These overlay
services may contain links to papers on the CoRR site that have been found
fit for inclusion in the overlay service. They may also contain
reviews of the free material. Another important advantage would be
that digital preservation would be much easier. But despite these
advantages, if the community does not bother to use such an archive,
then these advantages cannot be reaped.
All we can really hope
for is an intermediate scenario, with a set of open
archives that overlap and/or cooperate. There are already some
initiatives that could benefit from further collaboration. These
include the
ResearchIndex service
that we have already mentioned, the
DBLP bibliographic index, and the
Collection of Computer Science
Bibliographies. The
holdings of these collections could be combined and duplicates
removed. Exporting such a combined collection through an open archive would
open the possibilities for extended services, such as personal
registration, access logging, and peer-review.
9: Conclusions
Computer geeks have managed to make whole computer operating systems
available on the WWW. At the same time academics have not
been able to make a coherent index to their works available, let alone
organize the free availability of the full text of these works.
Building such a bibliography is--from a
technical point of view--a much simpler task than the distributed
maintenance of a computer operating system, yet it seems to be more
difficult to do for those who are supposed to do it.
For Computing, Library and Information Sciences, this pathetic state of
affairs is particularly embarrassing. These disciplines wish to find
out--among other things--how knowledge is efficiently
managed and disseminated. If their communities do not manage to build
an efficient system for themselves, the methods that they propose will
have little credibility with other communities.
While there is more and more freely accessible academic content on the
Internet, the organization of that content is much less useful than
the organization of content in formal archives and libraries. The
Open Archives Initiative (OAI) has developed protocols that improve on
this state of affairs by permitting interoperability between archives.
AMF is able to encode academic output as a process, rather than a
set of resources. Large collections of AMF data will open the door to
applications in the area of quantitative assessment of
academic work. All this has very exciting potential.
At the same time, recent negative developments like the demise of
NCSTRL and the stagnation of CoRR suggest that new entrants should
come to fill the gap.
To stimulate the provision of content, the collection must be
conceived as an advertisement of the work of its contributors. To do
that, the contributing person or institution must be at the center of
the collection effort. We need a break with the tradition of
libraries. In a traditional library setting, the work is at the center
of the descriptive effort. In an academic self-publishing setting, the
contributor, rather than the works created by the contributor, must be
at the center of the collection effort.
The authors are grateful
for comments on this paper by an anonymous referee. Thomas Krichel is especially grateful for
the hospitality of the Department of Economics
at the University of Surrey, where much of
his contribution to this paper was made.