|
| Disintermediation of Academic Publishing through the Internet:
| An Intermediate Report from the
| Front Line
| 26 May 2001
|
|
|
|
|
|
| Abstract
|
| There has been a lot of discussion about the potential for free access
| to scholarly documents on the Internet. At the turn of the century,
| there are two major initiatives. These are arXiv, which covers Physics,
| Mathematics, and Computer Science, and RePEc, which covers Economics.
| These initiatives work in very different ways. This paper is the fruit of
| collaboration between authors working for both initiatives. It therefore
| reflects the perspective of people working to achieve change, rather
| than an academic perspective of pure observation.
| We first introduce both arXiv and RePEc, and then consider
| future scenarios for disintermediated academic publishing.
| We then discuss
| the issue of quality control from an e-print archive point of view.
| Finally, we review recent efforts to improve the interoperability of
| e-print archives through the Open Archives Initiative (OAI). In
| particular, we draw on the workshop on OAI and peer review held
| at CERN in March 2001 to illustrate the level of interest
| in the OAI protocol as a way to improve scholarly communication
| on the Internet.
|
|
|
|
|
|
| 1: Introduction
| The exchange of contracts over the Internet is now commonplace.
| In the majority of cases, delivery of the merchandise occurs off-line.
| However, purely informational
| commodities--such as statistical data or pornographic
| pictures--can both be contracted upon and delivered over the
| Internet. That affords the holders of such commodities the opportunity
| to contract directly with customers in a way that was not possible
| off-line. The new medium thus provides an opportunity for
| disintermediation.
| In the academic world, the debate about the possible extent of
| disintermediation has concentrated on the rôle of academic
| publishing. A large part of academic writing attracts no payment upon
| publication. For the sake of simplicity, this paper deals exclusively
| with scholarly works for which the author receives no payment.
| These will be referred to as "research papers" or "papers" for
| short. It is further assumed that the advent of the Internet will not
| change the reward structure in the academic world. We assume that
| authors will still want to write without payment, with the exclusive
| aim of achieving peer recognition.
| It has been proposed, most vociferously by
| Harnad
| (1995) and in many papers since,
| that the Internet will lead to free access to academic papers. In
| 1995 two large-scale initiatives were already well under way to work on
| realizing this vision. These were the arXiv archive for Physics, and
| the RePEc dataset for Economics. This paper has been written by
| people working within these initiatives. It is a contribution to the
| debate about "freeing the literature". The volume of papers on this
| topic is already quite large. What justifies a new paper on the topic?
| The process by which the academic literature is to be freed is most
| often referred to as author self-archiving. This is somewhat of
| a misnomer, but we will stick with the term for a moment. This paper
| has been written by practitioners of author self-archiving. Thus it
| offers practical insights into the process that are often not available
| to the outside observer. More generally, we have also been involved in
| recent efforts to coordinate author self-archiving initiatives in the
| Open Archives Initiative. We will therefore take a look beyond the
| day-to-day running of our work and propose some steps that would bring
| the process forward. Thus we bring two innovative perspectives to the
| literature, a practical perspective and a normative one.
| The remainder of the paper is organized as follows. In Section
| 2 we introduce the arXiv archive, and in
| Section 3 we discuss the RePEc dataset. In
| Section 4 we suggest scenarios for future
| Internet-based academic publishing. In Section
| 5 and Section
| 6 we discuss steps forward and setbacks for the
| author self-archiving movement. In Section
| 7 we discuss the question of quality control. In
| Section 8 we present recent work of the OAI.
| In Section 9 we offer a few conclusions.
|
| 2: The arXiv archive
| The arXiv e-print archive is the largest and best-known archive of
| author self-archived scholarly literature. It is discipline-based and
| centralized, in that all submissions and the master database are at
| one site. Here we briefly review the history of arXiv and then
| discuss its place in the scholarly publishing environment.
| The arXiv e-print archive, formerly known as xxx,
| was started in August 1991 by Paul Ginsparg. The first subject
| area was theoretical high-energy physics, `hep-th', and it was
| initially little more than an email reflector for paper exchange. In
| 1992 an ftp interface was added and the archive expanded to other
| areas of high energy physics. Development since then has seen steady
| expansion of the subject areas to cover all of physics, mathematics,
| computer science and non-linear systems; and the addition of new
| interfaces and facilities as new technologies have become
| available. Significant developments have included: web interface
| (December 1993); automatic PostScript generation from TeX source (June
| 1995); PDF generation (April 1996); and web upload (June
| 1996). Recently, arXiv has been at the forefront of the Open Archives
| Initiative (OAI).
| arXiv serves an estimated 70,000 users in over 100 countries.
| There were about 13,000,000 downloads of papers in 2000. There were
| over 30,000 submissions in 2000. The number of new submissions is
| growing approximately linearly, with about 3,500 additional new
| submissions each year. The growth in submissions has had little
| impact on the administration of the arXiv although there is an
| increasing reliance on moderators to assure appropriateness. More than
| 98% of the submissions are entirely automated: 68% of them via the
| web, 27% via email, and 5% via ftp. It is interesting to note that if
| just 15 minutes were required for each submission then a full-time
| staff of 7 people would be required to deal with new
| submissions. Instead, arXiv uses less than one full-time equivalent to
| deal with day-to-day operations.
| The Los Alamos site is funded by the US National Science Foundation and
| the US Department of Energy. The 15 mirror sites around the world are
| funded independently; the cost is just a few thousand dollars for a new
| machine every few years and a small amount of system administration effort.
| The high-energy physics community uses the
| TeX text formatting system almost
| exclusively, and this has been very convenient for arXiv. arXiv does
| not accept preprocessed TeX
| submissions; authors must submit the source. This allows generation of
| various types of output including DVI, PostScript in several flavors,
| and PDF. Expansion into other areas of physics means that there are
| now an increasing number of non-TeX
| submissions and this trend is sure to continue. Unfortunately, many
| common word-processing packages produce very inefficient and sometimes
| low-quality output unless used expertly. Experience shows that
| PostScript or PDF submissions require greater screening effort than
| TeX submissions. This is an example of
| how the physics and mathematics communities differ from other
| communities in a way that has favored author self-archiving.
| Before 1994, archives for some new subject areas were started at other sites
| using the same software. In November 1994 the data from the remote sites
| were moved to the central site, and the remote sites became mirrors.
| The reason for this reorganization was the need for stability. While a
| distributed model is appealing, sites relying on a single person,
| whose main occupation is not the archive, are not stable on a
| time-scale of years. Users' perception of stability is important for
| the success of an archive. arXiv is once again investigating a more
| distributed model but any new system will be robust to the disappearance
| of any site.
| arXiv has not been an academic exercise; it was started from inside
| the community it aimed to serve. At all stages of expansion to cover a
| wider subject area, arXiv has been guided and promoted by members of
| the new fields. Nowadays, some commercial publishers tacitly
| acknowledge the legitimacy of arXiv by accepting submissions where the
| author simply quotes an arXiv identifier. Policies vary on whether
| the publisher permits the author to update the arXiv version to
| reflect changes made during the refereeing process. However, authors
| often ignore any prohibitions.
| arXiv is by far the largest author self-archiving project. Some
| proponents of change in scholarly publishing have heralded arXiv as
| the single way forward for the transformation of scholarly
| communication. While it is possible that the arXiv model could be
| successfully applied to all other disciplines, there are good reasons
| to question this. We will highlight two issues.
| First, there have been attempts to emulate the success of arXiv by
| building discipline-based archives for other disciplines. Two
| working examples are CogPrints at
| the University of Southampton and the
| Economics Working Paper Archive. Both
| have been operating for more than four years, yet neither has grown
| beyond 1,500 documents. Why have they not been more successful? This
| leads us to question the applicability of the centralized scheme to
| other disciplines. It should be noted that both initiatives have been
| run by members of the community they serve. Therefore they cannot be
| dismissed as failing because they are foreign to the discipline and
| not sensitive to its particular needs.
| Second, even though arXiv has been successful in attracting
| submissions in several areas of physics and mathematics, an attempt to
| expand to computer science has had limited success. A high-level
| committee on the future of scholarly communication in computing
| entered an agreement with arXiv to run a Computer Science Research
| Repository (CoRR).
| This has, to date, received rather little attention from computer
| scientists in spite of various publicity efforts within the community.
| Are we to conclude that computer scientists care little about formal
| on-line dissemination of research papers? Are they simply happy to
| leave their papers on their homepages?
| If current trends continue, arXiv will provide an increasingly
| complete free layer of primary research literature for the mathematics
| and physics disciplines. This leads us to ask whether it is likely
| that, over time, small-scale archiving, be it in departmental or
| homepage archives, will decline. There is certainly a trend for
| departmental archives in mathematics to migrate to arXiv. However, the
| Open Archives Initiative (OAI) has the potential to change the
| tradeoffs in the choice between centralized and decentralized
| archives. If cross-archive discovery and awareness tools exist for
| decentralized archives then institutional and department archives may
| become as effective as centralized archives such as arXiv.
|
| 3: The RePEc database
| RePEc is much less known than arXiv and it is also less well
| understood. There are two reasons for that. First, it is
| limited to the economics discipline. Second, its business model is
| more abstract.
| Historically, RePEc grew out of the NetEc project. It was started by
| Thomas Krichel in February 1993. In April 1993 he opened a tiny
| collection of electronic papers on an ftp server with a gopher
| interface operating at Manchester
| Computing. However, it was never the business
| plan of NetEc to become an electronic working paper archive for the
| discipline. Instead the idea was to collect data about printed and
| electronic papers as published by various sites so that they could be
| searched together on a single site. It was a library of bibliographic
| data rather than an archive. The project developed mirror sites in
| Japan (July 1996) and in the US (January 1997). In 1997, the RePEc
| dataset was created by the NetEc project, and two other projects that
| were active in the area, DEGREE
| and S-WoPEc. These projects agreed to
| exchange metadata in a common, purpose-built format called
| ReDIF. This
| metadata can be harvested following a simple harvesting protocol
| called the Guildford
| protocol. Shortly
| after the implementation of the protocol, several user services
| appeared that were built on the data. Thus the RePEc project is a true
| forerunner of the Open Archives Initiative. In 1998, data about
| Economics institutions was integrated into the database. In 2000, a
| personal registration service was opened. It allows persons to
| register associations between themselves and the document and institution
| data in the database. At the time of writing, it has over 160
| different archives that contribute metadata, and eight different user
| services. There are about 40,000 downloadable papers cataloged in
| RePEc. The NetEc project received £129,000 in funding from the
| Joint Information Systems Committee (JISC);
| it now runs without any external funding. Running such a large-scale
| operation with volunteer power only is a remarkable technical and
| organizational achievement.
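| To make the exchange format more concrete, the following is a schematic
| example of a ReDIF paper template. The field values and the handle are
| invented for illustration; only the overall attribute structure follows
| the ReDIF documentation.
|
|     Template-Type: ReDIF-Paper 1.0
|     Author-Name: Jane Example
|     Author-Email: jane@example.edu
|     Title: A Hypothetical Paper on Disintermediation
|     Creation-Date: 2001-05
|     File-URL: http://econ.example.edu/papers/0105.pdf
|     File-Format: application/pdf
|     Handle: RePEc:zzz:wpaper:0105
|
| An archive exposes a collection of such templates at a public location,
| from which user services harvest them following the Guildford protocol.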
| RePEc is not only a database of papers in economics, but it
| also contains data about economics institutions and academic
| economists.
| The registration of institutions is accomplished through the EDIRC
| project. The acronym stands for "Economics Departments, Institutions
| and Research Centers". This dataset has been compiled by Christian
| Zimmermann, an
| Associate Professor of Economics at Université du Québec à
| Montréal on his own account, as a public
| service to the economics profession. The initial intention was to
| compile a directory with all economics departments that have a web
| presence. Since many departments now have a web presence,
| a large number are registered: about 5,000 of them at the
| time of this writing. All these records are included in RePEc. For all
| institutions, data on their homepage is available, as well as postal
| and telephone information.
| At the moment, EDIRC is mainly linked to the rest of the RePEc data
| through the HoPEc
| personal registration service. This service can be used by economists
| to register themselves as authors of the documents that are contained
| in RePEc. To date, 6% of all papers have at least one author
| who is a registered person. The HoPEc registrations will in the future be
| used for building a collection of papers held in the homepages of
| these authors. Already, the collection is used to link from the
| papers of authors to their homepages and to provide current
| contact information. Clearly, current contact information may be quite
| different from the contact information in the bibliographic data.
| Thus RePEc is probably the most ambitious project in Internet
| scholarly communication to date. The final aim is that every author,
| institution and document in economics will be registered in a
| database. Thus the project will need constant attention and never be
| finished. While the current status of the project is a remarkable
| achievement, there are two noteworthy problems.
| First, RePEc has rejected centralization of papers as the way forward
| for the collection of grey literature. However, RePEc has a
| commercial competitor in the "Economic Research Network" operated by
| "Social Science Electronic Publishing",
| that operates on a centralized model. Many of the offerings on their
| site are free. There is a substantial overlap with the contents of
| RePEc. At the time of writing, they have reached about 12,000
| electronic papers. Thus they are still much smaller, but they seem to
| be expanding at a faster rate. There are two reasons for the success of
| the company. First, they have extensive statistical reports on
| download of papers, something that is next to impossible to collect in
| the context of a multi-service operation. Second, the company has
| started cooperative agreements with departments who wish to sell the
| working papers that they produce. If this trend expands, the Internet
| era will see for-fee access to material that used to be free in
| the paper days.
| Second, while some commercial publishers already collaborate
| with RePEc, it is not easy to forecast widespread
| collaboration. The main reason is that in economics no
| publisher holds a large amount of material. Many publishers produce
| economics journals, but for each publisher it is a small part of
| their business. Therefore the cost of working with the RePEc-specific
| bibliographic system is relatively high compared to the returns.
| As a result, the contents of RePEc are still dominated by working
| papers.
|
| 4: Future scenarios of publication
| In the introduction, we pointed out that the term "author
| self-archiving" is somewhat of a misnomer. In the arXiv scenario it
| is arXiv that does the archiving. In the RePEc scenario the archiving is
| performed by the intermediate providers like departments and research
| centers. To be quite precise,
| "author self-archiving" would be more appropriately applied to authors
| depositing their papers on their homepages. However, this type of activity
| can hardly be classified as archiving because the concept of an
| archive suggests a long-run storage facility, which personal
| homepages rarely provide.
| Thus we would like to introduce the concept of academic
| self-publishing as a replacement for author self-archiving.
| By academic self-publishing we mean a trend for the academic sector
| to take over the publication of its own output. It
| should be noted that we understand "publishing" here in the wide
| sense of the word as "making public", rather than in the narrow sense of
| "making public through an intermediate agent to whom copyright is
| being transferred".
| Both RePEc and arXiv fit this description.
| Academic self-publishing of research material is currently being
| experimented with on a per-discipline basis. A useful discussion
| of some initiatives is
| Kling and McKim
| (2000). We agree with the basic
| message of their paper. There will be no single method by
| which networked scholarly
| communication operates across disciplines. Discipline-specific
| differences are likely to persist. For each discipline, it is most likely
| that evolution depends on
|
| 1. the established communication patterns prior to the proliferation
|    of electronic networks
| 2. the presence or absence of entrepreneurial pioneers to
|    stimulate change
| 3. the political and financial environment of the discipline
|
| in that order. The persistence of discipline-specific
| features in academic self-publishing is the first assumption
| for our discussion here.
| A second assumption is that authors will continue to write
| research papers. These are discrete accounts of individual research
| findings written by one author or a small group of co-authors. Thus we
| reject the idea that in the future the work of individual authors becomes
| dissolved in a global scientific hypertext net. In that case it would
| be difficult to trace the responsibility for research results.
| Academic reward structures depend on precise knowledge of the
| authorship of research papers.
| A third assumption is that the Internet will not go away. It will
| always be available as a medium for the distribution of research papers.
| Therefore we assume that a free layer of research papers will always
| exist, even if sparsely populated.
| A fourth starting assumption is that in all disciplines some form
| of toll-gated access to research papers will survive. This assumption
| can be justified by the idea that there is little chance that all the
| teaching material will be made freely available. In all disciplines
| authors of teaching material have always received a monetary
| reward. Since the border between research and teaching documents is
| fuzzy, it appears unlikely to us that there will ever be free access
| to all research documents.
| Having made these starting assumptions, we have two
| points still to determine:
|
| - the extent of the services that are freely available
| - the size of any quality gap between toll-gated and
|   non-toll-gated papers
|
| Here we propose three terminal scenarios. Individual disciplines may adopt
| one scenario or develop a mixture of scenarios.
| In scenario one, there is a free layer of research documents made
| available on the web by their authors. They may be withdrawn at any
| time. There is no bibliographic organization of these papers other
| than ones organized by machine. Papers in the free layer can be found
| through general Web search engines, or possibly through a specialized
| engine like
| inquirius.
| But, since these papers are in places where they can be modified by
| authors, it does not appear to be possible to base a certification
| system on these papers. Therefore, there will be a toll-gated layer of
| quality controlled, final publication. We assume that these
| quality-controlled collections of research papers will have access
| restrictions. Most of them will only be accessible to subscribers.
| This toll-gated layer will have good bibliographic descriptions that
| are specific to each vendor, but there will be no common catalog in
| the public domain.
| Although this scenario may appear unlikely, it has been defended by
| Arms
| (2000). He envisages the co-existence
| of an expensive research-library layer that is powered by humans,
| with extensive quality control of the data, and a free layer that
| is essentially computer-generated. Author pressure, he speculates,
| will make a lot of research papers openly available. But the
| bibliographic layer, since it is costly to produce, is not likely to
| be free. Some elements of the construction of the free interface
| cannot be fully automated: for example, the removal of
| duplicates; dealing with moved, changed, or merged collections; and
| maintaining a correct set of author data.
| In scenario two there will be a free layer of research papers and
| there will be a free bibliographic system too. The layer of toll-gated
| and free papers will be decentralized. The bibliographic system will
| comprise both free and toll-gated papers and indications on which
| servers they live. The bibliographic layer may be centrally or
| decentrally organized. In a centralized system all providers of papers
| will submit to a central index of all papers. It is an open question
| how such a server may be organized and funded. In a decentralized
| bibliographic system, all sites that provide papers will have a
| metadata layer that can be harvested through a machine interface. In
| this scenario the problem of funding a central agency arises only
| to a small extent. The only central agency that we think will be
| needed is one that registers participants in the catalog exchange
| scheme. These participants may either be providers of catalog data or
| providers of services using the catalog data.
| In scenario three there will be a large archive of papers,
| which comprises the vast majority of papers in a discipline. The
| quality control of the papers will be done in overlay services. These
| overlay services may contain links to papers on the central site that
| have been found fit for inclusion in the overlay service. They may also
| contain reviews of the free material. We would expect that most
| overlay services are free but that is not crucial to the scenario.
| To compare the three scenarios, let us stress that there
| is an implicit quality ladder. Scenario 1 is the least controlled;
| scenarios 2 and 3 improve upon this because they
| allow for better information retrieval and better facilities for
| quantitative evaluation of the research process. Scenario 3 improves
| over scenario 2 because it makes the long-run preservation of the
| archived data much more secure. A centralized system would be likely to
| find it much easier to implement format conversion than a decentralized
| system.
|
|
| 5: Steps forward
| Having established terminal scenarios we will discuss steps
| that may lead to them.
| If nothing is done, by default, we will have scenario 1.
| The academic sector may find ways to implement scenarios 2 and 3,
| but it is not clear today what should be done. In this section, we make
| some suggestions.
| First, we believe that change to scholarly communication will be
| driven by the community of each discipline. Anyone proposing a new service
| should make sure
| that they receive the maximum backing of the community of the
| discipline that they are supposed to serve. The successful services
| have managed to do that. These statements may state the obvious, but
| there are examples where such elementary advice has not been
| followed. The most prominent is
| Varmus
| (1999), the initial call for the
| creation of PubMed Central. This call was clearly a simple translation
| of the principle of arXiv to the biomedical field. Little thought
| was given to adapting the arXiv business model to the needs of
| the biomedical community. Much of the negative reaction to the
| proposal would never have occurred if its discipline-specific
| implications had been thought out beforehand.
| Second, proper attention has to be paid to the motivation of
| contributors. Too many schemes have been proposed where a lot of
| effort has been spent on developing a user interface, but little
| effort is spent on populating the services. This holds in particular
| for projects that are funded by the library community. Libraries are
| accustomed to addressing end-users, and therefore have not paid much
| attention to the question of contributors. There has been general
| agreement that public access to scientific documents does wonders for
| the exposure that these documents receive. But this additional
| exposure has to be demonstrated to contributors. If it cannot be
| demonstrated, with figures at hand, that organized content is better
| than unorganized content, then it is difficult to see much
| content organization taking place.
| The first step forward must be the collection of content. Any
| collection of content that is organized with some metadata, as long
| as it is large enough, will stimulate interest. The problem with most
| of the existing attempts is that the content is simply too thin. To
| stimulate the provision of content, the collection must be conceived
| as an advertisement of the work of its contributors. To do that, the
| contributing person or institution must be at the center of the
| collection effort. Again, we need a break with the tradition of
| libraries. In a library setting, the work is at the center of the
| descriptive effort. In an academic self-publishing setting, the
| contributor, rather than the works created by the contributor, must be
| at the center of the collection effort.
| For most disciplines, it should be possible to gather free
| research papers that are available on the Internet in a gateway
| catalog. All disciplines have some form of informal communication
| channel, and many of the papers in these channels can be made freely
| available if somebody is willing to "put them up". If a sufficient
| number of papers in a small sub-discipline can be made available, this can
| really change the way the sub-discipline works. Such
| small-scale efforts, opened up to the wider academic world through
| OAI-compliant archives, would be an important step
| forward. Unfortunately, the current climate in academic institutions
| is such that, while writing papers is valued, collecting and organizing
| them is often not, so such work is not incentive-compatible.
|
| 6: Steps backward
| There are a number of failed collections whose
| ruin can provide valuable insights into why collection efforts
| fail. An interesting example is the
| NCSTRL collection
| of Computer Science Technical Reports.
| On that site, we read
|
| NOTICE: This site is being maintained on a legacy basis and support
| for it is minimal. As a result, some of the functionality is
| limited. We will be posting a message in the near future about the
| future status of this site. Posted 2000-03-05.
|
| The message that was supposed to be forthcoming cannot be found on the
| site. Despite watching relevant mailing lists closely, the authors
| have not found that announcement. It seems that NCSTRL, if not
| terminated officially, is in serious crisis.
| The failure of NCSTRL is worrying for two reasons. First, it is a
| failure within the Computer Science area itself. Many digital library
| efforts find it very hard to access appropriate computing expertise at
| times when such expertise is in high demand by other sectors that
| have more financial resources. But such expertise
| should have been plentiful in Computer Science departments. Second,
| the failure comes after years of development and on a collection of
| considerable scale. NCSTRL started as early as 1993 and has about
| 10,000 papers available in electronic format. After arXiv and RePEc,
| this is the third-largest collection of freely downloadable documents.
| We suspect that the failure is not of a technical but of a managerial
| nature. As outsiders, we can only speculate about what
| the problems are.
| One possible explanation is the double nature of NCSTRL. It was
| conceived as a digital document collection and a digital library test
| bed. In those circumstances, there is constant experimentation.
| NCSTRL worked on an elaborate protocol called Dienst to run the
| collection. Dienst was a language with many instructions called
| verbs. To allow several versions of Dienst to run on servers, each
| verb in Dienst had a version number, and more recent versions of the
| Dienst software could still execute the old verbs. Nevertheless, in
| order to install more recent versions of the Dienst software, archives had
| to do extra work. NCSTRL could not simply be a service without
| research components without compromising the funding position of the
| collection. arXiv has continued funding as a service; RePEc has no
| funding and is vulnerable because of that.
| Another possible reason for failure may have been NCSTRL's
| decentralized architecture. Many participating sites were staffed by
| graduate students. This implies discontinuities in archive management.
| This is also a problem for RePEc but the technical architecture of
| RePEc is much simpler, and therefore the problems are much smaller.
| To sum up this discussion, it is crucial that the business plan for
| any collection effort is well thought out. Everything else depends on
| it. The problem with author self-publishing in academic circles is that
| many academic institutions do not have a good record of enterprise, in
| particular in cases where the activity is unlikely to generate much revenue.
|
| 7: The question of quality control
| Defenders of the status quo in publishing point to the quality
| classification of papers that the existing system affords. A free
| publication system must consider the feasibility of
| filtering in a world where anybody can publish anything.
| To tackle the problem of quality control, our experience from managing
| self-publication systems suggests a simple model of paper
| quality. We define just two levels of quality: standard and inadequate.
| Inadequate papers are so poor that almost anybody with a Master's-level
| degree in the discipline who reads them will
| see that they simply do not contain scientific material as it
| is recognized within the discipline.
| Standard papers are the remaining papers, the majority of which will be
| unremarkable. A few will be read by a sizable group, even fewer will be
| cited in the future, and still fewer will be considered seminal works.
| We claim that it is trivial for a member of the appropriate discipline
| to separate inadequate papers from standard papers. We have a
| moderation process instead of the conventional refereeing process.
| On the other hand, distinguishing between different shades of quality
| in standard papers is very difficult and costly in terms of time and
| effort. In the next Section, we will show how the OAI protocols can be
| used to implement a technical infrastructure that would be useful for
| this process. In the remainder of this section, we will look at how
| the filtering of inadequate submissions can be automated.
| The object of automated filtering is to reduce the cost of the
| moderation process while still maintaining sufficient quality control.
| One approach is to allow submission based on institutional affiliation;
| our experience shows that a paper from Cornell University is likely
| to meet the requirements of standard quality. This approach
| might also imply some institutional responsibility for submissions.
| However, in the construction of such systems it is important to
| cater for cases where unaffiliated authors wish to submit
| work that meets the requirements of standard quality. It would
| be a retrograde step for the author self-publication movement
| to be more prejudiced than the existing system.
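| As a sketch of what such affiliation-based filtering might look like,
| the fragment below routes submissions on the submitter's e-mail domain.
| The domain list is an invented stand-in for a registry of recognized
| institutions, and unaffiliated authors are sent to human moderation
| rather than rejected outright.
|
|     # Sketch of affiliation-based pre-screening for a moderation queue.
|     # The domain list is an invented stand-in for a real registry.
|     RECOGNIZED_DOMAINS = {"cornell.edu", "cern.ch"}  # illustrative only
|
|     def route_submission(submitter_email: str) -> str:
|         """Return 'accept' for recognized affiliations, else 'moderate'."""
|         domain = submitter_email.rsplit("@", 1)[-1].lower()
|         if any(domain == d or domain.endswith("." + d)
|                for d in RECOGNIZED_DOMAINS):
|             return "accept"    # presumed to meet the standard-quality bar
|         return "moderate"      # a human moderator screens the submission
|
|     # route_submission("author@physics.cornell.edu")  -> 'accept'
|     # route_submission("author@example.org")          -> 'moderate'
|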
| Another approach is to construct registries of personal
| registration information for academics who are allowed to publish
| on open e-print archives. Depending on the discipline these
| registries might be administered by scholarly societies, academic
| institutions or the archives themselves, with policies reflecting
| the views of the community.
| A system which avoided duplicate registrations would be labor-saving for
| both the maintainers of the registries and the authors, who have to keep
| their information up to date. A shared registry would have several other
| benefits. It could contain contact data shared by several organizations
| in the academic sector, such as scholarly societies. At the moment
| each scholarly society maintains its own registry of members; with a
| shared set of personal data, a society would simply maintain the
| handles of its members in the shared personal dataset.
| There could be a mix of these approaches where personal registration
| might be based partly on the affiliation of a person
| with a recognized institution.
|
| 8: The Open Archives Initiative
| Stimulated by the work of
| Van de
| Sompel, Krichel, Nelson et al. (2000),
| there have been recent moves towards improving the interoperability of
| e-print archives. This work is now called the
| Open Archives
| Initiative (OAI). When the OAI started,
| there was little interoperability between e-print initiatives and
| cross-archive resource discovery was the focus of the initial meeting
| in Santa Fe. The scope of the initiative has expanded considerably
| since then.
| The basic business model proposed by the OAI was inspired by the RePEc
| project. In particular, the Open Archives Metadata Harvesting (OAMH)
| protocol separates data provision and service provision as pioneered
| by RePEc since 1997. The OAMH protocol is a simplification of the
| Dienst subset protocol experimented with after the Santa Fe meeting.
| It was designed to provide a very low barrier to interoperability.
| Key features include:
|
| - support for multiple metadata formats
| - required Dublin Core (DC) metadata as a means of global interoperability
| - use of Hypertext Transfer Protocol (HTTP) transport
| - use of Extensible Markup Language (XML) encoding
|
| There are four means to select between records:
|
| - the datestamp of a record
| - the identifier of a record
| - the metadata format(s) available for a record
| - sets, which are groups of records
|
| There are three commands to support other protocol requests:
|
| - Identify gives basic information about an archive
| - ListMetadataFormats shows the metadata formats that
|   are supported
| - ListSets shows the sets
|
| There are three other commands to access the records:
|
| - ListRecords returns records
| - ListIdentifiers returns identifiers only
| - GetRecord returns a single record
|
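| As an illustration, the following is a minimal sketch of a harvesting
| client for this protocol in Python. The base URL is hypothetical (each
| data provider publishes its own), and error handling, XML namespace
| processing, and the handling of large result sets are omitted.
|
|     # Minimal sketch of an OAMH harvesting client.
|     import urllib.parse
|     import urllib.request
|     import xml.etree.ElementTree as ET
|
|     BASE_URL = "http://archive.example.org/oai"  # hypothetical endpoint
|
|     def oamh_request(base_url, verb, **args):
|         """Issue a protocol request and return the parsed XML response."""
|         query = urllib.parse.urlencode({"verb": verb, **args})
|         with urllib.request.urlopen(base_url + "?" + query) as response:
|             return ET.parse(response).getroot()
|
|     # Selective harvesting: Dublin Core records changed since a datestamp.
|     records = oamh_request(BASE_URL, "ListRecords",
|                            metadataPrefix="oai_dc", **{"from": "2001-03-01"})
|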
| This protocol has attracted a lot of interest from digital
| library communities. It is hoped that it will provide a framework
| for the solution of many interoperability issues in digital libraries.
| We conducted a survey of registered OAI repositories on 8 March 2001 to
| see how many data providers were operational
| less than 2 months after the OAMH protocol was announced.
| In the table below we show the name of the archive and the number of
| record identifiers returned from a ListRecords request.
|
|
| name                                        number
| arXiv                                       155522
| OCLC Theses & Dissertations Repository      102762
| NACA                                          6352
| M.I.T. Theses                                 5196
| Oxford Text Archive                           1290
| Perseus Digital Library                       1030
| CogPrints                                     1028
| NSDL at Cornell                                870
| PhysNet                                        472
| Humboldt University of Berlin                  464
| Resource Discovery Network                     388
| A Celebration of Women Writers                 142
| European Language Resources Association        183
| Linguistic Data Consortium                     216
| University of Tennessee Libraries              201
| The Natural Language Software Registry          78
| California Digital Library                       3
|
| |
| The OAI sponsored a
| workshop on OAI and peer review in Europe
| between March 22 and 24 of 2001. It was organized by the Access
| Division of LIBER, the Ligue des Bibliothèques Européennes de Recherche.
| One working group, led by Herbert Van de Sompel, discussed ways in
| which the OAI framework could be used in the certification,
| by peer review or some other method, of academic works.
| We outline some of the ideas from this discussion in the
| remainder of this section. We hope that it will illustrate the
| potential of the OAMH protocol for more than just the resource
| discovery aspect of scholarly communication.
| Let us first consider what certification means in the current environment
| of traditional paper journals. The type of certification is usually
| peer-review and the indication of this is publication in a
| particular journal. The journal reference is in essence a
| certification metadata for the work even if it exists somewhere
| other than in the journal, for example the author's homepage or
| in some other archive. We will not here consider the issue of
| whether this other copy is true to the certified version. The
| journal publisher provides, in addition to other things, a
| certification service.
| The simplest scenario to consider is the certification of a paper
| that is first placed on an e-print archive. The certification process
| could be initiated by pull from the certification service or push
| from the archive, but is more likely to occur by direct author
| action (such as an e-mail). If the paper exists on an e-print archive
| then the certification service can retrieve it, and also harvest a
| metadata record using the OAMH protocol. After or during the
| certification process the certification service could expose
| certification metadata using the OAMH protocol. This could be
| harvested by the original archive and associated with the e-print
| provided it contained the identifier of the e-print. The American
| Physical Society (APS) already accepts submissions from arXiv (by quoting
| the e-print identifier). There was agreement between the APS
| and arXiv to test this scheme.
| Given that some OAI data providers might then have records for
| both certified and non-certified papers, how could one harvest
| information about only certified records within OAI?
| Technically, there are several ways that this could be accomplished.
| One way is through multiple metadata sets, perhaps one
| for discovery metadata, one for the certification, etc.
| Using the OAMH protocol, it is then possible to harvest metadata only
| for the certified records by requesting the certification metadata
| set (objects that do not have that metadata will not be returned).
| Another approach would be for the data provider to expose
| certified and non-certified sets which could then be harvested
| separately.
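| As a sketch of how a harvester might act on these two options, reusing
| the oamh_request helper from the sketch above: the metadata prefix
| "certification" and the set name "certified" are hypothetical names
| that would have to be agreed within each community.
|
|     # Option 1: request a community-agreed certification metadata format;
|     # records without that format are simply not returned.
|     by_format = oamh_request(BASE_URL, "ListRecords",
|                              metadataPrefix="certification")
|
|     # Option 2: harvest a set that the data provider exposes for
|     # certified records only.
|     by_set = oamh_request(BASE_URL, "ListRecords",
|                           metadataPrefix="oai_dc", set="certified")
|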
| The group also discussed the contents of the certification
| record. It was thought that any certification metadata scheme
| (i.e. data about the certification) should be applicable
| across different communities. At the very minimum it must contain
| a link to the certified document (or a metadata record for it),
| information about the certification service (e.g. journal name), and
| a timestamp for the certification. It might also include a timestamp
| at which the original harvest took place, and an indication of the
| current status for an open certification process.
| There might also be an indication of the certification type.
| A controlled vocabulary would be desirable for this.
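| A schematic example of such a certification record, written here as a
| simple Python mapping; the field names are invented, following the
| minimum contents listed above.
|
|     # Invented field names, following the minimum contents listed above.
|     certification_record = {
|         "document":  "oai:archive.example.org:0105001",  # certified e-print
|         "service":   "Journal of Hypothetical Studies",  # certifying service
|         "certified": "2001-05-26",                       # certification timestamp
|         "harvested": "2001-04-02",                       # original harvest date
|         "status":    "accepted",     # from a controlled vocabulary of types
|     }
|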
| From the above discussion, it is clear that the OAMH protocol
| could be a very useful tool for the transformation of scholarly
| communication through the Internet. Using the OAI, we can accomplish
| new tasks in a collaborative way that are not possible in the
| established setup of essentially closed initiatives.
|
| 9: Conclusions
| While there is more and more freely accessible academic content on the
| Internet, the organization of that content is much less useful than
| the organization of content in formal archives and libraries.
| The Open Archives Initiative (OAI) is well suited to develop protocols
| that improve on this state of affairs by permitting interoperability
| between archives.
| It has been suggested that the OAI framework be used to support
| institution-based digital archives. These archives will contain
| research results produced in an institution and archived in the
| library.
| The ARNO project
| (Academic Research in the Netherlands Online)
| is a small-scale, but pioneering effort to do just this.
| It remains to be seen how successful it will be.
| For libraries to
| take on the rôle of archiving material, rather than making available
| material that is produced by intermediaries outside academia, is a big step. It
| implies a change in their perceived rôle. It remains to be seen if
| libraries will take up that challenge.
| It may be that the establishment of institution-based archives is a
| better and faster way to free scholarly literature than waiting for
| each discipline to embrace author self-publishing in its own
| proprietary way. However, this is by no means clear and the same OAI
| framework that supports interoperability between institution-based
| archives also supports interoperability between discipline-based
| archives. We expect that, whatever the final outcome, there will be
| increasing numbers of both institution- and discipline-based archives
| in the near term. We think that the degree of interoperability between
| these archives will strongly influence their success.
|
|
|
| The authors are grateful to
| Christopher F. Baum and Ivan V. Kurmanov
| for comments on this paper. More comments will be gratefully
| received.
|
|
|
|
|
|
|