The RePEc database and its Russian partner Socionet

2001-09-13


Thomas Krichel		Sergei I. Parinov

Palmer School		Institute of Economics and Industrial Engineering

Long Island University	Siberian Branch of the Russian Academy of Sciences

720, Northern Boulevard		17, Lavrentiev Prospekt

Brookville, NY 11548-1300		630090 Novosibirsk

USA	Russia

krichel@openlib.org		parinov@ieie.nsc.ru

Abstract

After arXiv.org, the RePEc economics library offers the second-largest library of freely downloadable scientific papers in the world. RePEc has a different business model and a different content coverage than arXiv.org. This paper addresses both differences.
As far as the business model is concerned, RePEc has an open architecture. It is open in two ways. It is open for contribution (third parties can add to it), and it is open for implementation (many user services may be created). Conventional libraries--including most digital libraries--are closed in both directions. In this paper, we specifically address a fully interoperable extension to RePEc, the Socionet project.
As far as the content coverage is concerned, RePEc seeks to build a relational dataset about scholarly resources and other content relating to these resources. This basically means the identification of all authors, all papers and all institutions that work in economics. Such an ambitious project can only be achieved if the cost to collect data is decentralized and low, and if the benefits to supply data are large.

This paper is available in PDF for A4 paper, and for letter size paper.

1: Introduction

It is quite trivial to see that the Internet allows people to collaborate on projects to collect information. Such collaboration is particularly interesting when the collection of academic information is concerned. The providers of such information have a vested interest in its widest dissemination. The Internet offers the technical infrastructure to make a free scholarly dissemination system happen. What is needed are social infrastructures that get the collection off the ground with a minimum of external subsidy.

At the time of writing, there are two large-scale discipline-based systems to further scholarly dissemination over the Internet. The first, and most widely known, is the arXiv archive for papers in physics, mathematics and computer science. The second is the RePEc digital library for economics. This paper is concerned with the latter. RePEc provides an "Open Library". Basically, an open library is a collaborative framework for the supply and usage of document data. Using this collaborative framework, RePEc has been able to collect data about over 50,000 electronic publications. Most of them are freely available. The complete dataset also includes data on off-line publications, software items, institutions and author contact details. The collection has over 150,000 items cataloged at the time of writing. These are used in about ten different user services.

In Section 2, we introduce RePEc as a document data collection. In Section 3, we push the RePEc idea further. We discuss the extension of RePEc that allows it to describe the discipline, rather than simply the documents that are produced by the members of the discipline. In Section 4 we discuss recent efforts to improve the incentives for data providers. In Section 5 we introduce the Socionet RePEc user service. We conclude the paper in Section 6.

The efforts of which RePEc is the result go back to 1992. We deliberately stayed away from a description of the history of the work to concentrate on the current status. Therefore, insufficient attribution is given to the people who have earned historic merits by contributing to the RePEc effort. See Krichel (1997) for an account of the early history of the NetEc projects. These can be regarded as precursors of RePEc.

2: The RePEc document dataset

2.1: Origin and motivation of RePEc

A scholarly communication system brings together producers and consumers of documents. For the majority of the documents, the producers do not receive a monetary reward. Their effort is compensated through a wide circulation of the document and a mark of peer approval for it. Dissemination and peer approval are the key functions of scholarly communication.

Scholarly communication in economics has largely been journal-based. Peer review plays a crucial role. Thorough peer review is expensive in time. According to Trivedi (1993), it is common that a paper takes over three years from submission to publication in an academic journal, not counting rejections. From informal evidence, slowly rising publication delays have stabilized in the past few years as journal editors have fought hard to cut down on what have been perceived to be intolerable delays.

Researchers at the cutting edge cannot rely solely on journals to keep abreast of the frontiers of research. Prepublication through discussion papers or conference proceedings is now commonplace. Access to this informally disseminated research is often limited to a small number of readers. It relies on the good will of active researchers to disseminate their work. Since good will is in short supply, insider circles are common.

This time gap between informal distribution and formal publication can only fundamentally be resolved by reforming the quality control process. The inconvenience resulting from the delay can however be reduced by improving the efficiency of the informal communication system. This is the initial motivation behind the RePEc project. Its traditional emphasis has been on documents that have not gone through peer review channels. Thus RePEc is essentially a scholarly dissemination system on the Internet. It is independent of the quality review process.

2.2: Towards an Internet-based scholarly dissemination system

The most important type of material relevant to scholarly dissemination is the research paper. One way to organize this type of material has been demonstrated by the arXiv.org preprint archive, founded in 1991 by Paul Ginsparg of the Los Alamos National Laboratory with an initial subject area in high energy physics. Authors use that archive to upload papers, which remain stored there. ArXiv.org has now assembled around 170,000 papers, covering a broad subject range of mathematics, physics and computer science, but concentrating on the original subject area. An attempt has been made to emulate the arXiv.org system in economics with the "Economics Working Paper Archive" (EconWPA) based at Washington University in St. Louis. Its success has been limited. There are a number of potential reasons:

Economists do not issue preprints as individuals; rather, economics departments and research organizations issue working papers.

Economists use a wider variety of document formatting tools than physicists. This reduces the functionality of online archiving and makes it more difficult to construct a good archive.

Economists are generally not known for sophisticated computing practices, and as such they are more likely to encounter significant problems with uploading procedures.

There is considerable confusion as to the implications of networked pre-publication on a centralized, high-visibility system for subsequent publication in journals.

Economics research is not confined to university departments and research institutes. There are a number of government bodies--central banks, statistical institutes, and others--that contribute a significant amount of research in the field. These bodies, by virtue of their size, have more rigid organizational structures. This makes the coordination required for a central research paper dissemination more difficult.

An ideal system should combine the decentralized nature of the Web, the centralized nature of the arXiv.org archive, and a zero price to end users. To explain why the end-user access to the dissemination system should be free, it is useful to refer to the distinction between trade authors and esoteric authors, as done by Harnad (1995). Authors of academic documents are esoteric authors rather than trade authors. They do not expect payments for the written work; instead, they are chiefly interested in reaching an audience of other esoteric authors and, to a lesser extent, the public at large. Therefore the authors are interested in wide dissemination. If a tollgate to the dissemination system is set up, then the system as such falls short of an ideal one.

The way to implement such a collection is to create a network of archives, where each participating institution stores data about its publications. This network is open in the sense that persons and organizations can join by contributing data about their work. It is also open in the sense that user services can be created from it. This double openness promotes a positive feedback effect. The larger the collection's usage, the more effective it is as a dissemination tool, and thus more authors and their institutions join as participation is open. The larger the collection, the more useful it becomes for researchers. This leads to more usage.

Bringing a system to such a scale is a difficult challenge. Man is an animal of habit. Scholarly communication systems have evolved over time. Academic careers are directly dependent on the results of scholarly communication. Therefore, change in this area is slow because it involves important aspects of the lives of those who are the potential implementors of the change. A scholarly dissemination system on the Internet is more likely to succeed if it enhances current practice rather than replacing it. The distribution of informal research papers in the past has been based on institutions issuing working papers. These are circulated through exchange arrangements. RePEc is a way to organize this process on the Internet.

2.3: The architecture of RePEc

RePEc can be understood as a decentralized academic publishing system for the economics discipline. RePEc allows researchers' departments and research institutes to participate in a decentralized archival scheme which makes information about the documents that they publish accessible via the Internet. Individual researchers may also openly contribute, but they are encouraged to use EconWPA.

Each contributor needs to maintain their own collection of data using a set of standardized templates. Such a collection of templates is called an "archive". An archive operates on an anonymous ftp server or a Web server controlled by the archive provider. Each archive provider has total control over the contents of its archive. There is no need to transmit documents elsewhere. The archive management retains the liberty to post revisions or to withdraw a document.

2.3.1: An example archive

Let us look at an example. The archive of the OECD is at http://www.oecd.org/eco/RePEc/oed/. In that directory we find two files. The first is oedarch.rdf:

Template-Type: ReDIF-Archive 1.0
Handle: RePEc:oed
Name: OECD Economics Department
Maintainer-Email: eco.contact@oecd.org
Description: The working papers of the 
 Economics Department of the OECD
URL: http://www.oecd.org/eco/RePEc/oed

This file gives basic characteristics about the archive. It associates a handle with it, gives an email address for the maintainer, and most importantly, provides the URL where the archive is located. This archive file gives no indication about the contents of the archive. The contents list is in a second file, oedseri.rdf:

Template-type: ReDIF-Series 1.0
Name: OECD Economics Department working
  papers
Type: ReDIF-Paper
Provider-Name: OECD Economics Department
Provider-Homepage: 
 http://www.oecd.org/eco/eco/
Maintainer-Email: eco.contact@oecd.org
Handle: RePEc:oed:oecdec

This file lists the content as a series of papers. It associates some provider and maintainer data with the series, and it associates a handle with the series. The format that both files follow is called ReDIF. It is a purpose-built metadata format. See Krichel (2000) for the complete documentation of ReDIF.
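As a rough illustration of how a user service might read such a file, here is a minimal Python sketch of a ReDIF reader. It assumes only what the two examples above show: one "Field: value" pair per line, with lines that begin with whitespace continuing the previous value. The function name parse_redif is ours and is not part of any RePEc software; the complete rules are in the ReDIF documentation.

```python
def parse_redif(text):
    """Return the fields of one ReDIF template as a list of (name, value) pairs.

    A list is used rather than a dict because some fields, such as
    Author-Name, may occur more than once in a template.
    """
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and fields:
            # Assumed rule: leading whitespace marks a continuation line.
            name, value = fields[-1]
            fields[-1] = (name, (value + " " + line.strip()).strip())
        else:
            # Split at the first colon only, so URLs and handles stay intact.
            name, _, value = line.partition(":")
            fields.append((name.strip(), value.strip()))
    return fields


# The series file oedseri.rdf, copied from the text above.
SERIES = """Template-type: ReDIF-Series 1.0
Name: OECD Economics Department working
  papers
Type: ReDIF-Paper
Provider-Name: OECD Economics Department
Provider-Homepage: 
 http://www.oecd.org/eco/eco/
Maintainer-Email: eco.contact@oecd.org
Handle: RePEc:oed:oecdec
"""

# No field repeats in this particular file, so a dict is safe here.
series = dict(parse_redif(SERIES))
print(series["Name"])
print(series["Handle"])
```

Splitting at the first colon matters because both handles and URLs contain colons of their own.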

The documents themselves are also described in ReDIF. The location of the paper descriptions is found by appending the series part of the handle to the URL of the archive, i.e. at http://www.oecd.org/eco/RePEc/oed/oecdec. This directory contains ReDIF descriptions of documents. It may also contain the full text of documents. It is up to the archive to decide whether to store the full text of documents inside or outside the archive. If the document is available online--inside or outside the archive--a link may be provided to the place where the paper may be downloaded. Note that the document may not only be the full text of an academic paper, but it may also be an ancillary file, e.g. a dataset or a computer program.

Participation does not imply that the documents are freely available. Thus, a number of journals have also permitted their contents to be listed in RePEc. If the user's institution has made the requisite arrangements with publishers (e.g. JSTOR for back issues of Econometrica or Journal of Applied Econometrics), RePEc will contain links to directly access the documents.
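The lookup rule for the OECD example can be written down in a few lines of Python. This is a sketch under the assumption that a series handle always has the three-part form RePEc:archive:series, as in the example; the function name series_url is hypothetical.

```python
def series_url(archive_url, series_handle):
    """Build the URL of a series directory from the archive URL and series handle."""
    parts = series_handle.split(":")
    # Assumed handle shape: "RePEc", archive code, series code.
    if len(parts) != 3 or parts[0] != "RePEc":
        raise ValueError("not a series handle: " + series_handle)
    series_code = parts[2]
    return archive_url.rstrip("/") + "/" + series_code


# Values taken from the archive and series files shown above.
url = series_url("http://www.oecd.org/eco/RePEc/oed", "RePEc:oed:oecdec")
print(url)
```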

2.4: The documents available through RePEc

Over 160 archives in 25 countries currently participate in RePEc, some of them representing several institutions. Over 100 universities contribute their working papers, including U.S. institutions such as Berkeley, Boston College, Brown, Maryland, MIT, Iowa, Iowa State, Ohio State, UCLA, and Virginia. The RePEc collection also contains information on all NBER Working Papers, the CEPR Discussion Papers, the contents of the Fed in Print database of the US Federal Reserve, and complete paper series from the IMF, World Bank and OECD, as well as the contributions of many other research centers worldwide. Last, but not least, RePEc also includes the holdings of EconWPA. In total, at the time of writing in September 2001, over 50,000 items are downloadable.

The bibliographic templates describing each item currently provide for papers, articles, and software components. The article templates are used to fully describe published articles. They are currently in use by the Canadian Journal of Economics, Econometrica, the Federal Reserve Bulletin, IMF Staff Papers, the Journal of Applied Econometrics, and the RAND Journal of Economics. These are only a few of the participating journals. Participation does not imply that the articles are freely available.

The RePEc collection of metadata also contains links to several hundred "software components"--functions, procedures, or code fragments in the Stata, Mathematica, MATLAB, Octave, GAUSS, Ox, and RATS languages, as well as code in FORTRAN, C and Perl. The ability to catalog and describe software components affords users of these languages the ability to search for code applicable to their problem--even if it is written in a different language. Software archives that are restricted to one language, such as those maintained by individual software vendors or volunteers, do not share that breadth. Since many programs in high-level languages may be readily translated from, say, GAUSS to MATLAB, this breadth may be very welcome to the user.

3: The ReDIF metadata

From the material that we have covered in the previous section, we can draw a simple model of RePEc as

Many archives ===> One dataset ===> Many services

The term "RePEc" was initially an acronym; it stands for Research Papers in Economics. In fact the term should now be taken as a literal, because RePEc is about more than the description of resources. It is probably best to say that RePEc is a relational database about economics as a discipline.

One possible interpretation of the term "discipline" is given by Karlsson and Krichel (1999). They have come up with a model of the discipline, as consisting essentially of four elements arranged in a table:


	resource		collection

	person		institution

A few words may help to understand that table. A "resource" is essentially any output of academic activity: a research document, a dataset, a computer program, or anything else that an academic person would claim authorship for. A "collection" is a logical grouping of resources. For example the act of peer review may be represented by a resource being included in a collection. A "person" is a physical person or a corporate body who acts as a physical person in the context of RePEc.

These data collectively form a relational database that not only describes papers, but also the authors who write them, the institutions where the authors work, and so on. All this data is encoded in the ReDIF metadata format. We illustrate this in Subsection 3.2 and Subsection 3.3 for the institutional and the personal data, respectively.
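To make the relational reading of the four-element table concrete, the following Python sketch models each element as a record that refers to the others by handle. The class and attribute names are illustrative assumptions, not ReDIF field names; the handles for the paper, person and institution are those used in the examples of the next subsection, and the series handle RePEc:sur:surrec is inferred from the paper handle.

```python
from dataclasses import dataclass, field


@dataclass
class Institution:
    handle: str
    name: str


@dataclass
class Person:
    handle: str
    name: str
    workplaces: list = field(default_factory=list)  # institution handles


@dataclass
class Resource:
    handle: str
    title: str
    authors: list = field(default_factory=list)  # person handles


@dataclass
class Collection:
    handle: str
    name: str
    members: list = field(default_factory=list)  # resource handles


surrey = Institution("RePEc:edi:desuruk", "University of Surrey")
krichel = Person("RePEc:per:1965-06-05:thomas_krichel", "Thomas Krichel",
                 workplaces=[surrey.handle])
paper = Resource("RePEc:sur:surrec:9601",
                 "Dynamic Aspect of Growth and Fiscal Policy",
                 authors=[krichel.handle])
# Series handle inferred from the paper handle; the name is a placeholder.
series = Collection("RePEc:sur:surrec", "example series",
                    members=[paper.handle])
```

Because every link is a handle rather than an embedded copy, a person's address or affiliation is stored once and can change without touching any paper record.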

3.1: A closer look at the contents

To understand the basics of ReDIF it is best to start with an example. Here is a--carefully selected--piece of ReDIF data at ftp://www.econ.surrey.ac.uk/pub/RePEc/sur/surrec/surrec9601.rdf: (We suppress the Abstract: field to conserve space.)

Template-Type: ReDIF-Paper 1.0
Title: Dynamic Aspect of Growth and Fiscal 
 Policy
Author-Name: Thomas Krichel 
Author-Person: 
 RePEc:per:1965-06-05:thomas_krichel
Author-Email: T.Krichel@surrey.ac.uk 
Author-Name: Paul Levine 
Author-Email: P.Levine@surrey.ac.uk 
Author-WorkPlace-Name: University of Surrey
Classification-JEL: C61; E21; E23; E62; O41 
File-URL: ftp://www.econ.surrey.ac.uk/pub/
 RePEc/sur/surrec/surrec9601.pdf 
File-Format: application/pdf
Creation-Date: 199603 
Revision-Date: 199711 
Handle: RePEc:sur:surrec:9601

When we look at this record, the ReDIF data resembles a standard bibliographical format, with authors, title, etc. The only thing that appears a bit mysterious here is the "Author-Person" field. This field quotes a handle that is known to RePEc. This handle leads to a record maintained at ftp://netec.mcc.ac.uk/pub/RePEc/per/pers/RePEc_per_1965-06-05_THOMAS_KRICHEL.rdf: (We leave out a few fields to conserve space.)

Template-Type: ReDIF-Person 1.0 
Name-Full: KRICHEL, THOMAS 
Name-First: THOMAS 
Name-Last: KRICHEL 
Postal: 1 Martyr Court
 10 Martyr Road 
 Guildford GU1 4LF 
 England
Email: t.krichel@surrey.ac.uk
Homepage: http://openlib.org/home/krichel
Workplace-Institution: RePEc:edi:desuruk
Author-Paper: RePEc:sur:surrec:9801
Author-Paper: RePEc:sur:surrec:9702
Author-Paper: RePEc:sur:surrec:9601
Author-Paper: RePEc:rpc:rdfdoc:concepts
Author-Paper: RePEc:rpc:rdfdoc:ReDIF
Handle: RePEc:per:1965-06-05:THOMAS_KRICHEL

In this record, we have the handles of documents that the person has written. This record will allow user services to list all papers by a given author. This is obviously useful when we want to find papers that one particular author has written. It is also useful to have a central record of the person's contact details. This eliminates the need to update the relevant data elements on every document record. In fact the record on the paper template may be considered as the historical record that is valid at the time when the paper was written, but the address in the person template is the one that is currently valid.

In the person template, we find another RePEc identifier in the "Workplace-Institution" field. This points to another record at ftp://crefe.dse.uqam.ca/pub/RePEc/edi/inst/desuruk.rdf that describes the institution:

Template-Type: ReDIF-Institution 1.0 
Primary-Name: University of Surrey
Primary-Location: Guildford
Secondary-Name: Department of Economics
Secondary-Phone: (01483) 259380
Secondary-Email: economics@surrey.ac.uk
Secondary-Fax: (01483) 259548
Secondary-Postal: Guildford, Surrey GU2 5XH
Secondary-Homepage: 
 http://www.econ.surrey.ac.uk/
Handle: RePEc:edi:desuruk

It would take us too far here to discuss this record in more detail. It is probably more interesting to know where these records come from.
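Before turning to that question, note that the chain of handles running through the three records above (paper, person, institution) can be followed mechanically. The Python sketch below does so for records held as dictionaries of field name to list of values. It assumes that handles are compared case-insensitively, since the paper quotes the person handle in lower case while the person record spells it in upper case; the text does not state this rule, and the function names are ours.

```python
def index_by_handle(records):
    """Map each lower-cased handle to its record."""
    return {rec["Handle"][0].lower(): rec for rec in records}


def author_workplaces(paper, index):
    """Return (person name, institution name) pairs for registered authors."""
    pairs = []
    for person_handle in paper.get("Author-Person", []):
        person = index.get(person_handle.lower())
        if person is None:
            continue  # author handle not known to this dataset
        for inst_handle in person.get("Workplace-Institution", []):
            inst = index.get(inst_handle.lower())
            if inst is not None:
                pairs.append((person["Name-Full"][0], inst["Primary-Name"][0]))
    return pairs


# Abridged versions of the three records shown above.
paper = {
    "Handle": ["RePEc:sur:surrec:9601"],
    "Author-Name": ["Thomas Krichel", "Paul Levine"],
    "Author-Person": ["RePEc:per:1965-06-05:thomas_krichel"],
}
person = {
    "Handle": ["RePEc:per:1965-06-05:THOMAS_KRICHEL"],
    "Name-Full": ["KRICHEL, THOMAS"],
    "Workplace-Institution": ["RePEc:edi:desuruk"],
}
institution = {
    "Handle": ["RePEc:edi:desuruk"],
    "Primary-Name": ["University of Surrey"],
}

index = index_by_handle([paper, person, institution])
print(author_workplaces(paper, index))
```

Only the first author is resolved here, because the paper record carries an Author-Person handle for him alone.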

3.2: Institutional registration

The registration of institutions is accomplished through the EDIRC project. The acronym stands for "Economics Departments, Institutions and Research Centers". This dataset has been compiled by Christian Zimmermann, an Associate Professor of Economics at Université du Québec à Montréal, on his own account, as a public service to the economics profession. The initial intention was to compile a directory with all economics departments that have a web presence. Since many departments now have a web presence, a large number are registered, about 6,000 of them at the time of this writing. All these records are included in RePEc. For all institutions, data on their homepage is available, as well as postal and telephone information. For some, there is even data on their main area of work. Thus it is possible to find a list of institutions where--for example--a lot of work in labor economics is being done. At the moment, EDIRC is mainly linked to the rest of the RePEc data through the HoPEc personal registration service. Other links are possible, but are rarely used.

3.3: Personal registration

HoPEc has a different organization from EDIRC. It is impossible for a single academic to register all persons who are active in economics. One possible approach would be to ask archives to register people who work at their institution. This will make archive maintainers' work more complicated, but the overall maintenance effort will be smaller once all authors are registered. However, authors move between archives, and many have work that appears in different archives. To date, there is no satisfactory way to deal with moving authors. For this reason, the author registration is carried out using a centralized system.

A person who is registered with HoPEc is identified by a string that is usually close to the person's name and by a date that is significant to the registrant. HoPEc suggests the birth date but any other date will do as long as the person can remember it. When registrants work with the service, they first supply some personal information. The data that is requested is mainly the name, the URL of the registrant's homepage, and the email address. Registrants are free to enter data about their academic interests--using the Journal of Economic Literature Classification Scheme--and the EDIRC handle of their primary affiliation.
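The person handle seen in Section 3.1, RePEc:per:1965-06-05:thomas_krichel, combines exactly these two ingredients. The Python sketch below shows one way such a handle could be composed. How HoPEc actually derives the name string is not described here, so the lower-casing and underscore rule is purely an assumption chosen to reproduce that one example.

```python
import datetime
import re


def person_handle(name, date):
    """Compose a person handle from a name and a date the registrant remembers."""
    # Assumed rule: lower-case the name and join its words with underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")
    return "RePEc:per:" + date.isoformat() + ":" + slug


handle = person_handle("Thomas Krichel", datetime.date(1965, 6, 5))
print(handle)
```

The date makes the handle unlikely to collide with that of another registrant who has the same name.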

When the registrant has entered this data, the second step is to create associations between the record of the registrant and the document data that is contained in RePEc. The most common association is the authorship of a paper. However, other associations are possible, for example the editorship of a series. The registration service then looks up the name of the registrant in the RePEc document database. The registrant can then decide which potential associations are relevant. The authentication methods are weak. HoPEc relies on honesty.
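The lookup step can be pictured with a small Python sketch. It proposes candidate papers by comparing the registrant's name with the author names on each document record, and leaves the final decision to the registrant, as described above. The matching rule (same set of name words, ignoring case, order and punctuation) is an assumption for illustration; the text does not say how HoPEc matches names. The second paper and its author are hypothetical.

```python
import re


def name_key(name):
    """Reduce a name to its sorted, lower-cased words."""
    return tuple(sorted(re.findall(r"\w+", name.lower())))


def candidate_papers(registrant_name, papers):
    """Return handles of papers with an author whose name matches the registrant.

    papers is a list of (handle, list of author names) pairs.
    """
    wanted = name_key(registrant_name)
    return [handle for handle, authors in papers
            if any(name_key(author) == wanted for author in authors)]


papers = [
    ("RePEc:sur:surrec:9601", ["Thomas Krichel", "Paul Levine"]),
    ("RePEc:xxx:hypoth:0001", ["Jane Doe"]),  # hypothetical record
]

# "KRICHEL, THOMAS" and "Thomas Krichel" reduce to the same key.
proposals = candidate_papers("KRICHEL, THOMAS", papers)
print(proposals)
```

A rule this loose will also propose papers by namesakes, which is one reason the registrant has to confirm each association.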

There are several significant problems that a service like HoPEc faces. First, since there is no historical precedent for such a service, it is not easy to communicate the raison d'être of the service to a potential registrant. Second, some people think that they need to register in order to use RePEc services. While this delivers valuable information about who is interested in using RePEc services--or more precisely who is too dumb to grasp that these services do not require registration--it clutters the database with records of limited usefulness. Last but by no means least, there are all kinds of privacy issues involved in the composition of such a dataset.

To summarize, HoPEc provides information about persons' identity, affiliation and research interests and links these data with resource descriptions in RePEc. This makes it possible to identify persons and update their metadata in a timely and cost-efficient way. These data could also fruitfully be employed for other purposes, such as maintaining membership data for scholarly societies or for lists of conference participants. It is hoped that the HoPEc data will be used as a shared pool of common personal data. After 15 months online, the registration service has been quite a success. Around 10% of all documents in RePEc now have at least one registered author.

4: Providing incentives for the collection of data

There are many fine plans for building scholarly digital libraries, but only arXiv and RePEc are populated with content that is worth looking at. The collection of content remains a crucial social problem that no computer technique can solve.

RePEc has been particularly successful in attracting content; in fact it has been extraordinarily successful given the small total financial subsidy that has been spent on it. At the outset in 1992 there were no electronic papers available. The collection started with making available--via gopher and WAIS--a set of acquisitions data for economics working papers collected by Fethy Mili, the head of the documentation center in the economics department of the Université de Montréal. The collection of electronic papers was pump-primed by collecting such papers directly from friends and net-acquaintances and through "snarfing" remote papers, i.e. collecting metadata records for them in the format used by the project. Very few institutions have submitted metadata records for conversion and storage by the project. The US Federal Reserve Banks are a notable example. In a second phase, other institutions have opened RePEc archives. In a third phase--which has not started yet--authors will be able to submit papers directly to RePEc through the HoPEc service. To date, authors have not been much involved in collecting data about their own papers. To get authors involved requires good incentives. An important component of that is to demonstrate to authors that papers submitted to RePEc have good visibility. This demonstration must be conducted with figures at hand; after all, we are talking to scientists here, many of whom rely heavily on statistics to corroborate economic theories.

A pioneering effort in the collection of statistics through log data is LogEc. This service is the work of Sune Karlsson. LogEc is based on a script that analyses web log data generated by RePEc user services. The script collects data on accesses to paper abstracts and on downloads of full text. For the latter, redirections are used. Each service customizes the log analysis script. When the script runs, it collects data from the logs and stores it on the site of the service. LogEc then picks up the files from the different services to generate cross-service usage logs. It can then derive statistics on the most frequently accessed series of documents, be they working paper series or journals. More importantly, it also evaluates authors that have registered papers with the HoPEc service. For these authors, we can present all their log data in one place. Even more importantly, we can rank authors according to the number of hits they receive. This is a very effective strategy to attract the attention of authors. A recent email from Christian Zimmermann to HoPEc registered authors, which included ranking data for each author gathered from the LogEc service, resulted in record numbers of registration updates, as authors scrambled to add their latest papers that have appeared in RePEc to the list of papers that they have written.

There is a general lesson here. To really get academic self-description going, one must exploit the way the academic "animal" thinks and acts. While academics are very sophisticated in the development of their "products", i.e., the papers, books, etc., that they produce, their behavior in evaluation is quite simple. They are very anxious for signs that people have been reading their papers, or that some reviewers found them worthy of inclusion in a peer-reviewing channel. A successful academic digital library must exploit this behavior for efficient collection of content.
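The two aggregation steps, merging per-service counts and ranking registered authors, can be sketched in Python as follows. This is not the LogEc script; it only illustrates the idea. The hit counts are invented for the example, the second author handle is hypothetical, and the data shapes (a dictionary of counts per service, a dictionary of paper handles per author) are assumptions.

```python
from collections import Counter


def merge_service_logs(per_service_counts):
    """Add up the per-paper hit counts reported by each user service."""
    total = Counter()
    for counts in per_service_counts.values():
        total.update(counts)  # Counter.update adds counts, it does not replace
    return total


def rank_authors(paper_hits, papers_by_author):
    """Rank authors by the total hits on the papers they have claimed."""
    totals = {author: sum(paper_hits.get(p, 0) for p in papers)
              for author, papers in papers_by_author.items()}
    # Highest total first; ties broken by author handle for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Invented counts for two of the user services named in this paper.
per_service = {
    "IDEAS": {"RePEc:sur:surrec:9601": 3, "RePEc:sur:surrec:9702": 1},
    "EconPapers": {"RePEc:sur:surrec:9601": 2},
}
papers_by_author = {
    "RePEc:per:1965-06-05:thomas_krichel":
        ["RePEc:sur:surrec:9601", "RePEc:sur:surrec:9702"],
    "RePEc:per:hypothetical:author_b": ["RePEc:sur:surrec:9702"],
}

hits = merge_service_logs(per_service)
ranking = rank_authors(hits, papers_by_author)
print(ranking)
```

The ranking step depends on the author-paper associations made through HoPEc, which is why it gives authors a direct reason to keep their list of papers up to date.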

5: The Socionet user services

The Socionet user services are designed for Russian language users. Socionet is based on the full RePEc database (the same as IDEAS, EconPapers and other RePEc mirrors) and also includes a number of Russian language series (located in the archive RePEc:rus). The services have a special focus on visualizing the content of new additions to RePEc. Socionet helps users to cope with the large amounts of new data that arrive in the RePEc dataset on a daily basis, in the following ways:

it builds a table of series with new additions and modified records for the last 24 hours, and also shows a graph of the number of new additions over the last 30 days, with links to the tables of new additions for the corresponding days;
it puts red marks into all tables of contents (for archives, series, and documents) to show items changed today or during the last thirty days;
a user can set up a personal information robot (i-robot) to monitor the incoming data automatically.

The last service (i-robot) customizes the database content by excluding archives and series that are not relevant to the user's interests, and filters the incoming flow of new additions according to the user's profile of interests. The i-robot creates reports at a specified interval and delivers them by e-mail and/or as static web pages.
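The two filtering steps of the i-robot can be illustrated with a short Python sketch. It is not the Socionet implementation. It assumes that a user profile consists of a list of excluded archive or series handles and a set of JEL codes of interest; the actual form of the profile is not described here. Apart from the Surrey paper and its JEL codes, the records are hypothetical.

```python
def filter_additions(new_records, excluded_handles, jel_interests):
    """Return handles of new records that pass the user's profile.

    new_records is a list of dicts with a "handle" and a list of "jel" codes.
    """
    kept = []
    for record in new_records:
        handle = record["handle"]
        # Step 1: drop everything inside an excluded archive or series.
        if any(handle == ex or handle.startswith(ex + ":")
               for ex in excluded_handles):
            continue
        # Step 2: keep only records sharing a JEL code with the profile.
        if jel_interests and not set(record["jel"]) & set(jel_interests):
            continue
        kept.append(handle)
    return kept


new_records = [
    {"handle": "RePEc:sur:surrec:9601",
     "jel": ["C61", "E21", "E23", "E62", "O41"]},
    {"handle": "RePEc:oed:oecdec:hypothetical1", "jel": ["O41"]},
    {"handle": "RePEc:rus:hypoth:1", "jel": ["J21"]},
]

report = filter_additions(new_records,
                          excluded_handles=["RePEc:oed"],
                          jel_interests=["O41"])
print(report)
```

A scheduler would run such a filter at the interval chosen by the user and turn the result into an e-mail or a static web page.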

Additionally Socionet includes:

Open Online Archives for publishing research materials in electronic form;

a personal manager of ReDIF archives and series for creating and managing (on a local computer under the MS Windows operating system) collections of electronic materials described in the ReDIF format.

The Socionet services are built to implement the idea of developing the general RePEc concept from its initial state as a "global electronic catalog" into an integrated digital information environment for the community of social scientists. Its construction, the Internet technologies that it uses and its user services allow easy and flexible modifications of the database structure in a decentralized manner, according to the natural development of community needs. As a result, as of late 2001 the Socionet database includes materials from 6 social science disciplines (economics, sociology, political sciences, demography, law, and psychology). Compared with common RePEc-based datasets (e.g. at IDEAS), Socionet also has some additional template types. Currently there is a "scheme" type; "news" and "request" types will follow soon. These additional types reflect the general ability of this system to integrate different types of information activity of community members into a common database.

The second set of ideas that drives Socionet service development can be called the "unlocking of information circulation". It means that the Socionet service should help users not only to access the publications they need for their research; it should also assist in putting their research results back into the same common information environment. The current implementation of these ideas includes the Open Online Archives running at the Socionet site and, as an alternative system, the Personal Manager of ReDIF archives and series, socioARM. If users publish their papers through these two systems, the publications will appear in the Socionet database. Since these two systems are well integrated with other parts of Socionet--the personal manager socioARM will be fully integrated in its next release--they allow automated tuning and synchronization between personal publishing tools and the current configuration of the Socionet database structure and available services.

The third idea behind the Socionet development is a "deeper utilization of information materials". Researchers and educators should have more ability to use or reuse materials from the Socionet database in different ways. For example, they should be able to build thematic sets or sections of related publications from the Socionet database. These secondary series and collections can be used as reference materials for students, or may be an expression of peer review.

6: Conclusions

RePEc is a complete self-documentation system for the discipline of economics. At the moment, it is almost entirely powered by volunteers. This is a considerable organizational achievement. Unfortunately, since the dataset is in constant need of attention, there is no end to the work for the volunteers. We are soon reaching ten years of work on building this database. It is hoped that generations of researchers will take care of it.

The work of the volunteers of RePEc has an impact around the globe. Its electronic publishing activity is particularly important to the middle and low income countries, where most institutions are too poor to afford access to conventional scholarly journals. Many papers that are available through RePEc can be downloaded at no charge. Since the whole of the RePEc dataset is free, it is hoped that the project will promote the ideal of a free flow of information. RePEc fights the division of the world between the informationally rich and the informationally poor.

The work discussed here has received financial support from the Joint Information Systems Committee of the UK Higher Education Funding Councils through its Electronic Library Programme, and from the Education, Media and Culture Program of the Ford Foundation.