An author-driven identification system

by Thomas Krichel
on behalf of the Open Library Society

version of 2007-07-14

This proposal is a response to the “request for proposals: innovative uses of CrossRef metadata” as circulated by Amy Brand to the LibLicence Mailing list on 2007-06-13.

The proposal is structured as follows:

Sections 1 to 3 are the key sections. In section 1, we summarize the project. In section 2, we describe the motivation behind its creation and long-term vision and goals. In section 3, we make some suggestions for project management but we remain open to new ideas and approaches.

Sections 4 to 8 are essentially background reading aimed at those who want to know more details. Section 4 gives historical background. Section 5 introduces the ACIS software we use. Section 6 discusses project extensions that are already implemented in ACIS. Section 7 provides links to other works by Thomas Krichel. Section 8 discusses legal issues. Section 9 concludes with final remarks.

We are grateful for comments by Joanna P. Davies, William L. Goffe, David Goodman, and John J. Regazzi.

1. Summary

We wish to use the CrossRef data to build an author registration service for academic authors. The service collects data from authors about themselves and their relationship to items described in the CrossRef metadata. The service provides an identifier for each author, along with author-verified affiliation and publication information.

When authors register with the service, they perform two steps. First, they furnish the service with personal information such as name, affiliation, personal web page, and an email address. Second, authors search the CrossRef metadata for records describing documents they have authored. When they find such documents, they inform the service. As a result, the service creates author profiles that link the authors' information with their publications across all platforms. To illustrate, here is a rough example of an author record:

`Name:`	`Thomas Krichel`
`Homepage:`	`http://openlib.org/home/krichel`
`Affiliation:`	`Long Island University`
`Email:`	`krichel@openlib.org`
`Author:`	`doi:10.1016/j.ipm.2006.09.017`
`Author:`	`doi:10.1016/0165-1765(92)90226-O`
`Handle:`	`1965-06-05_thomas_krichel`
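As a minimal sketch, a record of this shape could be held in memory as follows. The class name `AuthorRecord`, its attribute names, and the `parse_record` helper are illustrative assumptions; they do not describe the storage format actually used by ACIS or by CrossRef.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorRecord:
    """Hypothetical in-memory form of an author record."""
    handle: str = ""
    name: str = ""
    homepage: str = ""
    affiliation: str = ""
    email: str = ""  # not made public by the service
    authored: list = field(default_factory=list)  # claimed DOIs


# Assumed mapping from the labels in the example record to attribute names.
_SINGLE_FIELDS = {
    "Name": "name",
    "Homepage": "homepage",
    "Affiliation": "affiliation",
    "Email": "email",
    "Handle": "handle",
}


def parse_record(text: str) -> AuthorRecord:
    """Parse 'Label: value' lines; 'Author' lines may repeat."""
    record = AuthorRecord()
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so URLs and DOIs stay intact.
        label, value = line.split(":", 1)
        label, value = label.strip(), value.strip()
        if label == "Author":
            record.authored.append(value)
        elif label in _SINGLE_FIELDS:
            setattr(record, _SINGLE_FIELDS[label], value)
    return record


EXAMPLE = """\
Name: Thomas Krichel
Homepage: http://openlib.org/home/krichel
Affiliation: Long Island University
Email: krichel@openlib.org
Author: doi:10.1016/j.ipm.2006.09.017
Author: doi:10.1016/0165-1765(92)90226-O
Handle: 1965-06-05_thomas_krichel
"""

record = parse_record(EXAMPLE)
```

The repeated `Author` lines become a list of claimed DOIs, which is what links one author to publications across publisher platforms.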

2. Motivation

Author names are a notoriously difficult field of authority control. Although bibliographic data contain author names, a name does not necessarily identify an author uniquely. A number of factors contribute to the variety of ways author names appear in standard bibliographic data. Several authors may share the same name, a name can be written in various forms, and the use of initials, accents or capitalization may be inconsistent. Additional problems are introduced by transliteration from non-Roman to Roman alphabets.

In an academic environment that has become increasingly influenced by metric-based evaluation of academic performance and institutional impact, publishers can create value by providing author identification. To add value to their services, publishers and vendors could make individual efforts to provide author identification tools. However, these attempts would result in fragmented sets of authorship data.

It is desirable to use automated methods, such as those discussed at the CrossRef author id meeting, to alleviate author identification problems. But with current and foreseeable technology, we cannot rely on computer recognition alone to identify authors. Only a human will be able to tell who, say, José Manuel Barrueco Cruz is. Therefore it is best to get authors themselves involved in identifying their own works. As they do this, they may also give us other useful information, such as affiliation data and homepage URLs. They could also organize their citations, tell us how different documents relate to each other, and so on.

In other words, we think that author identification is a matter for authors. The academic community has an important role in ensuring that the author records are correct. Publishers and academic employers help the process by making extensive use of author identification records.

To maximize incentives for authors to participate in this system, the personal records created by the system have to be widely disseminated. Therefore access to these records should be free. For privacy, the email addresses need to be protected.

Some of the incentives for publishers to contribute metadata to, and participate in, the author registration service are:

Publishers can use personal data to link from an author name to the author's home page. Authors will appreciate this feature as an additional way to highlight their accomplishments.
Publishers can enrich their portals with links to other papers by the same author, using DOIs as a base for the linking process. If a publisher provides links to all papers by an author, the author can link to that page, with bibliographic information provided by the publishers.
Publishers can suggest that authors use their identifiers when they submit papers. It is possible to build systems in which the author is prompted to type in a name fragment, and the submission system asks “Is that someone we know?” and proposes one or more matching author records from the personal data. This will make the collection of personal data on submission much easier.
Publishers can use personal data to build systems that help with referee selection.
Publishers can use the data to send authors periodic reports about the usage of their papers through the publishers' portals. This is very important information for authors, and a publisher using the service will be able to furnish it.

The main incentive for authors to contribute is that the service provides them with a single point with the DOI data for all their formally published papers. They no longer have to collect that information on their own in various ways. In addition, any link that publishers create to the homepage of an author increases the author's visibility on the web.

From our experience, the uptake by authors will be slow initially, but very good in the long run. Our experience is based on the RePEc Author Service, which was created by Thomas Krichel. It implements a service as proposed here for the economics discipline alone, and it has excellent author uptake. This is documented in Section 4.

3. Elements of project management

There are really two aspects of this proposal. First we want to build a concrete prototype that will allow stakeholders to test an implementation without actually creating (or messing up) any serious records. This is a short-run project. Second we want to create a free system that allows academic authors to register claims to scholarly works. This is the long-run project.

3.1 Short-term project

We will build a prototype of the service using a subset of recent CrossRef metadata.

The software is written and tested, and we have a small server machine available. Thomas Krichel is on sabbatical leave until September 2008. Therefore a basic testing version of the service can be created by Thomas Krichel working on his own. We estimate two months for this work. The testing version should contain about 1 million records.

This testing version will then be made available to publishers for comments on the look-and-feel of the interface. We are keen for the publishing industry to be comfortable with the site because publishers play a key role in encouraging the authors to sign up to the system.

The testing service will remain open after the launch of the real service because it is useful to test new features.

Once a definitive version of the test service is in operation, we will build the definitive version of the service. This depends on having a server available that can take the load of the full dataset.

3.2 Long-term project

This proposal will implement a prototype of the service. How that service will run, under what conditions, and with which partners, will be discussed with the stakeholders. We hope for an effort that will be mutually beneficial to all.

In principle, this service has to be kept alive in perpetuity. The output cannot be sold. The question of whether it is possible to raise enough resources remains. However, the historical record shows that the principal applicant, Thomas Krichel, has successfully built large academic digital libraries and services that continue to run without external subsidy.

The main social issue to be addressed is the uptake by the academic community. Although we predict a slow uptake of the service by authors initially, we do not view it as a problem but rather as an expected transition period. In any case, we need to build a small team, mainly from the library community, that will respond to enquiries from authors. More registered authors means more enquiries. We will have to make sure that the capacity to respond to enquiries grows with the number of authors registered.

4. Background and history

The present proposal has a rich history through Thomas Krichel's involvement with the RePEc digital library.

The background of RePEc is the effort made by Thomas Krichel to create an index to working papers in economics. This work goes back to 1993. Since 1994, José Manuel Barrueco Cruz has been Thomas Krichel's partner in this effort. RePEc was formally launched in 1997. At the time of writing, more than 720 providers contribute data to RePEc. Among the larger providers are commercial publishers such as Blackwell, Elsevier, Springer, and Taylor and Francis, to name but a few. Participation in RePEc is free for all. The total stock of document data is close to 500,000 items. The whole operation runs without any external subsidies, with the efforts of its participants only.

In 1998, Thomas Krichel conceived the general idea of creating an author registration service for the RePEc dataset about academic economics. Markus J.R. Klink was the first programmer on the project. In 2000, a paper on the service was presented at the European Conference on Digital Libraries. Since 2000, the service has been known as the RePEc Author Service (RAS).

RAS uses the EDIRC dataset on economics institutions that Christian Zimmermann started to compile in 1995. At this time, there are more than 10,000 institutions registered in EDIRC.

Experience with this initial version of the RePEc Author Service revealed over time that the system had to be rewritten. Funds from the Open Society Institute (OSI) allowed the development of ACIS, general-purpose software that implements author registration services.

At the time of writing, over 18,000 economics authors are registered with the RePEc Author Service. In total, these authors have claimed authorship of close to 200,000 documents in the RePEc database. Top authors, who have written a lot of papers, appear to dominate registrations. A list of the top 1000 economists shows that most of the important economists are registered.

5. The ACIS software

Software that implements author registration services as proposed here already exists. It is called the Academic Contribution Information System (ACIS). It was conceived by Thomas Krichel and written by Ivan V. Kurmanov. The development was supported by the Open Society Institute.

The ACIS software uses name variations profiles. For each author, the name variations profile contains variations of the author's name. The profile is created when the author first registers with the service. ACIS makes some suggestions to the author, but the author can revise the suggestions. For example, the author may include a maiden name.
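To make the idea of a name variations profile concrete, here is a rough sketch of how initial suggestions might be generated. It is not the algorithm ACIS uses; the function names `strip_accents` and `suggest_variations` and the particular set of forms are assumptions chosen for illustration, and the sketch assumes a non-empty given name.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, so that an accented and an unaccented spelling match."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def suggest_variations(given: str, family: str) -> list:
    """Return candidate spellings of a name, in a stable order, without duplicates."""
    given_parts = given.split()
    initials = " ".join(part[0] + "." for part in given_parts)
    first_initial = given_parts[0][0] + "."
    candidates = [
        f"{given} {family}",
        f"{family}, {given}",
        f"{given_parts[0]} {family}",
        f"{initials} {family}",
        f"{family}, {initials}",
        f"{first_initial} {family}",
    ]
    # Add accent-free forms, since accents are used inconsistently in metadata.
    candidates += [strip_accents(c) for c in candidates]
    seen, result = set(), []
    for candidate in candidates:
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


profile = suggest_variations("José Manuel", "Barrueco Cruz")
```

The author would then edit the resulting list, for example by deleting forms never used in print or by adding a former name that no rule could guess.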

Authors can also claim documents that do not contain any variations of their name. For example, an author may wish to claim that she was one of the "et alii" that were omitted in the attribution. For that to happen, authors have to know the DOI of the paper that they want to claim.

At the heart of ACIS runs ACPU, the automated contribution update. The idea is that an author does not have to log on to an ACIS installation to claim each new piece of work. Instead, ACIS periodically matches the name variations against the document data and submits a list of potential new documents to the author for verification.
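The periodic matching step can be sketched roughly as below. This is an illustration under stated assumptions, not the ACIS implementation: the function `propose_new_documents`, the exact-match rule after normalization, and the made-up identifiers under the `10.9999` prefix are all hypothetical.

```python
def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so comparisons are less brittle."""
    return " ".join(name.lower().split())


def propose_new_documents(variations, claimed, refused, records):
    """Return identifiers of unseen records whose author list matches a variation.

    variations: name variations from the author's profile
    claimed, refused: identifiers the author has already accepted or rejected
    records: iterable of (identifier, list_of_author_names) pairs
    """
    wanted = {normalize(v) for v in variations}
    already_seen = set(claimed) | set(refused)
    proposals = []
    for identifier, author_names in records:
        if identifier in already_seen:
            continue  # never ask the author twice about the same document
        if any(normalize(n) in wanted for n in author_names):
            proposals.append(identifier)
    return proposals


# Hypothetical batch of metadata records; the identifiers are made up.
new_records = [
    ("doi:10.9999/example.0", ["Thomas Krichel", "A. N. Other"]),
    ("doi:10.9999/example.1", ["T. Krichel"]),
    ("doi:10.9999/example.2", ["Someone Else"]),
]

suggested = propose_new_documents(
    variations=["Thomas Krichel", "T. Krichel"],
    claimed=["doi:10.9999/example.0"],
    refused=[],
    records=new_records,
)
```

Keeping a list of refused identifiers alongside the claimed ones is a design choice of this sketch: it prevents a namesake's papers from being proposed again at every run.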

There is a test service of ACIS, at http://test.acis.openlib.org. Readers of the proposal are welcome to try it out. The data used comes from the RePEc economics database, so it is best to imagine that you are a famous economist.

6. Potential extensions

ACIS implements not only author registration services in the strict sense, but also some related services. We mention these here because these extensions are readily available. They are already coded in ACIS.

One such related service deals with citation data. ACIS not only allows authors to claim works, but also allows them to organize citations to these works. We understand that some citation data is available in the CrossRef metadata.

A second related service concerns the deposit of documents in document submission services, such as those available through publishers' web sites. At that stage, a submitting author has to enter the details of all authors of the paper. ACIS can be configured to provide an author lookup service. When a submitter adds data about an author, the publisher's web site can query an ACIS installation for known authors, and when a known author is found, the publisher's web site can inform the submitter. Once the submitter has found the correct author, all the information about that author can be used in the publisher's system and no further input is required.
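The lookup step can be pictured with a small sketch. It does not reproduce any ACIS interface: the function `lookup_authors`, the in-memory `directory`, and the second directory entry are assumptions made for illustration, and a real deployment would query a running installation rather than a dictionary.

```python
def lookup_authors(fragment, directory, limit=5):
    """Return up to `limit` (handle, name) pairs whose name contains the fragment."""
    needle = fragment.strip().lower()
    if not needle:
        return []
    matches = [
        (handle, name)
        for handle, name in sorted(directory.items())
        if needle in name.lower()
    ]
    return matches[:limit]


# Hypothetical directory mapping author handles to display names.
directory = {
    "1965-06-05_thomas_krichel": "Thomas Krichel",
    "0000-00-00_example_author": "Example Author",
}

candidates = lookup_authors("krich", directory)
```

A submission form would show the candidates to the submitter, who confirms the right person; the handle then stands in for the details that would otherwise be typed by hand.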

A third service is that authors can create links between documents in the system. For example, an author can inform us that one of her papers is an extension of another paper.

7. Links to other developments

Thomas Krichel has been involved in a set of other projects that leverage decentralized efforts to build large-scale academic digital libraries and collections with no external subsidies. The main contributions are:

The RePEc digital library for economics is the largest distributed academic digital library in the world. It was conceived and started by Thomas Krichel.
NEP: New Economics Papers is the largest human-mediated current awareness system for academic papers in the world. Thomas Krichel conceived the system and maintains its technical infrastructure.
The Academic Metadata Format is a simple metadata format that is used in ACIS.

8. Legal issues

Some innovative ideas can be killed by worries about copyright and privacy. Nevertheless, let us briefly look at some potential concerns.

The service does not directly expose any of the abstracting data that is protected by CrossRef. It makes use of author name data, DOI, and basic citation data such as journal name and year, which will allow an author to identify the article quickly, without following the DOI link. This data will not be accessible for further use. It will only be searchable. Therefore the running of the service poses no threat of the proprietary data of CrossRef leaking to third parties.

By default, email addresses are not made publicly available. But the rest of the output metadata is available to all. It can be stored as files on an FTP, public rsync, or web server, or made public using OAI-PMH. Initially the service will only provide an FTP service. Hopefully other parties will provide other interfaces.

During the eight years that the RePEc Author Service has been running, making author data available has never raised a privacy concern. While there may be some theoretical legal problems with collecting such data, in practice we have never had an issue. Authors are very happy with the service, since it indirectly promotes their work.

9. Conclusions

In the age of full-text searching, improving basic abstracting and indexing data is becoming less useful. Author identification is increasingly useful because this information cannot be extracted by automated means.

This service will therefore not be an information service in the conventional sense, i.e. where a user's need for information is satisfied. Instead it will be a publishing service in the sense that its users will make information about themselves public.

Our experience with author registration within the RePEc project suggests that working directly with authors is key to the development of a digital library. Having correctly identified authors will open new business opportunities for the publishing industry.