Helsinki document

Thomas Krichel and Ivan Kurmavov

2006–01–04

This document is the Helsinki document. Its initial version was written by Thomas Krichel on 2003–08–31. Thomas Krichel and Ivan Kurmavov met 2003–09–02 to 2003–09–05 to discuss this document. We are grateful for the excellent hospitality of Eugenia G. Stupina during the meeting. This is a re-issue of the document for use in defining the requirements of phase 2 of ACIS.

1. Background on citation parsing

Citation parsing is the interpretation of citations in conventional scholarly papers. By "conventional" scholarly papers we mean papers that are not networked in an obvious way, i.e. through clickable hyperlinks. In such conventional papers, links between papers appear as references. These are plain short strings. At the end of the paper, or in footnotes, these short strings are resolved into longer strings, called citations, that contain descriptive metadata about the item being referred to.

Autonomous citation parsing means running a computer program that finds citations and references in a paper, and parses them into metadata components about the object being referred to. This process is conducted entirely by computer. "First-generation" citation parsing, as pioneered by CiteSeer, and also implemented in CiteBase, aims to parse the string into metadata components by just looking at the string on its own.

It should be possible to implement "second-generation" citation parsing. Such parsing will have to be based, in part, on intelligence that has been provided by humans. Examples are

Find journal names in the citation data using a dataset about scholarly journals and the ways in which their names are abbreviated.
Find author data by parsing citations using known author names.

This "usual suspects" list of authors can be provided by ACIS stage one.
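As a minimal sketch of the second example, the following looks for known author names inside a raw citation string. The list contents, the function name and the matching rule (case-insensitive, whole-name match) are assumptions made for illustration; they are not part of ACIS.

```python
import re

# Hypothetical "usual suspects" list; in practice it would come from
# the personal profiles maintained in ACIS stage one.
KNOWN_AUTHORS = ["A.U. Thor", "A. U. Thor", "E. I. Ditor"]


def find_known_authors(citation, known_authors=KNOWN_AUTHORS):
    """Return the known author names that occur in the citation string."""
    found = []
    lowered = citation.lower()
    for name in known_authors:
        # Require that the name is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(name.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append(name)
    return found


citation = 'A.U. Thor "A work" in E. I. Ditor (ed.) "A collection"'
print(find_known_authors(citation))
```

A journal-name lookup could follow the same pattern, with the list of names and abbreviations taken from a dataset about scholarly journals.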

2. Background on ACIS

ACIS is a project funded by the Open Society. Its basis is the Montréal document.

The ACIS software has two objectives

It serves the authors' vanity.
It improves bibliographic data.

Both objectives are intimately related. By serving the authors' vanity, ACIS extracts labor from the authors that improves the bibliographic data. If it were not for the improvement of the bibliographic data, the service to the authors' vanity would remain a private benefit to them that would not justify the efforts we put into creating ACIS.

In the Montréal document we wrote the following about stage 2

37. In stage two, the project will be extended from the authorship of documents to the authorship of citations contained in other documents. The system will scan citation data for the occurrence of the name of an author, and ask two questions. First, is this you who is being cited in this paper? Second, is this paper part of your research profile, i.e. the list of papers that is already available? We know that authors are very interested in obtaining data on citations to their works. The classic citation index offered by the Institute for Scientific Information is severely deficient in this respect because it indexes only the first author.

For PhysNet and RePEc, citation data is available through the Open Citation and CitEc projects, respectively. For rclis, citation data could be gathered through collaboration with CiteSeer, but for the moment this is out of the scope of the proposal. It is an option that will have to be studied. The ACIS project will fund the conversion of metadata provided by the citation data sources to a common subset of the Academic Metadata Format, that will be used for input into the database.

ACIS will export the value-add citation data for usage by the contributing citation indices. A precise way of doing this will have to be agreed between participants.

For those authorities where the visibility of services using the contributions profile is low, the integration of citation data provides an alternative way to attract contributors to an ACIS service. Authors have a strong desire to find out

how many times they have been cited,
which papers are citing their work,
which of their own papers have made the most impact as measured by citations

This is the first and most important motivation of integrating citations data into ACIS.

A second motivation for citation data in ACIS is to aid automated citation parsing. Manually maintained authorship data can be very useful to complement citation parsing.

There are two things that need to be clarified before we can proceed to integrate citations into the ACIS.

We need to set out how we introduce the data to the users, what facilities we offer etc. These are the "presentational" issues.
We need to set out how we read the data into the ACIS and how we write it out. These are the "representational" issues.

A common challenge to both is the issue of how to deal with citations, in general.

2: Inside and outside citations

There are at least two types of bibliographic data in an ACIS-based system.

There are authoritative data. These are data the authority has built from trusted bibliographic sources.
There are citation data. These appear in document bodies. The authority holds authoritative data for the documents that the citations appear in.

The documents described by the citation data may or may not be covered by the authoritative data. If a citation describes a document with authoritative data, we call it an inside citation. If the citation describes a document outside of the authoritative data, it is called an outside citation. Initially we have no way to tell whether a citation is inside or outside.

For the user, the difference between inside and outside citations is not always clear. Hiding the difference, or making it unnecessary for the user to understand it, is important to stage two of ACIS. Therefore the ACIS implementation is only going to allow users to perform a small number of tasks.

3: Presentation

A stage two ACIS service has a citation profile. Users will manage it on the "Citations profile" screen. Initially and upon later user request, the screen will perform a search for citations.

To find citations for a person, we use the name variations data of the personal profile. Every time a user changes the name variations profile, a new search for citations will be performed as soon as the user returns to the citations profile. By default, citation search only looks for exact matches of the name variations.

For better name recognition in citations, we need approximate string matching. Running this on the citation data may be too slow to do online; in that case we will do it off-line. If it is not too slow, we will give users a choice between exact and approximate matching. This is an item to investigate further.
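The difference between the two matching modes can be sketched as follows. The sliding-window comparison, the use of `difflib.SequenceMatcher` from the Python standard library and the threshold of 0.85 are assumptions chosen for illustration. The quadratic cost of the approximate variant is the kind of behaviour that may force it off-line.

```python
import difflib


def exact_match(citation, name_variations):
    """True if any name variation occurs verbatim in the citation string."""
    return any(variation in citation for variation in name_variations)


def approximate_match(citation, name_variations, threshold=0.85):
    """True if some window of the citation is close to a name variation.

    The threshold and the sliding-window approach are assumptions.
    """
    text = citation.lower()
    for variation in name_variations:
        target = variation.lower()
        width = len(target)
        # Compare the variation with every window of the same width.
        for start in range(max(1, len(text) - width + 1)):
            window = text[start:start + width]
            ratio = difflib.SequenceMatcher(None, target, window).ratio()
            if ratio >= threshold:
                return True
    return False


variations = ["Thor, A. U."]
citation = "Thor, A.U. (2003) A work"
print(exact_match(citation, variations))        # the spacing differs
print(approximate_match(citation, variations))  # close enough to match
```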

Working with the citation profile, users can perform four operations. These are

verification
identification
document item creation
revelation

Verification comes first. Then there are iterations on identification and document item creation. Revelation can be performed at any time.

Verification happens when users tell us that they authored the document described in the citation. This can be done with a check box next to the citation string. When this step is finished, the software saves the non-verified citations as refused. Therefore, at a new citation search, the already refused items will be discarded before they reach the users.
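A minimal sketch of this bookkeeping, assuming citations are handled as plain strings and the refused ones are kept in a set (both function names are hypothetical):

```python
def finish_verification(candidates, checked):
    """Split the candidate citations into verified and refused ones.

    `candidates` is the list of citation strings shown to the user;
    `checked` is the set of those whose check box was ticked.
    """
    verified = [c for c in candidates if c in checked]
    refused = {c for c in candidates if c not in checked}
    return verified, refused


def new_search_results(found, refused):
    """Drop previously refused citations before they reach the user."""
    return [c for c in found if c not in refused]


candidates = ["citation one", "citation two", "citation three"]
verified, refused = finish_verification(candidates, {"citation one"})

# A later search finds an old refused citation and a new one.
later = new_search_results(["citation two", "citation four"], refused)
print(verified, later)
```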

One special problem with the verification stage is that citations do not necessarily contain the description of only one work. Consider, for example "A.U. Thor "A work" in E. I. Ditor (ed.) "A collection"". In such cases, the "in E. I. Ditor (ed.) "A collection"" part is considered to be location information. In other words, in the previous paragraph we are quite exact in requiring authorship of documents. Users cannot associate themselves with the collections in which the cited work appears. This restriction is imposed for reasons of simplicity of the data model and interface. It must be communicated to the users.

We will devise a simple algorithm that determines whether a citation and a document are similar. If a citation is found to be similar to a document, we say that the document is an "interesting" document. If one document has more citations that are similar to it than another document, it is called a "more interesting" document.
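The similarity algorithm itself is left open here. As a sketch only, one simple candidate is the share of the words in a document title that also occur in the citation string. The measure, the threshold of 0.6 and the function names below are assumptions.

```python
import re


def tokens(text):
    """Lower-cased alphanumeric words of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_similar(citation, document_title, threshold=0.6):
    """Assumed measure: share of title words that occur in the citation."""
    title_words = tokens(document_title)
    if not title_words:
        return False
    overlap = len(title_words & tokens(citation))
    return overlap / len(title_words) >= threshold


def rank_by_interest(document_titles, citations):
    """Return (title, similar citations) pairs, most interesting first.

    Documents without any similar citation are not "interesting" and
    are left out.
    """
    ranked = []
    for title in document_titles:
        similar = [c for c in citations if is_similar(c, title)]
        if similar:
            ranked.append((title, similar))
    ranked.sort(key=lambda pair: len(pair[1]), reverse=True)
    return ranked
```

With such a ranking, the document pages of the identification stage can be visited in decreasing order of interest.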

In the identification stage, we first proceed by documents. We consider all documents in the contribution profile by decreasing level of interest for any new citation found. Each document page has a representation of the document and a list of all new similar citations. Each new similar citation has a check box on the left that is checked by default. If users uncheck the box, they tell us that this citation is not for the document that they are examining.

Below the lists of proposed new citations to a document, there is the list of already associated citations. This list will be clearly separated from the list of new citations. It will also have check boxes next to each citation. The boxes will appear on the right and will be checked by default. Unchecking them removes the association between citation and document.

The identification stage will leave us with citations that either are not similar to any of the documents, or whose similarity has been rejected by the users. In the document creation stage users can create new documents for any of these citations. Thus, at the document creation stage, we list all those unassociated citations. Each of these citations has a button "new document".

If the users follow the "new document" button, there will be a way for them to edit some information about the new document. After much debate, we resolved that since the citation is an unparsed string, users will only be able to edit the string. Ivan still thinks that adding a structured editing capability would be better.

The separate fourth step, which can be performed at any time, is revelation. For each paper in the citation profile, revelation will show the users which source documents cite that paper. This can be done through a separate link for each citation, because this is a step that can be performed on each citation in isolation.

4: Representational issues

An outside citation is identified by a collated form of its string representation. If the collated form of the string changes, any data created by ACIS for that citation will be lost. After verification, the representation within the author's profile in AMF is

<person id=" id ">
    <isauthorof>
        <text>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation </literal>
                    </reference>
                </text>
            </iscitedby>
        </text>
    </isauthorof>
</person>

This is ugly but it gets the job done. As an aside, this AMF data would only be read by a specialized application; there is no reason to develop a general AMF parser.
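The following sketch shows how a specialized application might compute a collated form and write the structure above using Python's standard `xml.etree.ElementTree`. The collation rule (lower-casing, dropping punctuation, collapsing whitespace) is an assumption, since the collated form is not defined here. The function names are hypothetical, and namespaces are omitted as in the sample above.

```python
import re
import xml.etree.ElementTree as ET


def collate(citation):
    """Assumed collation: lower-case, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^\w\s]", " ", citation.lower())
    return " ".join(cleaned.split())


def verified_citation_record(person_id, source_ref, citation):
    """Build the AMF fragment for one verified, unidentified citation."""
    person = ET.Element("person", {"id": person_id})
    isauthorof = ET.SubElement(person, "isauthorof")
    cited = ET.SubElement(isauthorof, "text")       # cited text, no ref yet
    iscitedby = ET.SubElement(cited, "iscitedby")
    citing = ET.SubElement(iscitedby, "text", {"ref": source_ref})
    reference = ET.SubElement(citing, "reference")
    ET.SubElement(reference, "literal").text = citation
    return person


record = verified_citation_record("id", "ref", "text of citation")
print(ET.tostring(record, encoding="unicode"))
```

Two citation strings that differ only in punctuation or spacing receive the same collated form under this rule, so data created for one of them would still be found for the other.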

After identification, the representation within the author's profile in AMF is

<person id=" id ">
    <isauthorof>
        <text ref=" ref ">
            <iscitedby>
                <text ref=" ref "/>
            </iscitedby>
        </text>
    </isauthorof>
</person>

Note that the literal of the citation could still be there, but may be omitted to save space. If its collated representation changes, the association between the texts does not disappear. There is no ambiguity, because both the source and the target of the reference are identified.

Before accumulation, there is simply a range of texts in the profile. Each text appears on its own. In the unidentified case, for three citations, this is

<person id=" id ">
    <isauthorof>
        <text>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation one </literal>
                    </reference>
                </text>
            </iscitedby>
        </text>
    </isauthorof>
    <isauthorof>
        <text>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation two </literal>
                    </reference>
                </text>
            </iscitedby>
        </text>
    </isauthorof>
    <isauthorof>
        <text>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation three </literal>
                    </reference>
                </text>
            </iscitedby>
        </text>
    </isauthorof>
</person>

After accumulation, citations of the same text are grouped.

<person id=" id ">
    <isauthorof>
        <text>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation one </literal>
                    </reference>
                </text>
            </iscitedby>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation two </literal>
                    </reference>
                </text>
            </iscitedby>
        </text>
    </isauthorof>
    <isauthorof>
        <text>
            <iscitedby>
                <text ref=" ref ">
                    <reference>
                        <literal> text of citation three </literal>
                    </reference>
                </text>
            </iscitedby>
        </text>
    </isauthorof>
</person>

We do not have, at present, a way to mark up the head citation. We will assume that it is the first one.
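As a sketch of the accumulated structure, the following builds one text element per group of citations and reads the head citation back as the first one in its group. How citations are assigned to groups is outside the sketch, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def accumulate(person_id, groups):
    """Build a profile in which each group of citations shares one <text>.

    `groups` is a list of lists of (source_ref, literal) pairs.
    """
    person = ET.Element("person", {"id": person_id})
    for group in groups:
        text = ET.SubElement(ET.SubElement(person, "isauthorof"), "text")
        for source_ref, literal in group:
            iscitedby = ET.SubElement(text, "iscitedby")
            citing = ET.SubElement(iscitedby, "text", {"ref": source_ref})
            reference = ET.SubElement(citing, "reference")
            ET.SubElement(reference, "literal").text = literal
    return person


def head_citation(text_element):
    """Return the literal of the first citation, assumed to be the head."""
    return text_element.findtext("iscitedby/text/reference/literal")


groups = [
    [("ref", "text of citation one"), ("ref", "text of citation two")],
    [("ref", "text of citation three")],
]
profile = accumulate("id", groups)
for text in profile.findall("isauthorof/text"):
    print(head_citation(text))
```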