Report on the Effingham Agreement

Thomas Krichel, Victor M. Lyapunov and Tanya I. Yakovleva

2000-04-04

The lyak1.pl script finds 14,032 unformatted abstract files in the arXiv.org collection. Lyak1.pl converts this data to the following format ("lyak-records"):

paper-code <internal debugging indicator>
SOURCE: untagged Title+Authors+etc. field from abstract file, possible continuation lines starts with a space
Title: The lyak1.pl-proposed Title field, possible continuation lines starts with a space
Authors: The lyak1.pl-proposed Authors field, possible continuation lines starts with a space
<EMPTY LINE as a record separator>

These data are contained in the file "pre-man.txt". A command line running lyak1.pl at openlib.org looks like this:
find ~/data/xxx/ -name \*.abs | xargs cat | ./lyak1.pl > pre-man.txt

The lyak-records were manually tested and edited if necessary by Tanya. She corrected 3,160 of total 14,032 records, which makes about 22.5%. The result was stored in the file "post-man.txt". Victor created the file "mod.txt" of lyak1-generated records, which gone under manual editing, and inspected carefully a significant part of it. All the manual corrections of lyak1.pl script's proposition appeared to be justified.

The expected share, according to the agreement text, of manually corrected records was 10%. The share of manually fixed records is therefore greater than expected. Victor can explain--but not justify--this great difference between the result and the estimation:

the lyak1.pl considers list of authors follows immediately after the title. A lot of source records, however, contain some publication-relevant data between the title and the author list. Lyak-1 glues this data to the 'title' field, thus the latter needs cleaning.

the same situation appears at the end of the author list. The script often glues part of author-irrelevant data to its end.

While doing the estimation in autumn 1999, Victor concentrated on the correct location of the beginning of the author list. This estimation happened very close to the result: the beginning of author list was located incorrectly in 1,426 of 14,032 records, which makes ~10.2% .

Lyak2.pl script processes the post-man.txt file and produces the hash-file "tac.db" with paper-code as key and {Title, Authors, Comments, Journal-ref, Subj-class} lines as value.

Lyak2.pl outputs a diagnostic message for every record it can't handle:

<Reason> paper-code <tech.info>
S: <source untagged field>
T: <Title>
A: <Authors list>
J: <Journal-ref>
L: <Subj-Class>

Optionally (-v) a text dump of hash-file is done in the same format. The command line to run lyak2.pl looks like this:

./lyak2.pl [-v] < post-man.txt > diag.txt

Most part of rejected records lack either Title or Authors fields, thus being a subject of correcting the source abstract files.

Lyak2.pl could not process some records with disjoint Authors field, even after manual correction, the typical template follows:

SOURCE: title author-name {paper-relevant info} author-workplace
Title: title
Author: author-name author-workplace

There are 110 rejected records in total.

All the files mentioned above are placed at ftp://openlib.org/pub/openlib/effingham