The lyak1.pl script finds 14,032 unformatted abstract files in the arXiv.org collection. Lyak1.pl converts these data to the following format ("lyak-records"):
paper-code <internal debugging indicator>
SOURCE: untagged Title+Authors+etc. field from abstract file,
possible continuation lines start with a space
Title: The lyak1.pl-proposed Title field,
possible continuation lines start with a space
Authors: The lyak1.pl-proposed Authors field,
possible continuation lines start with a space
<EMPTY LINE as a record separator>
These data are contained in the file "pre-man.txt".
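As an illustration of the record layout described above, here is a minimal Python sketch of a reader for lyak-records. It is not part of lyak1.pl or lyak2.pl; the function name parse_lyak_records and the "debug" key are assumptions made for this example, and only the three tags listed above are recognized.

```python
from typing import Dict, Iterable, List

KNOWN_TAGS = ("SOURCE", "Title", "Authors")


def parse_lyak_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse lyak-records: a header line, tagged fields, empty-line separator."""
    records: List[Dict[str, str]] = []
    current = None   # record being built, or None between records
    last_tag = None  # tag that a continuation line would extend
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # An empty line closes the current record.
            if current is not None:
                records.append(current)
            current, last_tag = None, None
            continue
        if current is None:
            # First line of a record: paper-code, then the debugging indicator.
            code, _, debug = line.partition(" ")
            current = {"paper-code": code, "debug": debug.strip()}
            continue
        if line.startswith(" ") and last_tag is not None:
            # Continuation lines start with a space and extend the last field.
            current[last_tag] += " " + line.strip()
            continue
        tag, sep, value = line.partition(":")
        if sep and tag in KNOWN_TAGS:
            current[tag] = value.strip()
            last_tag = tag
    if current is not None:
        # The file may end without a trailing empty line.
        records.append(current)
    return records
```

Under these assumptions, the whole file could be read with parse_lyak_records(open("pre-man.txt")), giving one dictionary per record.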
A command line running lyak1.pl at openlib.org looks like this:
find ~/data/xxx/ -name \*.abs | xargs cat | ./lyak1.pl > pre-man.txt
Tanya manually checked the lyak-records and edited them where necessary. She corrected 3,160 of the 14,032 records, or about 22.5%. The result was stored in the file "post-man.txt". Victor created the file "mod.txt", containing the lyak1-generated records that underwent manual editing, and carefully inspected a significant part of it. All the manual corrections to the lyak1.pl script's proposals appeared to be justified.
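How mod.txt was actually produced is not stated here. The following Python sketch shows one way the manually edited records could be identified by comparing the two files, and how the 22.5% share is computed. It assumes that records are separated by empty lines and that the paper-code is the first word of each record; the function names are hypothetical.

```python
def split_records(text):
    """Split a lyak-record file into a {paper-code: record text} mapping."""
    records = {}
    for block in text.split("\n\n"):
        block = block.strip("\n")
        if not block.strip():
            continue
        # Assumption: the paper-code is the first word of the record.
        code = block.split(None, 1)[0]
        records[code] = block
    return records


def modified_records(pre_text, post_text):
    """Return paper-codes whose record text differs between the two files.

    A record missing from the second file also counts as modified.
    """
    pre = split_records(pre_text)
    post = split_records(post_text)
    return sorted(code for code in pre if post.get(code) != pre[code])


def share(corrected, total):
    """Percentage of corrected records."""
    return 100.0 * corrected / total


# Figures taken from the text: 3,160 corrected out of 14,032 records.
print(round(share(3160, 14032), 1))  # 22.5
```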
According to the agreement text, the expected share of manually corrected records was 10%. The actual share of about 22.5% is therefore more than twice the estimate. Victor can explain--but not justify--this large difference between the result and the estimate:
The lyak2.pl script processes the post-man.txt file and produces the hash file "tac.db", with the paper-code as key and the {Title, Authors, Comments, Journal-ref, Subj-class} lines as value.
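The key/value layout just described can be illustrated with a small Python sketch. It uses Python's standard dbm.dumb module purely for illustration; the real tac.db is written by lyak2.pl and its on-disk format is not specified here, so files produced by this sketch should not be expected to be compatible with it. The helper names store_record and load_record are assumptions.

```python
import dbm.dumb

FIELDS = ("Title", "Authors", "Comments", "Journal-ref", "Subj-class")


def store_record(db, paper_code, fields):
    """Store the non-empty fields as 'Name: value' lines under the paper-code."""
    lines = [f"{name}: {fields[name]}" for name in FIELDS if fields.get(name)]
    db[paper_code.encode("utf-8")] = "\n".join(lines).encode("utf-8")


def load_record(db, paper_code):
    """Read the lines stored under a paper-code back into a dictionary."""
    text = db[paper_code.encode("utf-8")].decode("utf-8")
    return dict(line.split(": ", 1) for line in text.split("\n") if line)
```

A database would be opened with dbm.dumb.open("tac", "c") and closed with its close() method once all records are stored.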
Lyak2.pl outputs a diagnostic message for every record it can't handle:
<Reason> paper-code <tech.info>
S: <source untagged field>
T: <Title>
A: <Authors list>
J: <Journal-ref>
L: <Subj-Class>
With the optional -v flag, a text dump of the hash file is also produced in the same format. The command line to run lyak2.pl looks like this:
./lyak2.pl [-v] < post-man.txt > diag.txt
Most of the rejected records lack either the Title or the Authors field, and so can only be fixed by correcting the source abstract files.
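The rejection of records with a missing field and the diagnostic layout shown earlier can be sketched in Python as follows. This is only an illustration, not the logic of lyak2.pl: the reason strings "NO-TITLE" and "NO-AUTHORS" are invented for the example, since the actual <Reason> values are not listed here.

```python
# Diagnostic line prefixes and the record fields they report.
DIAG_FIELDS = (("S", "SOURCE"), ("T", "Title"), ("A", "Authors"),
               ("J", "Journal-ref"), ("L", "Subj-class"))


def rejection_reason(record):
    """Return a (hypothetical) reason string if Title or Authors is missing."""
    for field in ("Title", "Authors"):
        if not record.get(field, "").strip():
            return "NO-" + field.upper()
    return None


def format_diagnostic(reason, paper_code, record, tech_info=""):
    """Format one diagnostic message: a header line, then S/T/A/J/L lines."""
    header = " ".join(p for p in (reason, paper_code, tech_info) if p)
    body = [f"{prefix}: {record.get(field, '')}" for prefix, field in DIAG_FIELDS]
    return "\n".join([header] + body)
```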
Lyak2.pl could not process some records with a disjoint Authors field, even after manual correction. A typical template follows:
SOURCE: title author-name {paper-relevant info} author-workplace
Title: title
Authors: author-name author-workplace
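One possible reading of "disjoint" is that the corrected field no longer appears as a single contiguous piece of the SOURCE text, because other information sits between its parts. The Python sketch below tests for that under this assumption; whether lyak2.pl applies the same check is not stated here, and the function names are hypothetical.

```python
def normalize(text):
    """Collapse runs of whitespace so that line wrapping does not matter."""
    return " ".join(text.split())


def is_contiguous(source, field):
    """True if the field text appears as one unbroken piece of the source."""
    return normalize(field) in normalize(source)


source = "title author-name {paper-relevant info} author-workplace"
print(is_contiguous(source, "title"))                         # True
print(is_contiguous(source, "author-name author-workplace"))  # False
```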
There are 110 rejected records in total.
All the files mentioned above are placed at ftp://openlib.org/pub/openlib/effingham