Turin paper

by Thomas Krichel

Status

This is the Turin paper. It describes an extension of the presorting in ernad. The paper has to be read in conjunction with the Altai paper. I am grateful for comments by Joanna P. Davies.

This is the version of 2009‒12‒11. Other versions available

2008‒05‒24

Introduction

This is what the Altai paper says about feature extraction:

The svm_light data is approximately created as follows. The abstracts, author names, classifications, keywords, series and titles are used. The values of these fields are concatenated and all punctuation is removed. Then all values are converted to lowercase. This results in a list of features. We count the number of appearances of each feature within a paper as its weight. We normalize every paper using a Euclidean distance of one. Then this data is prepared according to the syntax rules of svm_light.

There are no multi-word features.
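The steps quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of presorting.pl: the function names, the growing vocabulary dictionary that assigns feature numbers, and the six-decimal output format are all hypothetical. Punctuation is deleted rather than replaced by a space, which is one possible reading of "removed".

```python
import math
import string
from collections import Counter


def extract_features(fields):
    """Concatenate field values, delete punctuation, lowercase, count words."""
    text = " ".join(fields)
    # Assumption: punctuation characters are deleted, not replaced by spaces.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return Counter(text.lower().split())


def normalize(counts):
    """Scale the weights so that the feature vector has Euclidean length one."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {word: v / norm for word, v in counts.items()}


def svm_light_line(label, weights, vocabulary):
    """Format one paper as '<label> <index>:<value> ...' with ascending indices.

    `vocabulary` is a hypothetical dict mapping each feature to a 1-based
    feature number; unseen features are appended to it.
    """
    pairs = []
    for word, weight in weights.items():
        index = vocabulary.setdefault(word, len(vocabulary) + 1)
        pairs.append((index, weight))
    pairs.sort()
    return f"{label} " + " ".join(f"{i}:{w:.6f}" for i, w in pairs)


vocabulary = {}
counts = extract_features(["Money, money!", "Growth"])
line = svm_light_line(1, normalize(counts), vocabulary)
```

Each feature here is a single word, which is why no multi-word features arise from this procedure.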

The Turin project:

changes preprocessing code so that it uses libSVM instead of SVM_light. This will be done together with Thomas Krichel.
introduces multi-word features (henceforth: polygrams) into a new version of presorting.pl
and runs tests to evaluate how much polygrams improve presorting.

Introducing multi-word features

Software that is external to ernad collects text files with ".txt" extension in home/var/keywords, where home is the ernad home directory, usually /home/ernad/ernad. The files contain keyword expressions separated by separators such as comma and/or semicolon, as they appear in the original source data. How these text files get there is beyond the scope of the Turin project.

Project software build_slovar processes the keywords into a processed file called the slovar. It is kept in home/var/polygrams, with file names slovar_tist.txt, where tist is the time stamp of the end of run of the slovar creation. The file contains polygrams, one polygram in each line, ordered by occurrence frequency in the expression data.
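A minimal sketch of what the slovar creation might look like is given here. It is not the build_slovar software itself. It assumes that newlines also act as separators, that only expressions of two or more words count as polygrams, that expressions are lowercased, and that the most frequent polygram comes first with ties broken alphabetically. None of these choices is fixed by the description above.

```python
import re
from collections import Counter


def build_slovar(keyword_texts):
    """Return polygrams ordered by occurrence frequency, most frequent first."""
    counts = Counter()
    for text in keyword_texts:
        # Assumption: comma, semicolon and newline separate keyword expressions.
        for expression in re.split(r"[,;\n]", text):
            words = expression.lower().split()
            # Assumption: a polygram has at least two words.
            if len(words) >= 2:
                counts[" ".join(words)] += 1
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [polygram for polygram, _ in ordered]


slovar = build_slovar([
    "monetary policy, inflation; Monetary  Policy",
    "labor market\nmonetary policy",
])
```

Writing the returned list one entry per line would give a file whose line numbers serve as polygram numbers.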

In order to prepare for presorting, a cache is built that stores the occurrences of the polygrams in an allport issue. This is also stored in home/var/polygrams, with file names polygrams_allport_date_slovar_tist.txt. Here allport_date is the date of the allport issue, and slovar_tist is the tist of the slovar used. Each such file is known as a “polygrammer”. Lines in the polygrammer are of the form
paper_number polygram_number:occurrences polygram_number:occurrences ...
Here, paper_number is the number of the paper in the allport issue, polygram_number is the line number of the polygram in the slovar of the corresponding tist, and occurrences is the number of occurrences of the polygram in that paper. All numbers start with 1, rather than 0. A sample line would be:
3 7:1 67:2 2090:1 13901:2
In principle, the first number is not necessary, but it makes a line look more coherent when the rest of the line is empty, that is, when the paper contains none of the polygrams.
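The line format can be illustrated with a short sketch that writes and reads one polygrammer line. The function names are hypothetical, and the sketch assumes that the paper text has already been stripped of punctuation and that slovar entries are lowercase with single spaces. Instead of scanning the text once per polygram, it looks up every word sequence of a relevant length in a dictionary, which is one possible way to keep the work manageable when the slovar is large.

```python
from collections import Counter


def polygrammer_line(paper_number, text, slovar):
    """Build 'paper_number polygram_number:occurrences ...' for one paper."""
    # Polygram numbers are slovar line numbers, starting with 1.
    numbers = {polygram: n for n, polygram in enumerate(slovar, start=1)}
    lengths = {len(polygram.split()) for polygram in slovar}
    words = text.lower().split()
    counts = Counter()
    for length in lengths:
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in numbers:
                counts[numbers[candidate]] += 1
    entries = [f"{number}:{count}" for number, count in sorted(counts.items())]
    return " ".join([str(paper_number)] + entries)


def parse_polygrammer_line(line):
    """Return (paper_number, {polygram_number: occurrences})."""
    fields = line.split()
    occurrences = {}
    for field in fields[1:]:
        number, count = field.split(":")
        occurrences[int(number)] = int(count)
    return int(fields[0]), occurrences


slovar = ["monetary policy", "labor market", "real business cycle"]
text = "Monetary policy and the labor market under monetary policy rules"
line = polygrammer_line(3, text, slovar)
```

A paper without any polygram yields a line that holds only its paper number.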

Project software build_polygram_cache builds the polygram occurrence cache for currently relevant allport issues. The software automatically finds which caches are still to be built, using the tist of the latest slovar, the currently available allport RIFs and the experience length from the ernad configuration.
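How the missing caches might be found can be sketched as follows. This is not the build_polygram_cache software. It assumes that tists and allport dates sort chronologically when compared as strings, and that the experience length is a number of most recent allport issues. Both are assumptions that would have to be checked against the ernad configuration. The function names are hypothetical.

```python
def latest_slovar_tist(file_names):
    """Return the tist of the newest slovar_<tist>.txt, or None if there is none."""
    prefix, suffix = "slovar_", ".txt"
    tists = [
        name[len(prefix):-len(suffix)]
        for name in file_names
        if name.startswith(prefix) and name.endswith(suffix)
    ]
    # Assumption: tists compare chronologically as strings.
    return max(tists, default=None)


def pending_caches(file_names, allport_dates, experience_length):
    """List polygrammer file names that do not exist yet for the latest slovar."""
    tist = latest_slovar_tist(file_names)
    if tist is None or experience_length <= 0:
        return []
    # Assumption: the experience length counts the most recent allport issues.
    relevant = sorted(allport_dates)[-experience_length:]
    existing = set(file_names)
    wanted = [f"polygrams_{date}_{tist}.txt" for date in relevant]
    return [name for name in wanted if name not in existing]


names = [
    "slovar_20091101.txt",
    "slovar_20091211.txt",
    "polygrams_2009-11-28_20091101.txt",
    "polygrams_2009-12-05_20091211.txt",
]
todo = pending_caches(names, ["2009-11-21", "2009-11-28", "2009-12-05"], 2)
```

In practice the list of file names could come from a directory listing of home/var/polygrams.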

A trivial extension to the ernad presorting.pl uses the polygrams of the processed slovar like individual features. We introduce this extension as an addition to the presorting process. After normal presorting is run, RIFs are presorted again, with three changes.

The required libSVM files (training, testing, result) are stored in home/var/reports/report/svm.
The resulting presorted RIFs are stored in home/var/reports/report/source/ns.
And, of course, multi-word features are included.
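One way to treat polygrams like individual features is to append them to the word features with shifted feature numbers. The sketch below assumes that word features are numbered from 1 up to a known vocabulary size and that polygram counts enter as raw weights. Whether the combined vector is normalized again is left open here, and the function names are hypothetical.

```python
def merge_features(word_features, polygram_counts, vocabulary_size):
    """Append polygram features after the single-word features.

    Assumption: word feature numbers run from 1 to vocabulary_size, so
    polygram number n becomes feature number vocabulary_size + n.
    """
    merged = dict(word_features)
    for number, count in polygram_counts.items():
        merged[vocabulary_size + number] = float(count)
    return merged


def libsvm_line(label, features):
    """Format one paper as '<label> <index>:<value> ...' with ascending indices."""
    pairs = sorted(features.items())
    return f"{label} " + " ".join(f"{index}:{value:g}" for index, value in pairs)


features = merge_features({1: 0.5, 4: 0.25}, {7: 1, 2: 2}, 1000)
line = libsvm_line(1, features)
```

The shift keeps word and polygram feature numbers from colliding, so the same training and testing files can hold both kinds of feature.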

As a result, runs of presorting will last at least twice as long.

We hope that the calculations can be done on a separate machine, but such a machine cannot be guaranteed to be available.

Student dissertation

For the course work in the first year, a discussion of the problem, and a script that implements the creation of the polygrammer will be sufficient. The efficient creation of the polygrammer is the main technical hurdle when the number of polygrams is large.

The dissertation will review the evidence on the improvement of presorting. A large part of the work will be to survey the evidence gathered.

To measure the improvements in presorting, use the “nosel” measure, described in section 4.1 of the Sendai paper, cited as published in 2007 in “Information Processing and Management”, vol. 43, pp. 1030–1043. This implies that measures are only coherent within a single nep-all issue and a single report.

The student will independently develop ways to visualize the results.