This is the Turin paper. It describes an extension of the presorting in ernad. The paper has to be read in conjunction with the Altai paper. I am grateful for comments by Joanna P. Davies.
This is the version of 2009‒12‒11. Other versions are available.
This is what the Altai paper says about feature extraction:
The svm_light data is approximately created as follows. The abstracts, author names, classifications, keywords, series and titles are used. The values of these fields are concatenated and all punctuation is removed. Then all values are converted to lowercase. This results in a list of features. We count the number of appearances of each feature within a paper as its weight. We normalize every paper to a Euclidean length of one. Then this data is prepared according to the syntax rules of svm_light.
There are no multi-word features.
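The feature extraction quoted above can be illustrated with a minimal Python sketch. It is not the ernad implementation. The field keys in FIELDS, the function names and the vocabulary mapping from feature to index are assumptions made for illustration only. The output line follows the svm_light convention of a target value followed by index:value pairs in ascending index order.

```python
import math
import re
from collections import Counter

# Hypothetical field keys; the Altai paper names the fields but not their keys.
FIELDS = ("abstract", "authors", "classification", "keywords", "series", "title")


def extract_features(paper):
    """Return {feature: weight}, with the weight vector scaled to Euclidean length 1."""
    text = " ".join(paper.get(field, "") for field in FIELDS)
    # Remove punctuation (anything that is not a letter, digit or whitespace),
    # then convert to lowercase.
    text = re.sub(r"[^\w\s]|_", "", text).lower()
    # The weight of a feature is its number of appearances in the paper.
    counts = Counter(text.split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {feature: c / norm for feature, c in counts.items()}


def svm_light_line(label, weights, vocabulary):
    """Format one example as 'label index:value ...' with ascending indices."""
    pairs = sorted((vocabulary[f], w) for f, w in weights.items() if f in vocabulary)
    return " ".join([str(label)] + [f"{index}:{w:.6f}" for index, w in pairs])
```

Since every feature here is a single whitespace-delimited token, the sketch produces no multi-word features, in line with the remark above.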
The Turin project:
Software that is external to ernad collects text files with ".txt"
extension in home/var/keywords, where
home is the ernad home directory, usually
/home/ernad/ernad. The files contain keyword
expressions separated by separators such as comma and/or semicolon,
as they appear in the original source data. How these text files get
there is beyond the scope of the Turin project.
Project software build_slovar processes the keywords
into a processed file called the slovar. It is kept in
home/var/polygrams, with file names
slovar_tist.txt, where
tist is the time stamp of the end of run of the slovar
creation. The file contains polygrams, one polygram in each line,
ordered by occurrence frequency in the expression data.
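A minimal sketch of what build_slovar might do is given below. It is not the project software. It assumes that a polygram is a keyword expression after lowercasing and whitespace normalization, that the most frequent polygram comes first, and that the tist is a compact date-time string. None of these details is fixed by the description above, and the function names are hypothetical.

```python
import re
import time
from collections import Counter
from pathlib import Path


def read_expressions(keywords_dir):
    """Yield normalized keyword expressions from every .txt file in keywords_dir."""
    for path in sorted(Path(keywords_dir).glob("*.txt")):
        for line in path.read_text(encoding="utf-8").splitlines():
            # Expressions are separated by commas and/or semicolons.
            for expression in re.split(r"[,;]", line):
                # Assumed normalization: lowercase and collapse whitespace.
                expression = " ".join(expression.lower().split())
                if expression:
                    yield expression


def build_slovar(keywords_dir, polygrams_dir, tist=None):
    """Write slovar_<tist>.txt, one polygram per line, and return its path."""
    counts = Counter(read_expressions(keywords_dir))
    # Assumed order: descending frequency, ties broken alphabetically.
    polygrams = sorted(counts, key=lambda p: (-counts[p], p))
    # Assumed tist format, taken when the run ends.
    tist = tist or time.strftime("%Y%m%d%H%M%S")
    out = Path(polygrams_dir) / f"slovar_{tist}.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("".join(p + "\n" for p in polygrams), encoding="utf-8")
    return out
```

Breaking ties alphabetically keeps the line numbers of a slovar reproducible, which matters because the polygram cache refers to polygrams by line number.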
In order to prepare for presorting, a cache is built that stores the
occurrences of the polygrams in an allport issue. This is also stored
in home/var/polygrams, with file names
polygrams_allport_date_slovar_tist.txt.
Here allport_date is the date of the allport issue,
slovar_tist is the tist of the slovar used. Each single
file is known as a “polygrammer”. Lines in the polygrammer are of the
form
paper_number polygram_number:occurrences
Here, paper_number is the number of the paper in the allport issue,
polygram_number is the line number of the polygram in the
slovar of the corresponding tist, and occurrences is the number
of occurrences of the polygram in that paper. The
polygram_number:occurrences pair is repeated for every polygram found
in the paper. All numbers start with
1, rather than 0. A sample line would be: 3 7:1 67:2 2090:1
13901:2. In principle, the first number is not necessary, but
it makes the file look more coherent when the rest of the line is empty.
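The line format can be made concrete with a short Python sketch that writes and reads one polygrammer line. The function names are hypothetical. Writing the pairs in ascending order of polygram number is an assumption suggested by the sample line, not a stated rule.

```python
def format_polygrammer_line(paper_number, occurrences):
    """Build one line; occurrences maps polygram_number to its count."""
    parts = [str(paper_number)]
    # Assumed order: ascending polygram number, as in the sample line.
    parts += [f"{number}:{count}" for number, count in sorted(occurrences.items())]
    return " ".join(parts)


def parse_polygrammer_line(line):
    """Return (paper_number, {polygram_number: count}) for one line."""
    fields = line.split()
    occurrences = {}
    for field in fields[1:]:
        number, count = field.split(":")
        occurrences[int(number)] = int(count)
    return int(fields[0]), occurrences
```

A paper without any polygram yields a line that holds only its paper number, for example format_polygrammer_line(4, {}) gives "4".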
Project software build_polygram_cache builds the
polygram occurrence cache for currently relevant allport issues. The
software automatically finds which caches are still to build, using
the tist of the latest slovar, the currently available allport RIFs
and the experience length from the ernad configuration.
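How build_polygram_cache might decide which caches are still to build is sketched below. This is an illustration under several assumptions: tists and allport dates sort chronologically as strings, the experience length is a number of most recent allport issues, and the file name is polygrams_, the allport date, an underscore, the slovar tist and .txt. The function names are hypothetical.

```python
from pathlib import Path


def latest_slovar_tist(polygrams_dir):
    """Return the tist of the newest slovar, or None if there is no slovar."""
    # Assumption: tists sort chronologically when compared as strings.
    tists = [p.stem[len("slovar_"):] for p in Path(polygrams_dir).glob("slovar_*.txt")]
    return max(tists) if tists else None


def caches_to_build(polygrams_dir, allport_dates, experience_length):
    """List the polygrammer file names that are relevant but not yet on disk."""
    tist = latest_slovar_tist(polygrams_dir)
    if tist is None or experience_length <= 0:
        return []
    # Assumption: the relevant issues are the experience_length most recent ones.
    relevant = sorted(allport_dates)[-experience_length:]
    wanted = [f"polygrams_{date}_{tist}.txt" for date in relevant]
    return [name for name in wanted if not (Path(polygrams_dir) / name).exists()]
```

Under these assumptions, a new slovar makes every relevant cache stale at once, because the tist is part of the file name.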
A trivial extension to the ernad presorting.pl uses the features of the processed slovar like individual features. We introduce this extension as an addition to the presorting process. After normal presorting is run, RIFs are presorted again, with three changes.
/var/reports/report/svm.
/var/reports/report/source/ns.
As a result, runs of presorting will last at least twice as long.
We hope that the calculations can be done on a separate machine, but such a machine cannot be guaranteed to be available.
For the course work in the first year, a discussion of the problem, and a script that implements the creation of the polygrammer will be sufficient. The efficient creation of the polygrammer is the main technical hurdle when the number of polygrams is large.
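One possible way around this hurdle, sketched here in Python and not taken from the project, is to avoid scanning the paper once per polygram. The slovar is loaded into a hash table keyed by word tuples, and every word window of the paper up to the longest polygram length is looked up in that table. The cost per paper then depends on the paper length and the longest polygram, not on the number of polygrams. The function names and the whitespace tokenization are assumptions.

```python
from collections import Counter


def index_slovar(polygrams):
    """Map each polygram's word tuple to its 1-based line number in the slovar."""
    index = {}
    for number, polygram in enumerate(polygrams, start=1):
        # Keep the first line number if a polygram appears twice.
        index.setdefault(tuple(polygram.split()), number)
    max_len = max((len(key) for key in index), default=0)
    return index, max_len


def count_polygrams(tokens, index, max_len):
    """Count occurrences of every indexed polygram in a tokenized paper."""
    counts = Counter()
    for start in range(len(tokens)):
        longest = min(max_len, len(tokens) - start)
        for length in range(1, longest + 1):
            number = index.get(tuple(tokens[start:start + length]))
            if number is not None:
                counts[number] += 1
    return counts
```

In this sketch overlapping polygrams are all counted, so a paper containing "inflation targeting" also counts one occurrence of "inflation" if both are in the slovar. Whether that is the desired behaviour is a design choice left open here.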
The dissertation will review the evidence on the improvement of presorting. A large part of the work will be to survey the evidence gathered.
To measure the improvements in presorting, use the “nosel” measure, described in section 4.1 of the Sendai paper, cited as published in 2007 in “Information Processing and Management”, vol. 43, pp. 1030–1043. This implies that measures are only coherent within a single nep-all issue and a single report.
The student will independently develop ways to visualize the results.