Turin paper

by Thomas Krichel

Status

This is the Turin paper. It describes an extension of the presorting in ernad. The paper has to be read in conjunction with the Altai paper.

This is the version of 2008-05-24.

Introduction

The current version of ernad uses feature extraction, as noted in the Altai paper.

The svm_light data is created roughly as follows. The abstracts, author names, classifications, keywords, series and titles are used. The values of these fields are concatenated and all punctuation is removed. Then all values are lowercased. This results in a list of features. We count the number of appearances of each feature within a paper and use that count as the feature's weight. We normalize every paper so that its feature vector has a Euclidean length of one. Then this data is prepared according to the syntax rules of svm_light.
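The steps above can be sketched as follows. This is only an illustration in Python, not the ernad code, which is written in Perl. The function names, the replacement of punctuation by spaces, the splitting on whitespace and the numbering of features by first appearance are all assumptions made for the example.

```python
import math
import string
from collections import Counter

# Assumption: punctuation is replaced by spaces so that words do not run together.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def extract_features(fields):
    """Concatenate field values, drop punctuation, lowercase, count each word."""
    text = " ".join(fields).translate(PUNCTUATION_TO_SPACE)
    return Counter(text.lower().split())

def normalize(counts):
    """Scale the weights so that the paper's vector has Euclidean length one."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    if norm == 0:
        return {}
    return {feature: v / norm for feature, v in counts.items()}

def svm_light_line(label, weights, feature_ids):
    """Format one paper as '<target> <id>:<value> ...' with ascending ids.

    feature_ids is a hypothetical shared dict mapping each feature to an
    integer id; unseen features get the next free id.
    """
    pairs = []
    for feature, weight in weights.items():
        feature_id = feature_ids.setdefault(feature, len(feature_ids) + 1)
        pairs.append((feature_id, weight))
    pairs.sort()  # svm_light expects feature ids in increasing order
    body = " ".join(f"{i}:{w:.6f}" for i, w in pairs)
    return f"{label:+d} {body}"

if __name__ == "__main__":
    ids = {}
    paper = ["Money demand", "money, Money!"]
    print(svm_light_line(1, normalize(extract_features(paper)), ids))
```

Here the label 1 stands for a paper that belongs to the subject and -1 would stand for one that does not.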

There are no multiword features.

The Turin project:

Introducing multiword features

Project software renew_raw_slovar_file collects the expressions in the <keywords> of the entire RePEc dataset. The result is a file called the raw slovar. It is kept in ~/ernad/opt/var/slovar.

Project software refine_slovar processes the raw slovar into a file called the refined slovar. It is kept in ~/ernad/ernad/opt/var/slovar.txt. The project will test various ways of doing this refinement.

A trivial extension to the ernad presorting.pl treats the entries of the refined slovar as if they were individual features. We introduce this extension as an addition to the presorting process. After the normal presorting has run, the rifs are presorted again, with three changes.

As a result, Presorting.pm will take at least twice as long to run.

We hope that the calculations can be done on a separate machine, but we cannot guarantee that such a machine will be available.
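The idea of treating refined slovar entries as individual features can be sketched as follows. This is a minimal Python illustration, not the ernad implementation. It assumes that the refined slovar holds one expression per line, that an expression matches when its words appear consecutively in the paper's text, and that a matched expression is recorded by joining its words with underscores. All function names are hypothetical.

```python
from collections import Counter

def load_slovar(lines):
    """Read multiword expressions, assuming one expression per line."""
    slovar = set()
    for line in lines:
        tokens = tuple(line.lower().split())
        if len(tokens) > 1:  # single words are already ordinary features
            slovar.add(tokens)
    return slovar

def multiword_features(tokens, slovar):
    """Count single words plus every slovar expression found in the tokens."""
    counts = Counter(tokens)
    longest = max((len(entry) for entry in slovar), default=0)
    for n in range(2, longest + 1):
        for start in range(len(tokens) - n + 1):
            gram = tuple(tokens[start:start + n])
            if gram in slovar:
                counts["_".join(gram)] += 1
    return counts

if __name__ == "__main__":
    slovar = load_slovar(["monetary policy", "Exchange rate regime"])
    text = "monetary policy and exchange rate regime under monetary policy rules"
    for feature, weight in multiword_features(text.split(), slovar).items():
        print(feature, weight)
```

In this sketch the single words are kept alongside the multiword features. Whether the second presorting run should keep or drop them is a design choice that the sketch does not settle.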

Student dissertation

The title will be "Learning subjects with multi-word features".

The dissertation will review the evidence on the improvement of presorting.

Initially, improvements in presorting will be measured with a simple criterion: the position of the last relevant paper, that is, the last paper included in the sent report, divided by the size of the allport. If measurements using this criterion are inconclusive, a more elaborate criterion that also takes account of the position of the other papers will be used.
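The simple criterion can be sketched as follows, in Python and for illustration only. The sketch assumes that the allport is available as a list of paper handles in presorted order, that the sent report is a collection of handles, and that positions are counted from one. The function name and the handles are hypothetical. Under this reading, a smaller value means that the relevant papers sit closer to the top of the allport.

```python
def last_relevant_ratio(allport, sent):
    """Position of the last allport paper that is in the sent report,
    divided by the size of the allport. Returns 0.0 if no paper was sent."""
    if not allport:
        raise ValueError("the allport must contain at least one paper")
    sent = set(sent)
    last_position = 0
    for position, handle in enumerate(allport, start=1):
        if handle in sent:
            last_position = position
    return last_position / len(allport)

if __name__ == "__main__":
    allport = ["p1", "p2", "p3", "p4", "p5"]  # presorted order, hypothetical handles
    print(last_relevant_ratio(allport, ["p2", "p4"]))
```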

There should be some comments on general problems of multiword extraction (which are solved here). This includes a brief description of this paper and of some of the papers that cite it.
