Turin paper

by Thomas Krichel

Status

This is the Turin paper. It describes an extension of the presorting in ernad. The paper has to be read in conjunction with the Altai paper.

This is the version of 2008-05-24.

Introduction

The current version of ernad uses feature extraction, as noted in the Altai paper.

The svm_light data is created roughly as follows. The abstracts, author names, classifications, keywords, series and titles are used. The values of these fields are concatenated and all punctuation is removed. Then all values are lowercased. The resulting words form the list of features. We count the number of appearances of each feature within a paper as its weight. We then normalize the feature vector of every paper to a Euclidean length of one. Finally, this data is written out according to the syntax rules of svm_light.

There are no multiword features.
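
The preparation steps described above can be sketched as follows. This is a minimal illustration in Python, not the code used in ernad. The field names, the whitespace tokenization, the on-the-fly numbering of features and the function names extract_features and to_svm_light are all assumptions made for the example. The output line follows the svm_light convention of a target value followed by feature:value pairs with increasing feature numbers.

```python
import math
import string
from collections import Counter

# Hypothetical field names; the real record structure in ernad may differ.
FIELDS = ("abstract", "authors", "classification", "keywords", "series", "title")

# Translation table that deletes every ASCII punctuation character.
STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def extract_features(paper):
    """Return a word -> weight mapping with Euclidean length one."""
    # Concatenate the field values (joined by spaces, an assumption).
    text = " ".join(paper.get(field, "") for field in FIELDS)
    # Remove punctuation, lowercase, and split on whitespace.
    words = text.translate(STRIP_PUNCTUATION).lower().split()
    # The weight of a feature is its number of appearances in the paper.
    counts = Counter(words)
    norm = math.sqrt(sum(count * count for count in counts.values()))
    if norm == 0:
        return {}
    return {word: count / norm for word, count in counts.items()}


def to_svm_light(label, weights, vocabulary):
    """Format one paper as an svm_light line.

    vocabulary maps each feature to a positive integer and is extended
    in place when a new feature is seen.
    """
    for word in weights:
        vocabulary.setdefault(word, len(vocabulary) + 1)
    # svm_light expects feature numbers in increasing order.
    pairs = sorted((vocabulary[word], value) for word, value in weights.items())
    return f"{label:+d} " + " ".join(f"{number}:{value:.6f}" for number, value in pairs)


if __name__ == "__main__":
    vocabulary = {}
    paper = {"title": "Money, money!", "abstract": "Demand for money"}
    print(to_svm_light(1, extract_features(paper), vocabulary))
```

The same vocabulary dictionary has to be shared across all papers of a report, so that a given word always receives the same feature number.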

The Turin project:

introduces multiword features into a new version of presorting.pl;
changes the preprocessing code so that it uses libSVM;
and tests both changes to evaluate how much they improve presorting.

Introducing multiword features

The project software renew_raw_slovar_file collects the expressions found in the <keywords> of the entire RePEc dataset. The result is a file called the raw slovar. It is kept in ~/ernad/opt/var/slovar.

The project software refine_slovar preprocesses the raw slovar into a file called the refined slovar. It is kept in ~/ernad/ernad/opt/var/slovar.txt. The project will test various ways of doing this refinement.

A trivial extension to the ernad presorting.pl treats each expression of the refined slovar as an individual feature, just like a single word. We introduce this extension as an addition to the presorting process. After normal presorting has run, the rifs are presorted again, with three changes:

libSVM files are stored in ernad/var/reports/report/opt/svm.
resulting presorted rifs are stored in ernad/var/reports/report/source/ns.
and, of course, multiword features are included.
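
The idea of treating slovar expressions as individual features can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes that the refined slovar holds one expression per line, that expressions are matched against the cleaned, lowercased word list of a paper, and that a matched expression is named by joining its words with underscores. The function names load_slovar and multiword_counts are hypothetical.

```python
from collections import Counter


def load_slovar(lines):
    """Turn slovar lines into a set of word tuples, keeping multiword entries only."""
    expressions = set()
    for line in lines:
        words = tuple(line.lower().split())
        if len(words) > 1:
            expressions.add(words)
    return expressions


def multiword_counts(words, expressions):
    """Count how often each slovar expression occurs in a list of words."""
    longest = max((len(expression) for expression in expressions), default=0)
    counts = Counter()
    for start in range(len(words)):
        for length in range(2, longest + 1):
            candidate = tuple(words[start:start + length])
            if len(candidate) < length:
                break  # ran past the end of the word list
            if candidate in expressions:
                counts["_".join(candidate)] += 1
    return counts


if __name__ == "__main__":
    slovar = load_slovar(["Monetary policy", "inflation", "real business cycle"])
    words = "the real business cycle and monetary policy shape monetary policy debates".split()
    print(multiword_counts(words, slovar))
```

In this sketch the multiword counts would be added to the single-word counts of a paper before the weights are normalized, so that single words and expressions share one feature vector.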

As a result, Presorting.pm will take at least twice as long to run.

We hope that the calculations can be done on a separate machine, but the availability of such a machine cannot be guaranteed.

Student dissertation

The title will be "Learning subjects with multi-word features".

The dissertation will review the evidence on the improvement of presorting.

Initially, improvements in presorting are measured with a simple criterion: the position of the last relevant paper, that is, the last paper included in the sent report, divided by the size of the allport. If measurements using this criterion are inconclusive, a more elaborate criterion that also takes account of the positions of the other papers will be used.
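
The simple criterion can be written down in a few lines. The sketch below assumes that the allport is available as the presorted list of paper identifiers, that the sent report is the list of identifiers the editor kept, and that positions are counted from one. The function name last_relevant_position_ratio is hypothetical.

```python
def last_relevant_position_ratio(allport, sent_report):
    """Position of the last selected paper divided by the size of the allport."""
    if not allport:
        raise ValueError("the allport is empty")
    selected = set(sent_report)
    last_position = 0
    for position, paper in enumerate(allport, start=1):
        if paper in selected:
            last_position = position
    if last_position == 0:
        raise ValueError("no paper of the sent report appears in the allport")
    return last_position / len(allport)


if __name__ == "__main__":
    allport = ["p1", "p2", "p3", "p4", "p5"]
    sent_report = ["p2", "p4"]
    print(last_relevant_position_ratio(allport, sent_report))
```

Under these assumptions a lower value is better: if presorting places every selected paper at the top, the ratio equals the number of selected papers divided by the size of the allport.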

There should be some comments on general problems of multiword extraction (which are solved here). This includes a brief description of this paper and of some of the papers that cite it:

Parisut Jitpakdee, Worapoj Kreesuradej, Dimensionality reduction of features for text categorization, Proceedings of the third conference on IASTED International Conference: Advances in Computer Science and Technology, p.506-509, April 02-04, 2007, Phuket, Thailand
Liang Chen, Naoyuki Tokuda, Akira Nagai, A new differential LSI space-based probabilistic document classifier, Information Processing Letters, v.88 n.5, p.203-212, 16 December 2003
Trevor Mansuy, Robert J. Hilderman, A characterization of wordnet features in Boolean models for text classification, Proceedings of the fifth Australasian conference on Data mining and analystics, p.103-109, November 29-30, 2006, Sydney, Australia
Francois Paradis, Jian-Yun Nie, Contextual feature selection for text classification, Information Processing and Management: an International Journal, v.43 n.2, p.344-352, March 2007
Shi-Zhu Liu, He-Ping Hu, Text classification using sentential frequent itemsets, Journal of Computer Science and Technology, v.22 n.2, p.334-337, March 2007
Laurence Hirsch, Robin Hirsch, Masoud Saeedi, Evolving Lucene search queries for text classification, Proceedings of the 9th annual conference on Genetic and evolutionary computation, July 07-11, 2007, London, England
Roman Tesar, Vaclav Strnad, Karel Jezek, Massimo Poesio, Extending the single words-based document model: a comparison of bigrams and 2-itemsets, Proceedings of the 2006 ACM symposium on Document engineering, October 10-13, 2006, Amsterdam, The Netherlands
Güneş Erkan, Language model-based document clustering using random walks, Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, p.479-486, June 04-09, 2006, New York, New York
Jun'Ichi Kazama, Jun'Ichi Tsujii, Maximum Entropy Models with Inequality Constraints: A Case Study on Text Categorization, Machine Learning, v.60 n.1-3, p.159-194, September 2005