Eltz paper

by Thomas Krichel
First written on 2005-07-21
Last changed on 2006-07-02

Introduction

This paper is the Eltz paper. It is an attempt to write some documentation on matters konz and rclis. At the moment, it is organized by machine. I deal with wotan first, then with fasolt.

Wotan

Since fasolt is a weak machine, constrained in both disk space and CPU, wotan has to become more supportive of konz.

Prepare a new version of dblp amf at wotan

step 1

First, make sure you have the latest version of ~/public_html/amf_konz from fasolt. This is the directory that stores files containing AMF metadata that gives information about full text files that have been found via konz. This can be done with the following command

rsync -va --delete arclis@fasolt:public_html/amf_konz/ ~/public_html/amf_konz

The -v flag here sets rsync into verbose mode, so it will inform you of any changed file or directory. You can leave that flag out and just write -a, or -qa to set it in quiet mode. The --delete flag requests that rsync delete files in the local ~/public_html/amf_konz directory that no longer exist on fasolt.
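
Because --delete removes local files, it can be reassuring to preview a mirror before running it. The following is a minimal sketch, not part of the konz scripts. The function names preview_mirror and mirror_dir are made up for illustration. It relies on rsync's -n (dry run) flag, which reports what would change without touching anything.

```shell
# Hypothetical helpers, not part of the konz scripts.
# $1 = source directory (local path or user@host:path), $2 = destination.

# Show what a mirror would change, without changing anything (-n = dry run).
preview_mirror() {
    rsync -van --delete "${1%/}/" "$2"
}

# Do the mirror for real: copy new and changed files, and delete
# destination files that no longer exist at the source.
mirror_dir() {
    rsync -a --delete "${1%/}/" "$2"
}
```

With these definitions, preview_mirror arclis@fasolt:public_html/amf_konz ~/public_html/amf_konz would list the files that the real command is going to copy or delete.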

step 2

Although this is not strictly necessary for the preparation of the metadata, you should also make sure you have the latest version of ~/public_html/papers_konz from fasolt. This is the data that contains the actual full-text files that DoCIS will be linking to. If the files are not complete, DoCIS users will experience 404 errors. To get the files, use the command

rsync -va arclis@fasolt:public_html/papers_konz/ ~/public_html/papers_konz

The -v flag here sets rsync into verbose mode, so it will inform you of any changed file or directory. You can leave that flag out and just write -a, or -qa to set it in quiet mode. Note that we do not use the --delete flag here. Spare files will do us no harm, since we have plenty of space on wotan.

Unfortunately, the operation will take time, because the bandwidth for outgoing traffic at fasolt is very small.

step 3

You have to get the DBLP data. Use

download_dblp

to download the DBLP data. This Perl script simply calls three shell commands. Each command is echoed on the screen

lynx -source http://dblp.uni-trier.de/xml/dblp.xml.gz > /home/arclis/dblp/var/dblp.xml.gz
rm /home/arclis/dblp/var/dblp.xml
gunzip /home/arclis/dblp/var/dblp.xml.gz

The first command may take a long time because DBLP is very large. If any of the three commands fails, there is no need to continue, unless, of course, you want to enter new konz data into an already existing DBLP data set. Note that the DBLP data is held as a single file in /home/arclis/dblp/var/dblp.xml.
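
The pattern of echoing each command and stopping at the first failure can be sketched in shell as follows. This is only an illustration of the idea, not the code of download_dblp, which is a Perl script. The function name run_steps is an assumption.

```shell
# Hypothetical helper, not the actual download_dblp script.
# Each argument is one shell command line. The command is echoed,
# then run; the first failure stops the whole sequence.
run_steps() {
    for cmd in "$@"; do
        echo "$cmd"
        sh -c "$cmd" || return 1
    done
}
```

One would call it with the three command lines shown above, each quoted as a single argument. Note that with this approach a plain rm stops the run when there is no old dblp.xml to remove; rm -f would avoid that.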

step 4

You have to make sure that you have the latest version of Izumi's dblp series data. This data used to be compiled on fasolt, but it should now be compiled on wotan. Your version, which you will be using to build DBLP, is ~/rclis/dbl/dbl.amf.xml. Check whether Izumi has made a new version of dbl.amf.xml at ~izumi/rclis/dbl/dbl.amf.xml. For example, you can issue the command

ls -l ~/rclis/dbl/dbl.amf.xml ~izumi/rclis/dbl/dbl.amf.xml

to see the times associated with both files. Or, you can look at the difference between the two files.

diff ~/rclis/dbl/dbl.amf.xml ~izumi/rclis/dbl/dbl.amf.xml

If there is a new version, test it with

channels_dblp ~izumi/rclis/dbl/dbl.amf.xml

Look at the error and output files. Ignore any messages about CORR. If there are no other problems, save the old version of the series file

cp ~/rclis/dbl/dbl.amf.xml ~/dblp/amf_backup.`date -I`

Note the backticks used here. They are on the extreme upper left of the American keyboard. date -I is the command to return the ISO date, and the shell replaces `date -I` with the output of this command.

Then copy Izumi's version to yours

cp ~izumi/rclis/dbl/dbl.amf.xml ~/rclis/dbl/dbl.amf.xml
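
The compare, back up and copy part of this step can be sketched as one shell function. This is a hypothetical helper, not an existing konz script, and it deliberately leaves out the channels_dblp test, which you should still run and read by hand before installing a new version. It uses date +%Y-%m-%d, which prints the same ISO date as date -I.

```shell
# Hypothetical helper, not one of the konz scripts.
# $1 = your series file, $2 = the candidate new version, $3 = backup directory
update_series() {
    mine=$1 theirs=$2 backup_dir=$3
    # cmp -s is silent and succeeds only if the files are byte-identical.
    if cmp -s "$mine" "$theirs"; then
        echo "no change"
        return 0
    fi
    # Keep the old version under a dated name, then install the new one.
    cp "$mine" "$backup_dir/amf_backup.$(date +%Y-%m-%d)" &&
    cp "$theirs" "$mine" &&
    echo "updated"
}
```

It would be called as update_series ~/rclis/dbl/dbl.amf.xml ~izumi/rclis/dbl/dbl.amf.xml ~/dblp.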

step 5

This step is optional, and in most cases you will skip it. You can run a complete check of Izumi's channel data against your current version of the data with

channels_dblp ~izumi/rclis/dbl/dbl.amf.xml ~/dblp/var/dblp.xml

Note that this is more a step for Izumi to deal with.

step 6

This is the main step. You can launch the converter

amf_from_dblp

and then you have a mug of coffee, and another mug of coffee, etc. It generates a complete run of AMF data from the ~/dblp/var/dblp.xml file. When this step is finished, the files and directories in the temporary area created by the process can be deleted. This can save a bit of disk space.

step 7

The konz project only works with the titles of the papers, not with the author names. Therefore, it uses a simple input file that just has the paper titles and the handles of the papers. To build that file, use

titlefile_konz project_name year

As you see, this command takes two arguments. The first is the name of the project, project_name. It is used as the file name for the title file. In fact, the title file is written into

~/var/titles/project_name.titles

The second argument is year, representing the year from which onwards papers should be included. This is useful, for example, for working only through recent files. Normally you take 1994 as the year, and the project name is rclis, thus you should use

titlefile_konz rclis 1994

Unfortunately, this script uses XML::XPath at this time and is therefore very slow.

step 8

In this step we copy the new information that we have just gathered onto fasolt. The copying of the title file is the important part.

rsync -va ~/var/titles/rclis.titles arclis@fasolt:var/titles/rclis.titles

Then you should also copy the new rclis:dbl archive to fasolt. This is not strictly necessary, but it does create a backup.

rsync -va ~/rclis/dbl/ arclis@fasolt:rclis/dbl

Note the slash at the end of the third element in the command line. It tells rsync to copy the contents of ~/rclis/dbl rather than the directory itself.

step 9

Relax now, you are done. Unless of course, you want to build a new version of DoCIS ;-).

fasolt

The best way to get an idea of what fasolt is doing is to look at the crontab ~/etc/crontab. These notes are just some additional explanations.

Konz work

Google searches

This is really stage 0.

~/perl/google_title_konz number

is the script that makes the Google searches. This script takes the maximum number of Google queries, number, as its argument. Note that inside the script, there is an important refresh parameter that sets when an old query should be considered stale. The age is determined by the file date of the dump file that reports on the query. Such dump files are held in ~/var/google.
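
The staleness test can be pictured with the following shell sketch. It is not taken from google_title_konz. The function name is_stale, and the choice to express the refresh parameter in days, are assumptions made for illustration.

```shell
# Hypothetical sketch; the real check lives inside the Perl script.
# $1 = dump file for a query, $2 = refresh age in days (assumed unit)
is_stale() {
    # A query that has never been dumped needs to be run.
    if [ ! -e "$1" ]; then
        return 0
    fi
    # find prints the file name only if the file was last modified
    # more than $2 days ago (find counts whole 24-hour periods).
    [ -n "$(find "$1" -prune -mtime +"$2")" ]
}
```

A query would then be sent to Google again only when is_stale succeeds for its dump file.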

The ~/var/google directory must be backed up. It is backed up via a crontab entry to arclis@wotan. Currently the rsync does not have the --delete flag set.

In order not to have to search the entire ~/var/google directory structure for papers, a list file is built with ~/perl/list_google_konz. This file is kept in ~/var/find/var%3Agoogle. It is a simple text file, and it is used in the next stage.
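
The idea of a list file is simply to run the expensive directory walk once and let later stages read its result. A minimal shell sketch follows. It is an illustration only, not the actual list_google_konz, and the name build_list is made up.

```shell
# Hypothetical sketch of a list builder; not the actual list_google_konz.
# $1 = directory to scan, $2 = list file to write
build_list() {
    # Write to a temporary name first, so that a reader never sees a
    # half-written list, then move it into place.
    find "$1" -type f > "$2.tmp" && mv "$2.tmp" "$2"
}
```

The list then has one file name per line, which is cheap to read even when the directory tree is large.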

Stage 1

Stage 1 is the downloading of pages that are found in Google.

~/perl/stage1_konz number

Here, the number refers to the number of papers to deal with.

This stage makes heavy use of the file ~/etc/konz.conf. Especially important is <redsites>, which has a list of sites where nothing is to be found. It is best not to download from such sites, as they may require a lot of space.

Stage 1 makes a decision about the type of file. If it looks like HTML, it is put into ~/public_html/cache_konz. Note that currently the distinction is made on file names, not on MIME types. In order not to have to search the entire ~/public_html/cache_konz directory structure for papers, a list file is built with ~/perl/list_cache_konz. This file is kept in ~/var/find/cache%3Akonz. It is a simple text file, and it is used in stage 2.

If the file is full-text, its doroga (path into konz) is determined. If the title of the paper matches the title of the Google hit, the doroga is "goti". If it does not, the doroga is "ligo". The document is saved in the appropriate sub-directory of public_html/store_konz.
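
The choice between the two dorogas can be sketched as follows. How the real stage 1 decides that two titles match is not described here, so the comparison below (ignoring case and extra whitespace) is purely an assumption, as are the function names.

```shell
# Hypothetical sketch; the real stage 1 is a Perl script and may
# compare titles differently.

# Lower-case a title and collapse runs of whitespace, so that trivial
# differences in spacing and capitalization do not prevent a match.
normalize_title() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' |
        tr -s '[:space:]' ' ' | sed 's/^ //; s/ $//'
}

# $1 = title of the paper, $2 = title of the Google hit
doroga_for_hit() {
    if [ "$(normalize_title "$1")" = "$(normalize_title "$2")" ]; then
        echo goti
    else
        echo ligo
    fi
}
```

The printed doroga would then name the sub-directory of public_html/store_konz in which the document is saved.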

Stage 2

Stage 2 is the downloading of futlis, that is, links to full texts, found in HTML pages in ~/public_html/cache_konz.

~/perl/stage2_konz number

Here, the number refers to the maximum number of wall clock minutes the script should run. After dealing with a paper, the process checks whether it is over the maximum time, and dies if this is the case.
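
The time limit works per paper: a paper that has been started is always finished, and the clock is only looked at afterwards. A minimal shell sketch of such a loop follows. It is not the code of the stage scripts, which are written in Perl. The names run_for_minutes and process_paper are hypothetical, and the sketch stops as soon as the limit is reached.

```shell
# Hypothetical stand-in for the real work done on one paper.
process_paper() {
    echo "processed $1"
}

# $1 = maximum number of wall clock minutes; papers are read from
# standard input, one per line.
run_for_minutes() {
    limit_seconds=$(( $1 * 60 ))
    start=$(date +%s)
    while IFS= read -r paper; do
        process_paper "$paper"
        # Check the clock only after a paper has been dealt with.
        now=$(date +%s)
        if [ $(( now - start )) -ge "$limit_seconds" ]; then
            echo "time limit reached, stopping" >&2
            return 0
        fi
    done
}
```

This also shows why a single slow paper can push a run well past its nominal limit: nothing interrupts the work in progress.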

The script reads HTML files out of the public_html/cache_konz and looks for links to the full-texts. These are called futlis.

It has a special parsing function for arXiv.org, which is the doroga "arxi", and another for CiteSeer, which is the doroga "cilt".

For others, it reads the entire HTML page. It can find a futli in an anchor. This is the doroga "tian". It can find a single futli in the page. This is the doroga "sisl". Or it can find multiple futlis. This is the doroga "musl". In that case, all are downloaded. I know this is inefficient, and costs a lot of disk space. Note that it is because of musl futlis that stage 2 for one paper can take a very long time.

In order not to have to search the entire ~/public_html/store_konz directory structure for papers, a list file is built with ~/perl/list_store_konz. This file is kept in ~/var/find/store%3Akonz. It is a simple text file, and it is used in stage 3.

To conserve space, and to avoid scanning the same files all over again, scanned HTML files are nullified. That means they are replaced with a file of the same name but zero size. Note that they are not deleted. If they were deleted, they would be downloaded again by a later run of stage 1.
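
In shell terms, nullifying a file amounts to truncating it in place. The sketch below only illustrates the idea; the function names are made up, and the konz scripts do this in Perl.

```shell
# Hypothetical sketch of nullifying; not the konz code itself.
nullify() {
    # Truncate the file to zero bytes; the name stays in place.
    : > "$1"
}

# A file still needs scanning only if it exists and is not empty.
needs_scan() {
    [ -s "$1" ]
}
```

Because the empty file keeps its name, a check for the file's existence still succeeds, while a check for content tells a scanner to skip it.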

Stage 3

Stage 3 is the examination of full-text files (fuffis) found in ~/public_html/store_konz. It is invoked as

~/perl/stage3_konz number

Here, the number refers to the maximum number of wall clock minutes the script should run. After dealing with a paper, the process checks whether it is over the maximum time, and dies if this is the case.

The converted text is saved in ~/public_html/text_konz, but for reasons of space economy it is destroyed quickly.

The default engine to convert to text is ps2text. This engine calls ghostscript. This is very resource intensive. There are other engines used. If the fuffi has the title close to the text top, it is saved in ~/public_html/papers_konz.
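
What "close to the text top" means exactly is decided inside the stage 3 script and is not spelled out here. As a rough illustration only, one could look for the title in the first few lines of the converted text. The function name, the default of 20 lines, and the case-insensitive match are all assumptions.

```shell
# Hypothetical sketch; the real test in stage 3 may work differently.
# $1 = converted text file, $2 = paper title,
# $3 = number of lines from the top to search (default 20, arbitrary)
title_near_top() {
    # -F treats the title as a fixed string, -i ignores case,
    # -q only sets the exit status.
    head -n "${3:-20}" "$1" | grep -qiF -- "$2"
}
```

A check of this kind misses titles that are broken across several lines in the converted text, so it should be read as a starting point rather than a faithful copy of the real test.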

Stage 4

This stage is reserved for future scripts recognizing HTML files as fuffis.

Stage 5

This stage removes potential duplicates from ~/public_html/papers_konz to save disk space. It should be used with caution.
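
How stage 5 recognizes a potential duplicate is not described here. One cautious way to look for candidates, shown purely as an assumption, is to list files that are byte-identical to an earlier file, without removing anything. The sketch uses cksum, which prints a checksum, a size and a file name, and it assumes file names without spaces.

```shell
# Hypothetical sketch; it only lists candidates and deletes nothing.
# $1 = directory to examine
list_identical() {
    find "$1" -type f -exec cksum {} + |
        LC_ALL=C sort |
        awk '{ key = $1 " " $2 }
             key == prev { print $3 }
             { prev = key }'
}
```

For each group of identical files, every file except the first in sort order is printed. Reviewing that list by hand before deleting anything is in keeping with the caution advised above.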

Stage 6

This stage creates AMF file snippets for use with ~/perl/amf_from_dblp out of ~/public_html/papers_konz. It is invoked as

~/perl/stage3_konz number

It does not use a list file; instead it reads the entire directory with /usr/bin/find. It is not called by a crontab entry. But I am curious enough about how many papers we have found that I go there and run it, sometimes several times a day. It is good to feel growth.

backups

Backing up fasolt is difficult for two reasons. First, fasolt lives at my home, and the upload bandwidth that I have there is very small. Second, rsync, when running, creates a lot of I/O activity. It could be that such I/O ruins disks. I don't know, but it just may be. Therefore only selected stuff is being backed up. Refer to ~/etc/crontab.