Altai paper
- Status
- This paper documents the ernad software.
-
Thomas Krichel
started work on this paper in Novosibirsk, Russia in June
2004. This is the version of 2009–01–31.
-
The development of the implementation of the software described
here is funded by the Open Library Society. The software is
written by Roman Shapiro.
- Introduction
-
ernad stands for editing reports
on new academic documents.
-
ernad is written in Perl.
It is built for Debian GNU/Linux systems.
-
For the web interface, it uses mod_perl under Apache
2. Since mod_perl can not live with suexec, running ernad
requires an Apache installation that runs as the user ernad.
-
A version of ernad running as a FastCGI application with lighttpd would
be desirable.
-
The home directory of an ernad installation is the "ernad"
subdirectory of the home of the user "ernad". Typically, it
will be /home/ernad/ernad. This directory is
referred to as home in the following.
-
ernad has been written with the Mailman email list software
in mind. It has no way to interact with the Mailman server;
all it does is send mail to it. It is therefore
possible to use a different list server software. All that
is required of that software is that the domain name of the
homepage of each list is the domain of the list email
address, and that the last non-empty slash-delimited
component is equal to the local part of the address.
Thus if the list URL is http://foo.org/list/bar,
mail to it must go to bar@foo.org. If that is not the
case, ernad in its current state can not deal with it.
- Representational issues
-
An ernad service uses four domains. These
are called the composition domain, the delivery
domain, the list domain and the service
domain.
-
The composition domain is used by the report composition
interface. This domain is specified in the Apache
configuration. It is the domain of the virtual host that
ernad uses. The Apache configuration lives in
home/etc/apache.
-
The service domain is where users of an ernad
installation access its service. It is fixed in
home/etc/reports.amf.xml, as
reported below.
-
The list domain is the domain where the mailing lists
that the service uses are kept. This domain is found in
home/etc/reports.amf.xml, as
reported below.
-
The delivery domain is fixed as the "delivery_domain"
parameter in home/etc/ernad.conf.
It is described later.
-
An ernad service is configured with a configuration file
that lives in home/etc/reports.amf.xml. This is
an XML file written in AMF.
-
The service is represented as an AMF collection noun,
referred to as the "service noun" in the following.
In XML parlance, this is the single child of the
root element of
the XML contained in reports.amf.xml.
The service
noun has a <homepage> adjective; in XPath
notation, /amf/collection/homepage.
The value of this
adjective is a URL. The domain part of that URL
determines the service domain.
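As an illustration, the service domain could be extracted from
reports.amf.xml as follows. This is a Python sketch, not the actual
ernad code (which is written in Perl); the AMF namespace URI is taken
from the example configuration below.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

AMF = "{http://amf.openlib.org}"

def service_domain(xml_text):
    """Return the domain part of /amf/collection/homepage."""
    root = ET.fromstring(xml_text)
    homepage = root.find(f"{AMF}collection/{AMF}homepage").text
    return urlparse(homepage).netloc

sample = """<amf xmlns="http://amf.openlib.org">
  <collection id="RePEc:nep">
    <homepage>http://nep.repec.org</homepage>
  </collection>
</amf>"""
```

Here service_domain(sample) yields "nep.repec.org".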
-
An ernad service is composed of several reports.
Each report has its own report_id. The
report_id is the same as the name of the mailing list used
to circulate the report. Thus report_id@list_domain
is the mailing list address to which issues of the report
with that report_id are sent.
-
The reports are represented by AMF collection
nouns. They are <haspart> children of the service noun. The first
one listed is the "all" report, henceforth
referred to as the "allport". It operates
differently from the other reports.
The editor of the allport is called the general editor.
To distinguish the allport from other
reports, the other reports are known as
subject reports.
-
The report_id of a report is distinct from
the id attribute of the report noun.
-
Each report noun has a <homepage> adjective that gives the
Mailman homepage URL of the list that circulates
the report. The last slash-delimited component (bar
a trailing slash)
of that URL is the report_id of the report.
From that Mailman
homepage URL, the address of the list is calculated:
ernad takes the domain name part of the homepage URL
as the domain of the email address, and the
last slash-delimited component of the URL as
the report_id. The report_id is also used as
the local part of the email address of the
report. As an example, if
<homepage>http://lists.repec.org/mailman/listinfo/nep-acc</homepage>,
the email to the report will be sent to nep-acc@lists.repec.org.
The report_id is "nep-acc". But the report handle
in the wider dataset that uses report data may be,
say, "RePEc:nep:nepacc".
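The derivation just described can be sketched as follows (Python for
illustration only; the real implementation is Perl):

```python
from urllib.parse import urlparse

def list_address(homepage):
    """Derive (report_id, list email) from a Mailman homepage URL:
    the last non-empty slash-delimited path component is both the
    report_id and the local part; the URL's domain is the mail domain."""
    parts = urlparse(homepage)
    report_id = parts.path.rstrip("/").split("/")[-1]
    return report_id, f"{report_id}@{parts.netloc}"
```

For example, list_address("http://lists.repec.org/mailman/listinfo/nep-acc")
returns ("nep-acc", "nep-acc@lists.repec.org").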
-
There is supposed to be a common domain name to
all the homepages of all reports, because there
is only one list domain. However, when reading
the home/etc/reports.amf.xml file, ernad
will assume that the
list domain is the domain associated with the first
report mentioned in home/etc/reports.amf.xml, i.e. the
list domain name of the allport.
- The issue emails are sent from the
account that ernad runs under. The sender's
address is ernad@ followed by the service domain. The
human part of the address is an ASCII version
of the editor's name. The English version
of the editor's name
is used, as specified in reports.amf.xml. Thus
every editor described in home/etc/reports.amf.xml
has to have an English version of the name.
Thus a typical From: header would be
From: Bernardo Batiz-Lazo <ernad@nep.repec.org>
Note that this address is consistent with the example
configuration below. The service domain is used
in the From: header, not the list domain. In
the Reply-to: header, the real email address
of the editor is used.
Thus a typical Reply-to: header would be
Reply-to: Bernardo Batiz-Lazo <batizb@bigbbl.uk>
- home/etc/reports.amf.xml
also contains report passwords,
in the <ernad:password> element of the
collection noun.
- To illustrate, here is an example
home/etc/reports.amf.xml configuration file:
<amf xmlns="http://amf.openlib.org" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd" xmlns:ernad="http://ernad.openlib.org">
<collection id="RePEc:nep">
<title>NEP: New Economics Papers</title>
<accesspoint>ftp://netec.mcc.ac.uk/pub/NetEc/WWW/NEP/amf/nep</accesspoint>
<homepage>http://nep.repec.org</homepage>
<haspart>
<collection id="RePEc:nep:nepall">
<title>All new papers</title>
<homepage>http://lists.repec.org/mailman/listinfo/nep-all</homepage>
<ernad:password></ernad:password>
<haseditor>
<person>
<name>Bernado Bátiz-Lazo</name>
<name xml:lang="en">Bernado Batiz-Lazo</name>
<homepage></homepage>
<email>general_editor@nep.repec.org</email>
<ispartof>
<organization>
<name>South Bank Business School</name>
<homepage>http://business.lbsu.ac.uk</homepage>
</organization>
</ispartof>
</person>
</haseditor>
</collection>
</haspart>
<haspart>
<collection id="RePEc:nep:nepacc">
<title>Accounting</title>
<homepage>http://lists.repec.org/mailman/listinfo/nep-acc</homepage>
<ernad:password></ernad:password>
<haseditor>
<person>
<name>Bernardo Bátiz-Lazo</name>
<name xml:lang="en">Bernardo Batiz-Lazo</name>
<homepage></homepage>
<email>batizb@bigbbl.uk</email>
</person>
</haseditor>
</collection>
</haspart>
<haspart>
<collection id="RePEc:nep:nepafr">
<title>Africa</title>
<homepage>http://lists.repec.org/mailman/listinfo/nep-afr</homepage>
<ernad:password></ernad:password>
<haseditor>
<person>
<name>Suzanne McCoskey</name>
<name xml:lang="en">Suzanne McCoskey</name>
<homepage></homepage>
<email>mccoskey@dadoo.net</email>
</person>
</haseditor>
</collection>
</haspart>
</collection>
</amf>
- Directory structure
The directory for monitoring new additions is
called the "directory to watch".
Only well-formed files with names ending in
.amf.xml are included in the gathering operation.
The directory to watch is defined in
home/etc/ernad.conf
with the parameter PapersDir.
-
home/etc has configuration files.
-
home/etc/reports.amf.xml is the
configuration file for the reports. It describes
the service that ernad implements.
-
home/etc/apache/ is
a copy of the "/etc/apache2" directory
of Debian, with configurations for
ernad.
-
home/etc/bin/ has a link
to the script that fires
up Apache at the operating
system level, in /etc/init.d/,
as well as the customized apache2ctl
script fired up by that operating-system-level
script.
-
home/etc/ernad.conf is the
general configuration file for the
software. The options set are:
UniversalPassword (string)
This is the password to log in to any report interface.
GeneralEditor (string)
This is the report_id of the allport.
DebugMode (0 or 1)
If DebugMode is 1, the final report issue
email is sent to the editor's personal email address.
Otherwise it is sent to the email address of the report.
DebugInfo (0 or 1)
If DebugInfo is 1, debug information is shown
while working with the report interface.
Links (string)
This is the delivery domain.
It must not start with a protocol name. That is,
"http://d.openlib.org" is wrong;
it must be "d.openlib.org".
PapersDir (string)
This is the full path of the directory to watch.
PapersFileRestrict (regular expression)
When home/perl/make_all.pl searches
the directory to watch, any filename that matches this Perl
regular expression is excluded from processing.
PapersIdRestrict (regular expression)
When "make_all.pl" or "update_futli.pl" parses a paper
with an id that matches this Perl regular expression,
it is excluded.
GESendMail (0 or 1)
If GESendMail is 1, the issue of the allport is
sent to the general editor.
ExperienceLength (string)
This gives the default experience length, as
yyyy-mm-dd.
Thus 0001-02-23 is one year, two months, and 23 days. Months
are counted as 30 days.
FutliCheckedDir (string)
This gives the directory for checked futlis. A "futli"
is a full-text link.
RSSDir (string)
This gives the directory the RSS file
is written into.
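To make the ExperienceLength arithmetic concrete, here is a sketch
(Python, illustrative only). Months count as 30 days, as stated above;
treating a year as 12 such months (360 days) is an assumption, made
consistent with the later remark that 0001-01-00 excludes reports older
than 13 months.

```python
from datetime import date, timedelta

def experience_days(length):
    """Convert an ExperienceLength value (yyyy-mm-dd read as a
    duration) into a number of days. Months are 30 days (documented);
    a year counted as 360 days is an assumption."""
    y, m, d = (int(x) for x in length.split("-"))
    return y * 360 + m * 30 + d

def experience_start(today, length):
    """Current date minus the experience length."""
    return today - timedelta(days=experience_days(length))
```

So "0001-02-23" amounts to 443 days, and an experience length of one
month ("0000-01-00") from 2009-01-31 gives an experience start of
2009-01-01.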
-
home/xsl/ contains the following style sheets
-
home/xsl/amf2html.xsl transforms a report in xml form to html.
-
home/xsl/amf2text.xsl transforms a report in xml form to text.
-
home/xsl/rhtml.xsl transforms a single paper node to html.
-
home/xsl/rtext.xsl transforms a single paper node to text.
-
home/xsl/futli.xsl is applied to xml containing
a paper's full text links, when there is more than one of
them.
-
home/xsl/no_futli.xsl is applied to xml containing
a paper's full text links, when there are no futlis at
all.
-
home/xsl/single_futli_direction.xsl is
applied to xml containing a paper's full text links,
when there is only one futli. If this xsl file does not
exist, the n script simply redirects to that futli;
the file is optional.
-
home/var has the report issue
files, in the structure as discussed below.
-
home/style contains the following documents:
-
ernad_css.xml
- ernad CSS data, wrapped in a <css> XML element.
login.xsl.xml
- login screen
issue.xsl.xml
- issue selection screen
papers.xsl.xml
- papers selection screen
sorting.xsl.xml
- sorting screen
final_html_mail.xsl.xml
- final html version as mailed
final_html_html.xsl.xml
- final html version as seen by the editor
final_text_text.xsl.xml
- final text version, first stage
final_text_html.xsl.xml
- HTML version of final text, for editor preview
All these files accept the external
text parameters "report" and "password".
-
In the text generation stage, we have the problem that
XSLT can not break lines. First,
final_text_text.xsl.xml
generates a text string $text. Then the Perl instruction
$text =~ s/!START!(.*?)!END!/FT( $1 )/ges;
is used to split the text into lines,
where FT() is a function defined in Final.pm.
The result is called the mail string. It is used for creating
the email. To
enable a preview of the mail in ernad, the mail string is wrapped
into a <pre>
element. Then
final_text_html.xsl.xml
is applied to that
LibXML object.
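The same substitution can be sketched in Python (illustrative; the
actual behaviour of FT() in Final.pm is not documented here, so
fixed-width wrapping stands in for it as an assumption):

```python
import re
import textwrap

def ft(fragment):
    """Stand-in for FT() in Final.pm: break one marked fragment
    into lines. Wrapping at 72 columns is an assumption."""
    return textwrap.fill(fragment, width=72)

def break_lines(text):
    # Mirrors the Perl s/!START!(.*?)!END!/FT( $1 )/ges:
    # /s lets "." cross newlines, /g replaces every occurrence,
    # /e evaluates the replacement as code.
    return re.sub(r"!START!(.*?)!END!", lambda m: ft(m.group(1)),
                  text, flags=re.S)
```

Text outside the !START!/!END! markers passes through unchanged.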
-
There are some further exceptions to the rule that
HTML and textual output of ernad is created through
XSLT. First, dates of the form yyyy-mm and
yyyy-mm-dd are translated to
yyyy–mm and yyyy–mm–dd, respectively.
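A sketch of that date translation (Python, illustrative only):

```python
import re

def endash_dates(text):
    """Replace the hyphens inside yyyy-mm and yyyy-mm-dd dates
    with en-dashes, leaving other hyphens (as in "nep-acc") alone."""
    return re.sub(r"\b(\d{4})-(\d{2})(?:-(\d{2}))?\b",
                  lambda m: "\u2013".join(p for p in m.groups() if p),
                  text)
```

Only four-digit-year date patterns are touched, so report_ids such as
"nep-acc" are left as they are.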
-
Ernad generates a couple of so-called memory files, the
selectedfile and the weightsfile.
- RIF structure
-
The RIF is the report issue file. It is an
AMF XML file that contains the data of a report.
-
The RIF copies the relevant report information
from home/etc/reports.amf.xml,
but only those elements that are in the AMF
namespace. Thus the password is excluded, because
it is in the ernad namespace. In
addition, the collection gets an id
that is the id
of the report
within the wider dataset (not the report_id),
with a colon and
the issue date of the report appended.
The information gathered from home/etc/reports.amf.xml
is known as the "header" of the RIF. The
texts appearing in the report issue
form the "body" of the RIF.
Texts appearing in the report appear as in the
AMF files that are in the document stock. However,
the id
attribute of the noun is replaced
by a ref
attribute.
-
Here is an extract out of an example
RIF.
<collection id="RePEc:nep:nepacc:2003-02-04">
<title>Accounting</title>
<homepage>http://lists.repec.org/mailman/listinfo/nep-acc</homepage>
<haseditor>
<person>
<name>Bernardo Bátiz-Lazo</name>
<name xml:lang="en">Bernardo Batiz-Lazo</name>
<homepage></homepage>
<email>batizb@bigbbl.uk</email>
</person>
</haseditor>
<!-- here ends the header part of the report -->
<haspart>
<text ref="RePEc:xfx:dufduf:9801">
<title>El estudio de la contabilidad en una comarca de Zamora</title>
<hasauthor>
<person>
<name>Alfonso del Cuesto</name>
...
</person>
</hasauthor>
</text>
</haspart>
</collection>
- Operational time line for subject report editors
-
A RIF, at any stage, has a name of the form
yyyy-mm-dd_unix, where
yyyy is the year of the corresponding all issue,
mm is the month of the all issue, dd is the
day of the all issue, and unix is the unix time stamp
(i.e. the number of seconds since 1970-01-01)
of the creation of the file.
RIF file names have the .amf.xml extension.
RIFs are
never modified. RIFs are almost never deleted.
Instead they are copied with a time stamp change.
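A sketch of this naming convention (Python, illustrative; not the
ernad code):

```python
import re

# yyyy-mm-dd_unix.amf.xml, as described above
RIF_NAME = re.compile(r"^(\d{4}-\d{2}-\d{2})_(\d+)\.amf\.xml$")

def parse_rif_name(name):
    """Split a RIF file name into (all-issue date, creation stamp)."""
    m = RIF_NAME.match(name)
    if m is None:
        raise ValueError("not a RIF name: " + name)
    return m.group(1), int(m.group(2))

def rif_name(issue_date, now):
    """Name a copied RIF gets: same issue date, fresh unix stamp."""
    return f"{issue_date}_{now}.amf.xml"
```

Copying "with time stamp change" then amounts to writing the same
contents under a name with the same date part but the current unix
stamp.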
-
There are three exceptions when a RIF is deleted.
- If an unsorted RIF is presorted, but a RIF with the
same date already exists in the directory
where the presorted issues live, the existing
RIF is deleted.
-
If make_all is run, but RIFs for that date
already exist, they are deleted.
-
When the RIFs for the allport are copied
to the subject report directories,
existing RIFs with the same date are deleted.
-
On the first screen, the editor
sees a list of available report issues to work
on. The system offers an issue to work
on only if pre-sorting for that issue has finished. That
means there are always two buttons for the
editor. One leads to the pre-sorted
RIF. The other one leads to the non-presorted
RIF.
-
To find out whether an issue still needs
working on, ernad checks whether there is a file
in the home/var/reports/report/sent
directory, where report is the report_id.
-
Thus, for a subject report report,
there is a directory report/source.
This directory has a "ps" and a "us" subdirectory.
If two RIFs can be found with the same report
date, though not necessarily the same
time stamps, one in each subdirectory, the editor
is invited to work on the issue, as long as
there is no corresponding RIF in the
home/var/reports/report/sent
directory.
-
An editor can delete an issue to work on. When
this is done, home/var/reports/report/created/us
and home/var/reports/report/created/ps are
not deleted. Instead an empty dummy sent RIF is created
in home/var/reports/report/sent.
-
If the editor
chooses to work with the pre-sorted version,
the RIF is copied, with a time stamp change,
from home/var/reports/report/source/ps.
Otherwise, if the editor chooses
to work with the unsorted version, the RIF
is copied, with a time stamp change,
from home/var/reports/report/source/us.
It is copied into
home/var/reports/report/created.
-
By default all papers are unselected. If the
editor has selected no papers, but she presses
the accept button, she is shown the selection
stage again. There a back button leads to
the issue selection stage. At the
issue selection stage, the editor can
delete the issue.
-
Once papers have been selected, a RIF
is copied, with a time stamp change,
into
home/var/reports/report/selected.
This RIF only contains the selected papers.
-
Once papers in the report issue have been sorted,
the RIF is copied with time stamp change
into home/var/reports/report/ordered.
-
Once the RIF is in report/ordered, the issue
can be previewed as HTML and as text.
-
Once the RIF has been previewed, the editor
can send the report. Under debugging, the
report is sent to the editor's
email account. Under full operation, the
email is sent to the report mailing
address.
- Operational time line for the allport
-
When the script home/ernad/perl/make_all.pl is launched,
the ernad software composes an initial RIF for
the allport. This is described in the next
section. To keep the workflow analogous,
the gathering script stores the allport issue
in the same way as a normal subject issue.
-
When this is finished, the allport
editor logs in like a subject editor.
-
On the first screen, the allport editor
sees a list of available report issues to work
on.
-
At the papers selection stage, the allport editor, by default, has every
paper selected. Once the allport editor has chosen the papers that
are to be included, the all report proceeds to sorting just like any
other report does.
-
Once the RIF is in home/var/reports/report/ordered,
the issue can be previewed as HTML and as text.
-
Once the RIF has been previewed, the allport editor
can send the allport issue. Under debugging, the
report is sent to the editor's
email account. Under real operation, the
email is sent to the all report mailing
address.
-
When the allport issue has been sent, the
general editor logs in to the ernad machine to
run home/perl/presorting.pl.
Copies of the allport RIF
are created, with the appropriate handle
changes, from
home/var/reports/allport/source/us
to the home/var/reports/report/source/us
directories for all subject reports. Then presorting.pl does
the presorting. It ends by sending an email to
every subject editor, saying that the
report is ready. The text of the message is hardcoded
in home/perl/presorting.pl.
- Full-text links gathering
-
Full-text link gathering is inspired by the work of
Sune Karlsson.
-
The full-text link (henceforth: futli)
in a text can take
two forms. In XPath notation, they are
"/amf/text@id/file/url" and
"/amf/text@id/hasversion/text/file/url". These
are called "direct" and "indirect" futlis, respectively.
Whether a futli is direct or indirect is
called the direction of the futli. In RePEc,
at this time, there are only direct futlis.
-
The database of all the full-text links
known at a time is called the futli base. For
each text, it contains a
vector of futlis for the text, as well
as their direction. It is held as a
Perl structure in the file
home/var/db/futli_base.dump.
- As part of home/perl/make_all.pl, ernad examines
files ending in ".amf.xml" in the directory to
watch. It records the values of the futlis, as
well as their direction,
and stores them in the futli base.
- The directory given in the
FutliCheckedDir parameter of home/etc/ernad.conf
is supposed to contain a list of ids of documents
that have working futlis. Only those documents will
be included in the date base.
-
When home/perl/make_all starts, it fires up a script
update_checked_futli. Although not strictly
necessary, it stores the result in
home/var/db/checked_futli.dump as a dumped Perl
hash.
- Gathering operation
- The script to create an all issue is
home/perl/make_all.pl. It takes an optional argument
of the form yyyy-mm-dd.
It checks if this is a valid date. If it
is a valid date, the date is called the search
date. If it is not a valid date, the search
date is 1970-01-01.
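The argument handling can be sketched as (Python, illustrative; the
real script is Perl):

```python
from datetime import datetime

def search_date(arg=None):
    """Return the search date: the argument if it is a valid
    yyyy-mm-dd date, otherwise 1970-01-01."""
    if arg:
        try:
            datetime.strptime(arg, "%Y-%m-%d")
            return arg
        except ValueError:
            pass  # malformed or impossible date
    return "1970-01-01"
```

Note that an impossible date such as 2009-02-30 falls back to
1970-01-01 just as a missing argument does.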
- When make_all starts, it reads a Berkeley
database
of previously seen papers. This is at
home/var/db/papers_dates.db. It
stores paper handles and the dates at which
they were seen. This is known as the "date base".
-
The gathering operation reads every text record
found in the directory to watch, that is, in a
file ending in .amf.xml that has been changed
more recently than the search date.
- A text record is an
AMF <text> element with an id attribute. ernad
looks up the value of the id attribute in the
date base. If the value can be found, it moves on
to the next paper.
- When a text is found in the
watch directory but is not in the
date base, an XPath reference is passed
to the Perl library home/perl/pre-date_base.pl.
It checks the XML contents and returns 1 if the text
should be included in the all report, 0 if not.
If the answer is yes, the current date is
added as the value for the id of the candidate
paper in the date base.
-
As part of home/perl/pre-date_base.pl,
home/perl/make_all.pl examines the handles found
and compares them to the list of good handles.
-
If the paper has been added to the date base,
a second script, home/perl/post-date_base.pl,
does a similar check. The idea is that
the second check is not repeated every time;
it is performed only once. It checks
the XML contents and returns 1 if the text
should be included in the all report, 0 if not.
- Both home/perl/pre-date_base.pl
and home/perl/post-date_base.pl
take two parameters: first, the
filename where the AMF data is to be found,
and second, the XPath expression of the node
to be checked. Normally, this will be '/amf/text[1]'.
- Each implementor of ernad is free to build
their own home/ernad/perl/Ernad/pre-date_base.pm and
post-date_base.pm. RePEc's NEP service requires
- for home/ernad/perl/Ernad/pre-date_base.pm,
that the <type> of the text is
"preprint" and that the id does not
come from "RePEc:rus";
- for home/ernad/perl/Ernad/post-date_base.pm,
that the <date> of the text,
if there is one, is less than two years from
the current date.
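The two NEP checks can be sketched as follows (Python with
ElementTree for illustration; the real modules are the Perl files
named above, and the exact element access shown here is an
assumption):

```python
import xml.etree.ElementTree as ET
from datetime import date

AMF = "{http://amf.openlib.org}"

def pre_check(text_elem):
    """1 if <type> is "preprint" and the id is not from RePEc:rus."""
    if text_elem.findtext(AMF + "type") != "preprint":
        return 0
    if text_elem.get("id", "").startswith("RePEc:rus"):
        return 0
    return 1

def post_check(text_elem, today):
    """1 unless the <date>, if present, is two or more years old."""
    d = text_elem.findtext(AMF + "date")
    if d is None:
        return 1
    y, m, dd = (int(x) for x in d.split("-"))
    return 1 if date(y, m, dd) > today.replace(year=today.year - 2) else 0
```

A text without a <date> passes the post check, mirroring the "if
there is one" proviso above.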
-
The texts that remain at this point are included in
the allport issue.
- Futli downloads
Futli downloads are inspired by the work of
José Manuel Barrueco Cruz.
-
Futli downloads are likely to take place on a
different Apache installation from the ernad installation.
The domain of that other Apache installation is the
so-called delivery domain. Differences between the
report composition domain and the delivery domain
are ok because the download does not
require write operations.
-
The script to perform the download is
home/ernad/perl/n.
-
home/ernad/perl/n sends full texts to the users of
reports. It uses a file called the futli
base file, which contains the futli base. This
file is read into memory when Apache is started.
-
home/ernad/perl/n takes the "u" argument for the id of the
text to be
downloaded. It takes the "r" argument for the
report code. It takes an optional "s" argument. If
the "s" argument is present, the
script redirects to the (encoded)
URL that is the value of the "s" argument.
-
home/ernad/perl/n examines whether a futli exists for the
text to be
downloaded. If there is no futli,
an XML document
<nofutli>id</nofutli>
is formed, where id is
the id of the text that the
user sought to download.
The style sheet home/xsl/no_futli.xsl is
applied to it.
-
If there is one futli, and it is of direction
direction, and the file
home/xsl/single_futli_direction.xsl
exists, it is applied to the
XML document
<futli><direction>futli_URL</direction></futli>,
where futli_URL is the futli-URL
element.
If the style sheet can not be found, the
home/ernad/perl/n script sends a redirect to the futli.
-
A futli-URL element is of the form
<url>
<target>target</target>
<transport>transport</transport>
</url>
Here target is the URL of the futli and
transport is a URL of "n" invoking
itself, with an "s" argument and an "r" argument, to
redirect the user to the target URL.
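Building such a transport URL can be sketched as follows (Python,
illustrative; the exact query-key order is an assumption based on the
description above):

```python
from urllib.parse import quote

def transport_url(delivery_domain, report_id, target):
    """Build the URL with which "n" invokes itself to redirect to a
    futli target; the s argument carries the percent-encoded target."""
    return (f"http://{delivery_domain}/n"
            f"?r={quote(report_id, safe='')}"
            f"&s={quote(target, safe='')}")
```

Percent-encoding with no safe characters turns "http://a" into
"http%3A%2F%2Fa", matching the example below.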
-
If there are several futlis of direction
direction, and the file
futli.xsl
exists, it will be applied to an
analogous piece of XML data, where the
futli-URLs appear one after the other, but where
all URLs of the same direction are grouped.
Direct futlis appear in the XML document
before indirect futlis. If home/xsl/futli.xsl does
not exist, the script exits with an error.
-
Example: a text "m:p" has two futlis: "http://a", a
direct futli, and
"http://b", an indirect one. Assume that
d.openlib.org is the host of the futli downloads.
Then the XML file
<futli><direct>
<url><target>http://a</target>
<transport>http://d.openlib.org/n?s=http%3A%2F%2Fa</transport>
</url> </direct><indirect>
<url><target>http://b</target>
<transport>http://d.openlib.org/n?s=http%3A%2F%2Fb</transport>
</url>
</indirect>
</futli>
is formed.
- Presorting
Presorting uses an external program
called svm_light. There
are two executables, "svm_learn" and "svm_classify".
The location of these is given in the "SVM_lightDir"
variable set in home/etc/ernad.conf.
- The experience length is the maximum time
period for which available information
will be considered. The experience length
is configured in the variable "ExperienceLength"
in home/etc/ernad.conf. It
takes the usual date form yyyy-mm-dd, but it is
interpreted as a length of time. Thus a value
0001-01-00 requests that reports older
than 13 months be excluded from consideration
in training and testing.
-
The experience start is either the earliest
allport issue date available or the current
date minus the experience length, whichever
comes later.
-
Updates to the training, model, and test files are
only done when a new allport issue arrives.
They are done as part of home/perl/presorting.pl.
Never mind that when an editor has just
finished a report issue, the subject intelligence
would be helpful to pre-sort the next issue,
which she may prepare immediately after.
-
The svm training file for the allport
has lines of the form
"yyyy-mm-dd svm_light_data
# id\n" where
yyyy-mm-dd is the issue date of the
paper appearing in the allport, svm_light_data
is the data as required by svm_light, and
id is the id of the paper.
It lives in
home/var/reports/allport/svm_light/yyyy-mm-dd.train,
where yyyy-mm-dd is the date associated with
the allport issue, and allport is the
id of the allport. The svm training file
for the allport is not used to train anything. It
is only used as a library to generate the
training files for the subject reports.
When a new allport issue has been made,
this file is renewed with data from allport
issues from the experience start to the
date of the allport issue.
-
The svm_light data is approximately created as
follows. The abstracts, author names, classifications,
keywords, series and titles are used. The values
of these fields are concatenated and all punctuation is removed.
Then all values are converted to lowercase. This results in a
list of features. We count the number of appearances of each feature
within a paper as its weight. We normalise every paper to a
Euclidean length of one. Then this data is prepared according
to the syntax rules of svm_light.
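That preparation can be sketched as follows (Python, illustrative;
assigning feature ids through a shared vocabulary dictionary is an
assumption, since svm_light only requires stable integer feature ids):

```python
import math
import re
from collections import Counter

def features(fields):
    """Concatenate field values, strip punctuation, lowercase,
    and count each remaining word as a feature."""
    words = re.sub(r"[^\w\s]", " ", " ".join(fields)).lower().split()
    return Counter(words)

def svm_light_line(fields, vocabulary):
    """Emit one paper as "id:weight" pairs, normalised to unit
    Euclidean length, in svm_light's feature syntax."""
    counts = features(fields)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    pairs = sorted((vocabulary.setdefault(w, len(vocabulary) + 1), c / norm)
                   for w, c in counts.items())
    return " ".join(f"{i}:{v:.4f}" for i, v in pairs)
```

svm_light requires the feature ids within one line to be in
ascending order, hence the sort.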
-
The svm training file for a subject report
issue has lines of the form
"yn svm_light_data # id\n" where
yn takes the value "+1" or "-1", svm_light_data
is the data as required by svm_light, and
id is the id of the paper. It lives in
home/var/reports/report/svm_light/yyyy-mm-dd.train,
where yyyy-mm-dd is the date associated with
the report issue, and report is the
report_id of the report. When the allport training
file has been renewed, the svm training files
for subject reports are calculated by
home/perl/presorting.pl.
-
The calculation of the svm training file proceeds
as follows. The final RIFs for the report
that have an issue date between the current
date and the experience start are scanned
for handles of papers that have been included.
Then the svm training file for the allport
is scanned. Any paper that has a date before
the first issue date of the report is ignored.
Any paper that has a date after the
issue date of the last issue of the
report is ignored. All other papers that
have a handle that is found in a final
RIF of the report have their yn set to
"+1". All others have their yn set to "-1".
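The labelling step can be sketched as (Python, illustrative; the
line parsing assumes the file formats given above):

```python
def label_training_lines(allport_lines, included_handles,
                         first_issue, last_issue):
    """Turn allport training lines ("yyyy-mm-dd data # id") into
    subject-report lines ("+1 data # id" or "-1 data # id"),
    dropping papers dated outside [first_issue, last_issue]."""
    out = []
    for line in allport_lines:
        day, rest = line.split(" ", 1)
        if day < first_issue or day > last_issue:
            continue  # outside the report's issue range
        handle = line.rsplit("# ", 1)[1]
        yn = "+1" if handle in included_handles else "-1"
        out.append(f"{yn} {rest}")
    return out
```

String comparison suffices for the date test because yyyy-mm-dd
dates sort lexicographically.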
-
Once the calculation of the svm training file
is completed, the file
home/var/reports/report/svm_light/yyyy-mm-dd.train
is trained, and the model file is
home/var/reports/report/svm_light/yyyy-mm-dd.model.
-
The svm model file for a subject report issue lives in
home/var/reports/report/svm_light/yyyy-mm-dd.model,
where yyyy-mm-dd is the date associated with
the report issue, and report is the
report_id of the report. When a new svm training
file has been created, the svm model files
for subject reports are created immediately
after. The utility used is svm_learn.
-
The svm testing file for an allport issue
has lines of the form "svm_light_data # id\n" where
svm_light_data is the data as required by svm_light and
id is the id of the paper. It lives in
home/var/reports/allport/svm_light/yyyy-mm-dd.test,
where yyyy-mm-dd is the date associated with
the allport issue, and allport is the
name of the allport. The svm testing file contains
the lines of the svm training file for the
current allport issue, with the date in the
line removed. Thus it contains duplicate information
that is also available in the allport training
file. It is just there for ease of computation.
-
The svm classification file for a subject report
issue lives in home/var/reports/report/svm_light/yyyy-mm-dd.class,
where yyyy-mm-dd is the date associated with
the report issue, and report is the
report_id of the report. When a new svm model
file has been created, the svm classification files
for subject reports are created immediately
after. The utility used is svm_classify. The
classification file contains the classification
values of all the papers in the svm testing file
for the allport, in the order in which they appear
in the svm testing file for the allport
issue. The handles of the papers can be gleaned from
that file.
-
If the presorting operation does not work,
because the test file or model file does not
exist, the classification file can not be
created, or the svm_classify utility can not
be found or exits with an error,
the pre-sorted RIF will be the same as the non
pre-sorted RIF.
- Apache
-
There are (usually) two Apache servers: one is
called the main server, running on port 80, and
the other is called the hidden server. They
use separate infrastructures that avoid collisions.
-
The exception to this separateness are the logs;
they are still in /var/log/apache2.
-
home/etc/apache/ is a file to be
placed in /etc/init.d/. It makes sure
that a hidden server is fired up.
-
The main
server passes requests, via some virtual hosts
called the proxy virtual hosts, to virtual
hosts on the hidden server. The configuration
files for the proxy virtual hosts live in
home/etc/apache/sites-available.
In order to work, the main server points to
these files through links from its sites-enabled
directory, typically called
/etc/apache2/sites-enabled.
All these configurations do is pass traffic
to the hidden server.
-
The Apache root directory, containing all
the configurations for the hidden server,
lives in
home/etc/apache/.
-
The configuration
files for the virtual hosts of the
hidden server also live in
home/etc/apache/sites-available,
but contrary to the proxy virtual hosts,
they are symlinked from
home/etc/apache/sites-enabled.
-
To fire up an Apache2 server on a Debian machine,
/etc/init.d/apache2
is used. This
in turn fires up /usr/sbin/apache2ctl.
Both have to be customised. The customised
version of /etc/init.d/apache2
lives in /etc/init.d/apache2ernad;
a symlink from home/etc/apache/bin
is provided. The customised
version of /usr/sbin/apache2ctl
lives in home/bin/apache2ernad.
Appendix: some of the contents of the ernad software
amf/
has a link to the
AMF data that ernad is looking at for document data.
doc/
contains ernad documentation in HTML format.
etc/
contains the configuration files.
etc/apache
has apache2 configuration files
for the web server that is owned by the user ernad.
etc/ernad.conf
has the system configuration.
etc/reports.amf.xml
has the reports configuration.
html/
contains HTML files that are required
by the system.
html/no_issues.html
says
that there are no issues for the editor to edit.
html/no_more_issues.html
says the same thing after the editor has
finished working on an issue.
perl/
contains the Perl scripts and modules.
perl/Ernad
contains the Perl modules.
perl/Ernad/Checking.pm
implements
pre-datebase, post-datebase and check_futli functionality.
perl/Ernad/Common.pm
has common functions.
perl/Ernad/Contants.pm
contains constants like directory or file names.
perl/Ernad/Dates.pm
has functions to work with dates.
perl/Ernad/Final.pm
has functions to work with and generate
the html/text/send stage.
perl/Ernad/Generate.pm
has functions for generating the allport issue.
perl/Ernad/Globals.pm
contains functions for the startup. It is called when
the main apache starts up.
perl/Ernad/Issue.pm
contains functions
for the issue selection process.
perl/Ernad/Login.pm
contains functions for
the login to the system. This is
the first module that is activated when a user connects.
perl/Ernad/Main.pl
is the main module that determines the
current state and calls other modules.
perl/Ernad/Papers.pm
contains
functions for the paper selection stage.
perl/Ernad/ReportState.pm
contains functions
for sorting, selecting, and deleting papers in issues.
perl/Ernad/Sorting.pm
contains functions
for sorting papers in issues.
perl/delete_old_reports
is a utility that deletes all reports
with a certain date. The date has to be given as yyyy-mm-dd, thus:
delete_old_reports.pl 2005-02-28
perl/index.pl
is the file that is run when there is a
connection to the ernad site.
perl/list_reports.pl
is a utility that lists all reports.
perl/make_all.pl
is run at the command line by the general
editor to make the allport issue.
perl/n.pl
is the download script.
It is called as n in the browser; therefore there is a soft link
n
to it.
perl/presorting.pl
creates the presorted report.
perl/processreports.pl
creates directories
that are required by the report. It has to be run
every time a new report is added to reports.amf.xml
.
perl/startup_for_download.pl
is the mod_perl startup file
for the apache server that serves fulltext downloads.
perl/startup_for_main.pl
is the mod_perl startup
file for the main ernad server, i.e., the one owned
by the user ernad.
var/
has the output of the system.
var/db/
has databases.
var/db/futli_base.dump
is a Perl dump of the
futli base.
var/db/papers_dates.dump
is a Perl dump of the
date base.
var/db/reports
has the data files
for all reports.
Changes related to seabro
The Login.pm script creates a nucleus of the form
<ernad/>
. It is transformed with
login.xsl.xml.
The Issues.pm script creates a nucleus of the form
<ernad>
  <issues>
    <issue>yyyy-mm-dd</issue>
    <issue>yyyy-mm-dd</issue>
  </issues>
</ernad>
or, if there are no issues
<ernad>
  <report>report</report>
  <password>password</password>
  <issues/>
</ernad>
The parameters "report" and "password" are passed to every XSLT
file. The parameter "allport" is passed to the paper selection stage
XSLT file papers.xsl.xml. The "allport" parameter takes the value
"yes" if the report is the allport; in that case all checkboxes are
checked. It takes the value "no" if the report is not the allport;
in that case no checkboxes are checked.
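The parameter passing can be summarised in a small sketch. The function name and the dictionary representation are hypothetical; in ernad itself this happens in Perl when the XSLT is invoked.

```python
def xslt_params(report, password, allport_name="allport"):
    """Build the parameter set handed to the XSLT files.
    'report' and 'password' go to every stylesheet; 'allport'
    matters only to papers.xsl.xml, where 'yes' pre-checks
    every checkbox and 'no' leaves them all unchecked."""
    return {
        "report": report,
        "password": password,
        "allport": "yes" if report == allport_name else "no",
    }
```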
html/no_issues.html
and
html/no_more_issues.html
will become obsolete.
Roman's text format
XSLT capable of generating text has to format the text into
a special intermediate format known as "Roman's format".
It is of the form
!START!digits!contents!END!
Blank lines are to be kept blank. Other lines will have the text
contents formatted into lines that are at most
65 characters long, with each line indented by digits
of whitespace. If the start of contents is of the form
"digits." it is allowed to move into the indented space.
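A renderer for this format might be sketched as follows. This is an assumption-laden illustration, not Roman's actual code: the render_roman name is invented, the 65-character rule is read here as a maximum line width, and the special "digits." rule is not implemented.

```python
import re
import textwrap

def render_roman(text, width=65):
    """Render Roman's intermediate format. Each block looks like
    !START!<digits>!<contents>!END!  where <digits> is the indent
    depth. Blank lines separate paragraphs and are kept blank;
    other text is re-flowed to at most `width` characters and
    indented by <digits> spaces. (The rule that a leading
    "digits." label may move into the indent is omitted here.)"""
    out = []
    for block in re.finditer(r"!START!(\d+)!(.*?)!END!", text, re.S):
        indent = " " * int(block.group(1))
        for para in block.group(2).split("\n\n"):
            if out:
                out.append("")          # blank lines are kept blank
            out.extend(textwrap.wrap(" ".join(para.split()),
                                     width=width,
                                     initial_indent=indent,
                                     subsequent_indent=indent))
    return "\n".join(out)
```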