Altai paper

  1. Status
    1. This paper documents the ernad software.
    2. Thomas Krichel started work on this paper in Novosibirsk, Russia in June 2004. This is the version of 2009–01–31.
    3. The development of the implementation of the software described here is funded by the Open Library Society. The software is written by Roman Shapiro.
  2. Introduction
    1. ernad stands for editing reports on new academic documents.
    2. ernad is written in Perl. It is built for Debian GNU/Linux systems.
    3. For the web interface, it uses mod_perl under Apache 2. Since mod_perl cannot coexist with suexec, running ernad requires an Apache installation that runs as the user ernad.
    4. A version of ernad as a FastCGI application under lighttpd would be desirable.
    5. The home directory of an ernad installation is the "ernad" subdirectory of the home of the user "ernad". Typically, it will be /home/ernad/ernad. This directory is referred to as home in the following.
    6. ernad has been written with the Mailman email list software in mind. It has no way to interact with the Mailman server; all it does is send mail to it. It is therefore possible to use different list server software. All that is required of that software is that the domain name of the homepage of each list is the domain of the list email address, and that the last non-empty slash-delimited component of the homepage URL is equal to the local part of the address. Thus if the list URL is http://foo.org/list/bar, mail to it must go to bar@foo.org. If that is not the case, ernad in its current state cannot deal with it.
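      To illustrate the convention, here is a minimal Perl sketch; the helper name is hypothetical and not part of ernad:

        use URI;

        # derive the list email address from the list homepage URL,
        # following the convention described above
        sub list_address_from_homepage {
            my ($url) = @_;
            my $uri = URI->new($url);
            # the last non-empty slash-delimited path component
            # becomes the local part of the address
            my ($local) = grep { length } reverse split m{/}, $uri->path;
            return $local . '@' . $uri->host;
        }

        print list_address_from_homepage('http://foo.org/list/bar');
        # prints bar@foo.org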
  3. Representational issues
    1. An ernad service uses four domains. These are called the composition domain, the delivery domain, the list domain and the service domain.
      • The composition domain is used by the report composition interface. This domain is specified in the Apache configuration. It is the domain of the virtual host that ernad uses. The Apache configuration lives in home/etc/apache.
      • The service domain is where users of an ernad implementation access its service. It is fixed in home/etc/reports.amf.xml, as reported below.
      • The list domain is the domain where the mailing lists are kept that the service uses. This domain is found in home/etc/reports.amf.xml, as reported below.
      • The delivery domain is fixed as the "delivery_domain" parameter in home/etc/ernad.conf. It is described later.
    2. An ernad service is configured with a configuration file that lives in home/etc/reports.amf.xml. This is an XML file written in AMF.
    3. The service is represented as an AMF collection noun, referred to as the "service noun" in the following. In XML parlance, this is the single child of the root element of the XML contained in reports.amf.xml. The service noun has a <homepage> adjective; in XPath notation, /amf/collection/homepage. The value of this adjective is a URL. The domain part of that URL determines the service domain.
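      As a sketch, assuming the AMF elements live in the http://amf.openlib.org namespace as in the example file below, the service domain could be extracted like this:

        use XML::LibXML;
        use URI;

        my $doc = XML::LibXML->load_xml(
            location => '/home/ernad/ernad/etc/reports.amf.xml');
        my $xpc = XML::LibXML::XPathContext->new($doc);
        $xpc->registerNs(amf => 'http://amf.openlib.org');

        # the <homepage> adjective of the service noun
        my $homepage = $xpc->findvalue('/amf:amf/amf:collection/amf:homepage');
        my $service_domain = URI->new($homepage)->host;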
    4. An ernad service is composed of several reports. Each report has its own report_id. The report_id is the same as the name of the mailing list used to circulate the report. Thus report_id@list_domain is the mailing list to send the report issue for a report with the report_id report_id.
    5. The reports are represented by AMF collection nouns. They are <haspart> children of the service noun. The first one listed is the "all" report, henceforth referred to as the "allport". It operates differently from the other reports. The editor of the allport is called the general editor. To distinguish the allport from other reports, the other reports are known as subject reports.
    6. The report_id of a report is distinct from the id attribute of the report noun.
    7. Each report noun has a <homepage> adjective that shows the Mailman homepage URL of the list that circulates the report. The last slash-delimited component (bar a trailing slash) of that URL is the report_id of the report. From that Mailman homepage URL, the address of the list is calculated: ernad takes the domain name part of the homepage URL as the domain of the email address, and the report_id as the local part. As an example, if <homepage>http://lists.repec.org/mailman/listinfo/nep-acc</homepage>, email to the report will be sent to nep-acc@lists.repec.org. The report_id is "nep-acc". But the report handle in the wider dataset that uses report data may be, say, "RePEc:nep:nepacc".
    8. All report homepages are supposed to share a common domain name, because there is only one list domain. However, when reading the home/etc/reports.amf.xml file, ernad will assume that the list domain is the domain associated with the first report mentioned in home/etc/reports.amf.xml, i.e. the list domain of the allport.
    9. The issue emails are sent from the account that ernad runs under. The sender's address is ernad@service_domain. The human-readable part of the address carries an ASCII version of the editor's name; the English version of the editor's name, as specified in home/etc/reports.amf.xml, is used. Thus every editor described in home/etc/reports.amf.xml has to have an English version of the name. Thus a typical From: header would be
      From: Bernardo Batiz-Lazo <ernad@nep.repec.org>
      Note that this address is consistent with the example configuration file below. The service domain is used in the From: header, not the list domain. In the Reply-to: header, the real email address of the editor is used. Thus a typical Reply-to: header would be
      Reply-to: Bernardo Batiz-Lazo <batizb@bigbbl.uk>
    10. home/etc/reports.amf.xml also contains report passwords, in the <ernad:password> child element of each collection noun.
    11. To illustrate, here is an example home/etc/reports.amf.xml configuration file
      <amf xmlns="http://amf.openlib.org"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd"
           xmlns:ernad="http://ernad.openlib.org">
        <collection id="RePEc:nep">
          <title>NEP: New Economics Papers</title>
          <accesspoint>ftp://netec.mcc.ac.uk/pub/NetEc/WWW/NEP/amf/nep</accesspoint>
          <homepage>http://nep.repec.org</homepage>
          <haspart>
            <collection id="RePEc:nep:nepall">
              <title>All new papers</title>
              <homepage>http://lists.repec.org/mailman/listinfo/nep-all</homepage>
              <ernad:password></ernad:password>
              <haseditor>
                <person>
                  <name>Bernardo Bátiz-Lazo</name>
                  <name xml:lang="en">Bernardo Batiz-Lazo</name>
                  <homepage></homepage>
                  <email>general_editor@nep.repec.org</email>
                  <ispartof>
                    <organization>
                      <name>South Bank Business School</name>
                      <homepage>http://business.lbsu.ac.uk</homepage>
                    </organization>
                  </ispartof>
                </person>
              </haseditor>
            </collection>
          </haspart>
          <haspart>
            <collection id="RePEc:nep:nepacc">
              <title>Accounting</title>
              <homepage>http://lists.repec.org/mailman/listinfo/nep-acc</homepage>
              <ernad:password></ernad:password>
              <haseditor>
                <person>
                  <name>Bernardo Bátiz-Lazo</name>
                  <name xml:lang="en">Bernardo Batiz-Lazo</name>
                  <homepage></homepage>
                  <email>batizb@bigbbl.uk</email>
                </person>
              </haseditor>
            </collection>
          </haspart>
          <haspart>
            <collection id="RePEc:nep:nepafr">
              <title>Africa</title>
              <homepage>http://lists.repec.org/mailman/listinfo/nep-afr</homepage>
              <ernad:password></ernad:password>
              <haseditor>
                <person>
                  <name>Suzanne McCoskey</name>
                  <name xml:lang="en">Suzanne McCoskey</name>
                  <homepage></homepage>
                  <email>mccoskey@dadoo.net</email>
                </person>
              </haseditor>
            </collection>
          </haspart>
          <!-- further <haspart> report entries go here -->
        </collection>
      </amf>
              
  4. Directory structure
    1. The directory for monitoring new additions is called the "directory to watch". Only well-formed files with names ending in .amf.xml are included in the gathering operation. The directory to watch is defined in home/etc/ernad.conf with the parameter PapersDir.
    2. home/etc has configuration files.
      • home/etc/reports.amf.xml is the configuration file for the reports. It describes the service that ernad implements.
      • home/etc/apache/ is a copy of the "/etc/apache2" directory of Debian, with configurations for ernad.
      • home/etc/bin/ is a link to the script that fires up Apache at the operating-system level, in /etc/init.d/, as well as to the customized apache2ctl script fired up by that operating-system-level script.
      • home/etc/ernad.conf is the general configuration file for the software. The options, with their value types, are
        UniversalPassword (string)
        This is the password that logs in to any report interface.
        GeneralEditor (string)
        This is the report_id of the allport.
        DebugMode (0 or 1)
        If DebugMode is 1, the final report issue email is sent to the editor's personal email address. Otherwise it is sent to the email address of the report.
        DebugInfo (0 or 1)
        If DebugInfo is 1, debug information is shown while working with the report interface.
        Links (string)
        This is the delivery domain. It must not start with the protocol name. That is, "http://d.openlib.org" is wrong; it must be "d.openlib.org".
        PapersDir (string)
        This is the full path of the directory to watch.
        PapersFileRestrict (regular expression)
        When home/perl/make_all.pl searches the directory to watch, any filename that matches this Perl regular expression is excluded from processing.
        PapersIdRestrict (string)
        When make_all.pl or update_futli.pl parses a paper with an id that matches this Perl regular expression, it is excluded.
        GESendMail (0 or 1)
        If GESendMail is 1, the issue of the allport is sent to the general editor.
        ExperienceLength (string)
        This gives the default experience length, as yyyy-mm-dd. Thus 0001-02-23 is one year, two months, and 23 days. Months are counted as 30 days.
        FutliCheckedDir (string)
        This gives the directory for checked futlis. A "futli" is a full-text link.
        RSSDir (string)
        This gives the directory the RSS file is written into.
    3. home/xsl/ contains the following style sheets
      • home/xsl/amf2html.xsl transforms a report in xml form to html.
      • home/xsl/amf2text.xsl transforms a report in xml form to text.
      • home/xsl/rhtml.xsl transforms a single paper node to html.
      • home/xsl/rtext.xsl transforms a single paper node to text.
      • home/xsl/futli.xsl is applied to xml containing a paper's full-text links, when there is more than one of them.
      • home/xsl/no_futli.xsl is applied to xml containing paper's full text links, when there are no futlis at all.
      • home/xsl/single_futli_direction.xsl is applied to xml containing a paper's full-text links, when there is only one futli. If this xsl file does not exist, the n script simply redirects to that futli; the file is therefore optional.
    4. home/var has the report issue files, in the structure as discussed below.
    5. home/style contains the following documents
      ernad_css.xml
      ernad CSS data, wrapped in a <css> XML element.
      login.xsl.xml
      login screen
      issue.xsl.xml
      issues selection screen
      papers.xsl.xml
      papers selection screen
      sorting.xsl.xml
      sorting screen
      final_html_mail.xsl.xml
      final html version as mailed
      final_html_html.xsl.xml
      final html version as seen by the editor
      final_text_text.xsl.xml
      final text version, first stage
      final_text_html.xsl.xml
      HTML version of final text, for editor preview
      All these files accept external text parameters "report" and "password".
    6. In the text generation stage, we have the problem that XSLT cannot break lines. First, final_text_text.xsl.xml generates a text string $text. Then the Perl instruction $text =~ s/!START!(.*?)!END!/FT($1)/ges; is used to split the text into lines, where FT() is a function defined in Final.pm. The result is called the mail string. It is also used for creating the email. To enable a preview of the mail in ernad, the mail string is wrapped into a <pre> element. Then final_text_html.xsl.xml is applied to that LibXML object.
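      The real FT() lives in Final.pm; the following is only an illustrative stand-in, assuming the marker layout !START!digits!contents!END! described under "Roman's text format" below (the outdenting of "digits." labels is omitted for brevity):

        sub FT {
            my ($marked) = @_;
            # the leading digits give the indentation, the rest is the contents
            my ($indent, $contents) = $marked =~ /^(\d+)!(.*)$/s;
            return '' unless defined $contents && length $contents;
            my $pad = ' ' x $indent;
            my @lines;
            my $line = '';
            for my $word (split ' ', $contents) {
                $line = length($line) ? "$line $word" : $word;
                if (length($line) >= 65) {    # fill to at least 65 characters
                    push @lines, $pad . $line;
                    $line = '';
                }
            }
            push @lines, $pad . $line if length $line;
            return join "\n", @lines;
        }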
    7. There are some further exceptions to the rule that the HTML and textual output of ernad is created through XSLT. For example, dates of the form yyyy-mm and yyyy-mm-dd are translated to yyyy–mm and yyyy–mm–dd, respectively.
    8. Ernad generates a couple of so-called memory files, the selectedfile and the weightsfile.
  5. RIF structure
    1. The RIF is the report issue file. It is an AMF XML file that contains the data of a report.
    2. The RIF copies the relevant report information from home/etc/reports.amf.xml, but only those elements that are in the AMF namespace. Thus the password is excluded, because it is in the ernad namespace. In addition, the collection gets an id that is the id of the report within the wider dataset, not the report_id, with a colon and the issue date of the report appended. The information gathered from home/etc/reports.amf.xml is known as the "header" of the RIF. The texts appearing in the report issue form the "body" of the RIF. Texts appearing in the report appear as in the AMF files that are in the document stock; however, the id attribute of the noun is replaced by a ref attribute.
    3. Here is an extract from an example RIF.
        <collection id="RePEc:nep:nepacc:2003-02-04">
          <title>Accounting</title>
          <homepage>http://lists.repec.org/mailman/listinfo/nep-acc</homepage>
          <haseditor>
            <person>
              <name>Bernardo Bátiz-Lazo</name>
              <name xml:lang="en">Bernardo Batiz-Lazo</name>
              <homepage></homepage>
              <email>batizb@bigbbl.uk</email>
            </person>
          </haseditor>
          <!-- here ends the header part of the RIF -->
          <haspart>
            <text ref="RePEc:xfx:dufduf:9801">
              <title>El estudio de la contabilidad en una comarca de Zamora</title>
              <hasauthor>
                <person>
                  <name>Alfonso del Cuesto</name>
                  ...
                </person>
              </hasauthor>
            </text>
          </haspart>
        </collection>
                  
  6. Operational time line for subject report editors
    1. A RIF, at any stage, has a name of the form yyyy-mm-dd_unix, where yyyy is the year of the corresponding all issue, mm is the month of the all issue, dd is the day of the all issue, and unix is the unix time stamp (i.e. the number of seconds since 1970-01-01) of the creation of the file. RIF file names have the .amf.xml extension. RIFs are never modified and almost never deleted. Instead they are copied with time stamp change.
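      A sketch of the naming rule in Perl:

        my $all_date = '2003-02-04';    # date of the corresponding all issue
        my $name = $all_date . '_' . time() . '.amf.xml';
        # e.g. 2003-02-04_1044316800.amf.xml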
    2. There are three exceptions when a RIF is deleted.
      1. If an unsorted RIF is presorted, but a RIF with the same date already exists in the directory where the presorted issues live, the existing RIF is deleted.
      2. If make_all is run, but RIFs for that date already exist, they are deleted.
      3. When the RIFs of the allport are copied to the subject report directories, existing RIFs with the same date are deleted.
    3. On the first screen, the editor sees a list of available report issues to work on. The system offers an issue to work on only if pre-sorting for that issue has finished. That means there are always two buttons for the editor. One leads to the pre-sorted RIF. The other one leads to the non-presorted RIF.
    4. To find out if an issue still needs working on, ernad looks for a file in the home/var/reports/report/sent directory, where report is the report_id.
    5. Thus, for a subject report report, there is a directory report/source. This directory has a "ps" and a "us" subdirectory. If two RIFs can be found with the same report date, though not necessarily the same time stamps, one in each subdirectory, the editor is invited to work on the issue, as long as there is no corresponding RIF in the home/var/reports/report/sent directory. A sketch of this test follows.
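      This is a sketch only; the directory layout is as described above and the function name is hypothetical:

        sub issue_offered {
            my ($home, $report, $date) = @_;
            my $base = "$home/var/reports/$report";
            # time stamps may differ, so match on the report date only
            my @ps   = glob "$base/source/ps/${date}_*";
            my @us   = glob "$base/source/us/${date}_*";
            my @sent = glob "$base/sent/${date}_*";
            # offered only if both versions exist and nothing was sent
            return (@ps && @us && !@sent) ? 1 : 0;
        }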
    6. An editor can delete an issue to work on. When this is done, home/var/reports/report/created/us and home/var/reports/report/created/ps are not deleted. Instead an empty dummy sent RIF is created in home/var/reports/report/sent.
    7. If the editor chooses to work with the pre-sorted version, the RIF is copied, with time stamp change, from home/var/reports/report/source/ps. Otherwise, if the editor chooses to work with the unsorted version, the RIF is copied, with time stamp change, from home/var/reports/report/source/us. Either way, it is copied into home/var/reports/report/created.
    8. By default all papers are unselected. If the editor has selected no papers, but she presses the accept button, she is shown the selection stage again. There, a back button leads to the issue selection stage. At the issue selection stage, the editor can delete the issue.
    9. Once papers have been selected, a RIF is copied with time stamp change into home/var/reports/report/selected. This RIF only contains the selected papers.
    10. Once papers in the report issue have been sorted, the RIF is copied with time stamp change into home/var/reports/report/ordered.
    11. Once the RIF in report/ordered has been created, the issue can be previewed as HTML and as text.
    12. Once the RIF has been previewed, the editor can send the report. Under debugging, the report is sent to the editor's email account. Under full operation, the email is sent to the report mailing address.
  7. Operational time line for allport
    1. When the script home/perl/make_all.pl is launched, the ernad software composes an initial RIF that contains an allport issue. This is described in the section on the gathering operation below. To keep the workflow analogous, the gathering script stores the allport in the same way as a normal subject issue.
    2. When this is finished, the allport editor logs in like a subject editor.
    3. On the first screen, the allport editor sees a list of available report issues to work on.
    4. At the papers selection stage, the allport editor, by default, has every paper selected. Once the allport editor has chosen the papers that are to be included, the all report proceeds to sorting just like any other report does.
    5. Once the RIF in home/var/reports/report/ordered has been created, the issue can be previewed as HTML and as text.
    6. Once the RIF has been previewed, the allport editor can send the allport issue. Under debugging, the report is sent to the editor's email account. Under real operation, the email is sent to the all report mailing address.
    7. When the allport issue has been sent, the general editor logs in to the ernad machine to run home/perl/presorting.pl. Copies of the allport RIF are created, with the appropriate handle changes, from home/var/reports/allport/source/us to the home/var/reports/report/source/us directories for all subject reports. Then presorting.pl does the presorting. It ends by sending an email to every subject editor, saying that the report is ready. The text of the message is hardcoded in home/perl/presorting.pl.
  8. Full-text links gathering
    1. It is inspired by the work of Sune Karlsson.
    2. The full-text link (henceforth: futli) in a text can take two forms. In XPath notation, they are /amf/text[@id]/file/url and /amf/text[@id]/hasversion/text/file/url. These are called "direct" and "indirect" futlis, respectively. Whether a futli is direct or indirect is called the direction of the futli. In RePEc, at this time, there are only direct futlis.
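      A sketch of the extraction for one AMF file, using namespace-qualified versions of those two XPaths:

        use XML::LibXML;

        my $file = shift @ARGV;
        my $xpc  = XML::LibXML::XPathContext->new(
            XML::LibXML->load_xml(location => $file));
        $xpc->registerNs(amf => 'http://amf.openlib.org');

        my %futli;    # id => list of [ url, direction ] pairs
        for my $text ($xpc->findnodes('/amf:amf/amf:text[@id]')) {
            my $id = $text->getAttribute('id');
            for my $u ($xpc->findnodes('amf:file/amf:url', $text)) {
                push @{ $futli{$id} }, [ $u->textContent, 'direct' ];
            }
            for my $u ($xpc->findnodes(
                    'amf:hasversion/amf:text/amf:file/amf:url', $text)) {
                push @{ $futli{$id} }, [ $u->textContent, 'indirect' ];
            }
        }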
    3. The database of all the full-text links known at a time is called the futli base. For each text, it contains a vector of futlis for the text, as well as their direction. It is held as a Perl structure in the file home/var/db/futli_base.dump.
    4. As part of home/perl/make_all.pl, ernad examines files ending in .amf.xml in the directory to watch. It records the values of the futlis, as well as their direction, and stores them in the futli base.
    5. The directory given in the FutliCheckedDir parameter of home/etc/ernad.conf is supposed to contain a list of ids of documents that have working futlis. Only those documents will be included in the date base.
    6. When home/perl/make_all starts, it fires up a script update_checked_futli. Although not strictly necessary, it stores the result in home/var/db/checked_futli.dump as a dumped Perl hash.
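      The exact dump mechanism is not specified here; Storable is one plausible way to write and read such files:

        use Storable qw(nstore retrieve);

        # writing the checked-futli hash (the contents are illustrative)
        my %checked = ('RePEc:xfx:dufduf:9801' => 1);
        nstore \%checked, '/home/ernad/ernad/var/db/checked_futli.dump';

        # reading it back, e.g. in make_all.pl
        my $checked = retrieve '/home/ernad/ernad/var/db/checked_futli.dump';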
  9. Gathering operation
    1. The script to create an all issue is home/perl/make_all.pl. It takes an optional argument of the form yyyy-mm-dd. It checks whether this is a valid date. If it is a valid date, the date is called the search date. If it is not a valid date, the search date is 1970-01-01.
    2. When make_all starts, it reads a Berkeley database of previously seen papers. This is at home/var/db/papers_dates.db. It stores paper handles and the dates at which they were seen. This is known as the "date base".
    3. The gathering operation reads every text record found in the directory to watch that is in a file ending in .amf.xml and changed more recently than the search date.
    4. A text record is an AMF <text> element with an id attribute. make_all looks up the value of the id attribute in the date base. If the value can be found, it turns to the next paper.
    5. Once a text XML is found in the watch directory but the text is not in the date base, an XPath reference is passed to the Perl library home/perl/pre-date_base.pl. It checks the XML contents and returns 1 if the text should be included in the all report, 0 if not. If the answer is yes, the current date is added as the value for the id of the candidate paper in the date base.
    6. As part of home/perl/pre-date_base.pl, home/perl/make_all.pl examines the handles found there and compares them to the list of good handles.
    7. If the paper has been added to the date base, a second script, home/perl/post-date_base.pl, does a similar check. The idea is that the second check is not repeated every time; it is performed only once. It checks the XML contents and returns 1 if the text should be included in the all report, 0 if not.
    8. Both home/perl/pre-date_base.pl and home/perl/post-date_base.pl take two parameters: first, the filename where the AMF data is to be found, and second, the XPath expression of the node to be checked. Normally, this will be '/amf/text[1]'.
    9. Each implementor of ernad is free to build their own home/perl/Ernad/pre-date_base.pm and post-date_base.pm. RePEc's NEP service requires (a sketch follows this list)
      • for home/perl/Ernad/pre-date_base.pm, that the <type> of the text is "preprint" and that the id does not come from "RePEc:rus";
      • for home/perl/Ernad/post-date_base.pm, that the <date> of the text, if there is one, is less than two years before the current date.
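      A sketch of those two checks; the interface is as in item 8, but the usual '/amf/text[1]' node is hardcoded here in namespace-qualified form, and the real modules may differ:

        use XML::LibXML;
        use Time::Piece;

        # fetch the XPath context and the node to check
        sub text_node {
            my ($file) = @_;
            my $xpc = XML::LibXML::XPathContext->new(
                XML::LibXML->load_xml(location => $file));
            $xpc->registerNs(amf => 'http://amf.openlib.org');
            my ($text) = $xpc->findnodes('/amf:amf/amf:text[1]');
            return ($xpc, $text);
        }

        # pre check: <type> must be "preprint", id must not be RePEc:rus
        sub pre_check {
            my ($file) = @_;
            my ($xpc, $text) = text_node($file);
            return 0 unless $text;
            return 0 unless $xpc->findvalue('amf:type', $text) eq 'preprint';
            return 0 if ($text->getAttribute('id') || '') =~ /^RePEc:rus:/;
            return 1;
        }

        # post check: <date>, if present, must be under two years old
        sub post_check {
            my ($file) = @_;
            my ($xpc, $text) = text_node($file);
            return 0 unless $text;
            my $date = $xpc->findvalue('amf:date', $text);
            return 1 unless $date;
            my $now = localtime;    # a Time::Piece object
            my $cutoff = sprintf '%04d-%02d-%02d',
                $now->year - 2, $now->mon, $now->mday;
            return $date ge $cutoff ? 1 : 0;
        }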
    10. The texts that remain at this point are included in the allport issue.
  10. Futli downloads
    1. They are inspired by the work of José Manuel Barrueco Cruz.
    2. futli downloads are likely to take place on a different Apache installation than the ernad installation. The domain of that other Apache installation is the so-called delivery domain. Differences between the report composition domain and the delivery domain are acceptable because the download does not require write operations.
    3. The script to perform the download is home/perl/n.
    4. home/perl/n sends full texts to the users of reports. It uses a file called the futli base file, which contains the futli base. This file is read into memory when Apache is started.
    5. home/perl/n takes the "u" argument for the id of the text to be downloaded. It takes the "r" argument for the report code. It takes an optional "s" argument. If the "s" argument is present, the script redirects to the (encoded) URL that is the value of the "s" argument.
    6. home/perl/n examines whether a futli exists for the text to be downloaded. If there is no futli, an XML document <nofutli>id</nofutli> is formed, where id is the id of the text that the user sought to download. The style sheet home/xsl/no_futli.xsl is applied to it.
    7. If there is one futli, it is of direction direction, and the file home/xsl/single_futli_direction.xsl exists, that style sheet is applied to the XML document <futli><direction>futli_URL</direction></futli>, where futli_URL is the futli-URL element. If the style sheet cannot be found, the home/perl/n script sends a redirect to the futli.
    8. A futli-URL element is of the form <url><target>target</target><transport>transport</transport></url>. Here target is the URL of the futli, and transport is a URL of "n" invoking itself, with an "s" argument and an "r" argument, to redirect the user to the target URL.
    9. If there are several futlis of direction direction, and the file futli.xsl exists, it will be applied to an analogous piece of XML data, where the futli-URLs appear one after the other, but with all URLs of the same direction grouped. Direct futlis appear in the XML document before indirect futlis. If home/xsl/futli.xsl does not exist, the script exits with an error.
    10. Example: a text "m:p" has two futlis, "http://a", a direct futli, and "http://b", an indirect one. Assume that d.openlib.org is the host of the futli downloads. Then the following XML file is formed:
        <futli>
          <direct>
            <url>
              <target>http://a</target>
              <transport>http://d.openlib.org/n?s=http%3A%2F%2Fa</transport>
            </url>
          </direct>
          <indirect>
            <url>
              <target>http://b</target>
              <transport>http://d.openlib.org/n?s=http%3A%2F%2Fb</transport>
            </url>
          </indirect>
        </futli>
  11. Presorting
    1. Presorting uses an external program called svm_light. There are two executables, "svm_learn" and "svm_classify". The location of these is given in the "SVM_lightDir" variable set in home/etc/ernad.conf.
    2. The experience length is the maximum time period for which available information will be considered. The experience length is configured in a variable "ExperienceLength" in home/etc/ernad.conf. It takes the usual date form yyyy-mm-dd, but it is interpreted as a length of time. Thus a value 0001-01-00 requests that reports older than 13 months be excluded from consideration in training and testing.
    3. The experience start is either the earliest allport issue date available or the current date minus the experience length, whichever comes later.
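      A sketch of that computation, with months counted as 30 days per the ernad.conf table; treating a year as 12 such months is an assumption here:

        use Time::Piece;

        my $experience_length = '0001-01-00';          # from ernad.conf
        my ($y, $m, $d) = split /-/, $experience_length;
        my $days  = ($y * 12 + $m) * 30 + $d;
        my $minus = localtime() - $days * 86400;       # now minus the length

        my $earliest = Time::Piece->strptime('2003-02-04', '%Y-%m-%d');
        # the experience start is whichever of the two comes later
        my $start = $minus > $earliest ? $minus : $earliest;
        print $start->ymd, "\n";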
    4. Updates to the training, model, and test files are only done when a new allport issue arrives. They are done as part of home/perl/presorting.pl. Never mind that when an editor has just finished a report issue, the fresh subject intelligence would be helpful to pre-sort the next issue, which she may prepare immediately afterwards.
    5. The svm training file for the allport has lines of the form "yyyy-mm-dd svm_light_data # id\n", where yyyy-mm-dd is the issue date of the paper appearing in the allport, svm_light_data is the data as required by svm_light, and id is the id of the paper. It lives in home/var/reports/allport/svm_light/yyyy-mm-dd.train, where yyyy-mm-dd is the date associated with the allport issue, and allport is the report_id of the allport. The svm training file for the allport is not used to train anything. It is only used as a library to generate the training files for the subject reports. When a new allport issue has been made, this file is renewed with data from allport issues from the experience start to the date of the allport issue.
    6. The svm_light data is approximately created as follows. The abstracts, author names, classifications, keywords, series and titles are used. The values of these fields are concatenated and all punctuation is removed. Then all values are converted to lowercase. This results in a list of features. We count the number of appearances of each feature within a paper as its weight. We normalise every paper to a Euclidean length of one. Then this data is prepared according to the syntax rules of svm_light.
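      A sketch of that preparation; assigning the integer feature ids that svm_light requires through a running counter is one plausible choice:

        my %feature_id;
        my $next_id = 1;

        sub svm_light_data {
            my (@fields) = @_;    # abstract, author names, title, ...
            my $blob = lc join ' ', @fields;
            $blob =~ s/[[:punct:]]+/ /g;          # remove punctuation
            my %weight;
            $weight{$_}++ for split ' ', $blob;   # occurrence counts
            # normalise the paper to Euclidean length one
            my $sum = 0;
            $sum += $_ ** 2 for values %weight;
            my $norm = sqrt $sum or return '';
            $feature_id{$_} ||= $next_id++ for keys %weight;
            # svm_light wants "id:value" pairs with increasing ids
            return join ' ',
                map  { "$_->[0]:$_->[1]" }
                sort { $a->[0] <=> $b->[0] }
                map  { [ $feature_id{$_}, $weight{$_} / $norm ] }
                keys %weight;
        }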
    7. The svm training file for a subject report issue has lines of the form "yn svm_light_data # id\n" where yn takes the value "+1" or "-1", svm_light_data is the data as required by svm_light and id is the id of the paper. It lives in home/var/reports/report/svm_light/yyyy-mm-dd.train, where yyyy-mm-dd is the date associated with the report issue, and report is the report_id of the report. When a new allport training file has been renewed, the svm training files for subject reports are calculated by home/perl/presorting.pl.
    8. The calculation of the svm training file proceeds as follows. The final RIFs for the report that have an issue date between the current date and the experience start are scanned for handles of papers that have been included. Then the svm training file for the allport is scanned. Any paper that has a date before the first issue date of the report is ignored. Any paper that has a date after the issue date of the last issue of the report is ignored. All other papers that have a handle that is found in a final RIF of the report have their yn set to "+1". All others have their yn set to "-1".
    9. Once the calculation of the svm training file is completed, svm_learn is run on home/var/reports/report/svm_light/yyyy-mm-dd.train, and the resulting model file is home/var/reports/report/svm_light/yyyy-mm-dd.model.
    10. The svm model file for a subject report issue lives in home/var/reports/report/svm_light/yyyy-mm-dd.model, where yyyy-mm-dd is the date associated with the report issue, and report is the report_id of the report. When a new svm training file has been created, the svm model files for subject reports are created immediately after. The utility used is svm_learn.
    11. The svm testing file for an allport issue has lines of the form "svm_light_data # id\n", where svm_light_data is the data as required by svm_light and id is the id of the paper. It lives in home/var/reports/allport/svm_light/yyyy-mm-dd.test, where yyyy-mm-dd is the date associated with the allport issue, and allport is the report_id of the allport. The svm testing file contains the lines of the svm training file for the current allport issue, with the date in the line removed. Thus it contains duplicate information that is also available in the allport training file. It is just there for ease of computation.
    12. The svm classification file for a subject report issue lives in home/var/reports/report/svm_light/yyyy-mm-dd.class, where yyyy-mm-dd is the date associated with the report issue, and report is the report_id of the report. When a new svm model file has been created, the svm classification files for subject reports are created immediately after. The utility used is svm_classify. The classification file contains the classification values of all the papers in the svm testing file for the allport, in the order in which they appear in the svm testing file for the allport issue. The handles of the papers can be gleaned from that file.
    13. If the presorting operation does not work, because the test file or the model file does not exist, the classification file cannot be created, or the svm_classify utility cannot be found or exits with an error, the pre-sorted RIF will be the same as the non-pre-sorted RIF.
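      A sketch of the classification step with that fallback; the paths here are illustrative, and svm_classify's command line is "svm_classify test_file model_file output_file":

        use File::Copy qw(copy);

        my $date = '2003-02-04';
        my $svm_classify = '/usr/local/bin/svm_classify';   # from SVM_lightDir
        my $test  = "/home/ernad/ernad/var/reports/nep-all/svm_light/$date.test";
        my $dir   = '/home/ernad/ernad/var/reports/nep-acc/svm_light';
        my $model = "$dir/$date.model";
        my $class = "$dir/$date.class";

        my $ok = -e $test && -e $model && -x $svm_classify
              && system($svm_classify, $test, $model, $class) == 0
              && -e $class;
        unless ($ok) {
            # fall back: the pre-sorted RIF equals the unsorted RIF,
            # copied with time stamp change
            my $src = '/home/ernad/ernad/var/reports/nep-acc/source';
            my ($us_rif) = glob "$src/us/${date}_*.amf.xml";
            copy($us_rif, "$src/ps/${date}_" . time() . '.amf.xml')
                or warn "copy failed: $!";
        }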
  12. Apache
    1. There are (usually) two Apache servers: one is called the main server, running on port 80, and the other is called the hidden server. They use a separate infrastructure that avoids collision.
    2. The exception to this separateness is the logs; they are still in /var/log/apache2.
    3. home/etc/apache/bin links to a script to be placed in /etc/init.d/. It makes sure that a hidden server is fired up.
    4. The main server passes requests arriving at some virtual hosts, called the proxy virtual hosts, on to virtual hosts on the hidden server. The configuration files for the proxy virtual hosts live in home/etc/apache/sites-available. In order to work, the main server points to these files through links from its sites-enabled directory, typically /etc/apache2/sites-enabled. All these configurations do is pass traffic to the hidden server.
    5. The Apache root directory, containing all the configurations for the hidden server, lives in home/etc/apache/.
    6. The configuration files for the virtual hosts of the hidden server also live in home/etc/apache/sites-available, but contrary to the proxy virtual hosts, they are symlinked from home/etc/apache/sites-enabled.
    7. To fire up an Apache 2 server on a Debian machine, /etc/init.d/apache2 is used. This in turn fires up /usr/sbin/apache2ctl. Both have to be customised. The customised version of /etc/init.d/apache2 lives in /etc/init.d/apache2ernad. A symlink from home/etc/apache/bin is provided. The customised version of /usr/sbin/apache2ctl lives in home/bin/apache2ernad.

Appendix: some of the contents of the ernad software

amf/ has a link to the AMF data that ernad is looking at for document data.

doc/ contains ernad documentation in HTML format.

etc/ contains the configuration files.
etc/apache has apache2 configuration files for the web server that is owned by the user ernad.
etc/ernad.conf has the system configuration.
etc/reports.amf.xml has the reports configuration.

html/ contains HTML files that are required by the system.
html/no_issues.html says that there are no issues for the editor to edit.
html/no_more_issues.html says the same thing after the editor has finished working on an issue.

perl/ contains the Perl scripts and modules.
perl/Ernad contains the Perl modules.
perl/Ernad/Checking.pm implements the pre-date_base, post-date_base and check_futli functionality.
perl/Ernad/Common.pm has common functions.
perl/Ernad/Constants.pm contains constants like directory or file names.
perl/Ernad/Dates.pm has functions to work with dates.
perl/Ernad/Final.pm has functions to generate and work with the html/text/send stage.
perl/Ernad/Generate.pm has functions for generating the allport issue.
perl/Ernad/Globals.pm contains functions for the startup. It is called when the main apache starts up.
perl/Ernad/Issue.pm contains functions for the issue selection process.
perl/Ernad/Login.pm contains functions for the login to the system. This is the first module that is activated when a user connects.
perl/Ernad/Main.pl is the main module that determines the current state and calls other modules.
perl/Ernad/Papers.pm contains functions for the paper selection stage.
perl/Ernad/ReportState.pm contains functions for sorting, selecting and deleting papers in issues.
perl/Ernad/Sorting.pm contains functions for sorting papers in issues.
perl/delete_old_reports is a utility that deletes all reports with a certain date, yyyy-mm-dd. The date has to be given as an argument, thus: delete_old_reports.pl 2005-02-28
perl/index.pl is the file that is run when there is a connection to the ernad site.
perl/list_reports.pl is a utility that lists all reports.
perl/make_all.pl is run at the command line by the general editor to make the allport issue.
perl/n.pl is the download script. It is called as n in the browser; therefore we have a soft link n to it.
perl/presorting.pl creates the presorted report.
perl/processreports.pl creates directories that are required by the reports. It has to be run every time a new report is added to reports.amf.xml.
perl/startup_for_download.pl is the mod_perl startup file for the apache server that serves fulltext downloads.
perl/startup_for_main.pl is the mod_perl startup file for the main ernad server, i.e., the one owned by the user ernad.

var/ has the output of the system.
var/db/ has databases.
var/db/futli_base.dump is a Perl dump of the futli base.
var/db/papers_dates.dump is a Perl dump of the date base.
var/db/reports has the data files for all reports.

Changes related to seabro

The Login.pm script creates a nucleus of the form <ernad/>. It is transformed with login.xsl.xml.

The Issues.pm script creates a nucleus of the form

<ernad>
<issues>
<issue>yyyy-mm-dd</issue>
<issue>yyyy-mm-dd</issue>
</issues>
</ernad>

or, if there are no issues

<ernad>
<report>report</report>
<password>password</password>
<issues/>
</ernad>

The parameters "report" and "password" are passed to every XSLT file. The parament "allport" is passed to the papers selection stage xslt file papers.xsl.xml. The "allport" parameter takes the value "yes" if the report is the allport. In that case all checkboxes are checked. The "allport" parameter takes the value "no" if the report is not the allport. In that case no checkboxes are checked.

html/no_issues.html and html/no_more_issues.html will be obsoleted.

Roman's text format

XSLT capable of generating text has to format the text into a special intermediate format known as "Roman's format". It is of the form

!START!digits!contents!END!

Blank lines are to be kept blank. Other lines will have the text contents formatted into lines that are at least 65 characters long, with each line indented by digits characters of whitespace. If the start of contents is of the form "digits.", it is allowed to move into the indented space.
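For illustration, an input line such as

!START!3!1. El estudio de la contabilidad en una comarca de Zamora Alfonso del Cuesto!END!

could plausibly be rendered as follows, with three characters of indentation and the "1." label moved into the indented space:

1. El estudio de la contabilidad en una comarca de Zamora Alfonso
   del Cuesto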
