Kiel paper

Thomas Krichel and Andrei Pottossin

Status

This paper proposes an archiving protocol for documents in an AMF-based digital library.
Thomas Krichel started work on this paper in Novosibirsk, Russia on 2008‒01‒09.
This is the version of 2008‒06‒05.

Aims and limitations

This proposal is based on the AMF metadata framework. It aims to store AMF data and the documents pointed to by displaypage and file/url attributes of text nouns.
This version of the protocol is really only a test to find out what problems arise with a running implementation. It is not supposed to be a final implementation of a preserved digital library.
An item is anything that is stored by a running implementation of the protocol.
There is no consideration here of whether the collection of items is legal or desirable.
This specification is a first try. It does not differentiate between a product as built by an implementation and the software that is used to build it. In other words, this specification depends on software. It is not a specification that can be implemented with any software.
There is no consideration of how items will be made available to users. In particular, there is no concern about whether it is legal to show them to anyone. It is assumed that the storage will be for the benefit of everybody. If stored items are to be made available to the public, and somebody raises an objection, it will be up to the publishing service to think about a way to withdraw them. As far as this protocol is concerned, objects are stored in a dark archive. The fact that someone may get access to it is not a concern of the protocol.
The protocol assumes that there is a lot of disk space available. Disk space is assumed to be quite cheap.
The protocol does not store any metadata. The objective of this version of the protocol is to find out what metadata needs to be collected to build a system that is “usable” in some sense.

Definitions and structure

Items come in three types:
  1. “Records” are AMF records.
  2. “Objects” are resources linked to from records, usually to be found on the web.
  3. “Notes” are metadata that the system may create to document what it has done. Notes are related to records and to objects. In this version of the Kiel protocol there are notes.
Apart from items, the system uses
  1. “Tools”, files that the system itself uses for its work.
  2. configuration
The protocol does not work with database products for the storage of records and objects. The reason is that such products may no longer be available in the very long run. Thus all items are stored as files in directories.
Records, objects and notes are stored in a directory that pertains to a record. This is known as the record directory. For each record, a protocol implementation system will have to come up with a record directory strategy, in short, a redist. Here, we propose a redist for RePEc in the appendix.
Time is measured in the Unix time stamp, henceforth referred to as the tist, in decimal form.
The renewal time is the time when a new effort is made to store a record and the objects pertaining to this record. For each record, a last renewal time exists. It is stored in the record directory in the file "00last". The file records the last attempt to store the record and the objects of the record. It contains a single line with the start tist of that attempt, followed by the end tist, and a trailing newline character.
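As an illustration, the 00last file could be written and read as sketched below. This is Python, not the Perl of the implementation. The single space between the two tists and the function names write_last and read_last are assumptions, since the text does not fix a separator.

```python
import os

LAST_FILE = "00last"  # file name given in the text


def write_last(record_dir, start_tist, end_tist):
    """Write the start and end tist of a renewal attempt as one line."""
    path = os.path.join(record_dir, LAST_FILE)
    with open(path, "w", encoding="ascii") as handle:
        # Assumption: the two decimal tists are separated by one space.
        handle.write(f"{int(start_tist)} {int(end_tist)}\n")


def read_last(record_dir):
    """Return (start_tist, end_tist), or None if no 00last file exists."""
    path = os.path.join(record_dir, LAST_FILE)
    if not os.path.exists(path):
        return None
    with open(path, encoding="ascii") as handle:
        start, end = handle.read().split()
    return int(start), int(end)
```

A record directory without a 00last file is treated here as never renewed, which is one possible reading of the virgin record case.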
When a new record appears, it is called a virgin record. Its request tist in the tistbase is set to zero.
Records are stored in the record directory by the tist of creation or renewal. The name of the file is tist.amf.xml, where tist is that tist. If the current version of the record differs from the one that is on file, it is stored and becomes the current version of the record. By different, we mean that the bit structure of the file is different. There is no interpretation of semantic difference. Thus if the new version just has one additional whitespace character, it is stored as a new file.
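The storage rule for records can be sketched as follows. This is an illustrative Python sketch, not the Perl implementation; the function names current_version and store_record are hypothetical, and the newest file is assumed to be the one with the largest tist in its name.

```python
import os
import time

SUFFIX = ".amf.xml"


def current_version(record_dir):
    """Return the path of the file with the largest tist, or None."""
    tists = [
        int(name[: -len(SUFFIX)])
        for name in os.listdir(record_dir)
        if name.endswith(SUFFIX) and name[: -len(SUFFIX)].isdigit()
    ]
    if not tists:
        return None
    return os.path.join(record_dir, f"{max(tists)}{SUFFIX}")


def store_record(record_dir, data, tist=None):
    """Store data (bytes) unless it equals the current version bit for bit.

    Returns the path of the new file, or None if nothing was written.
    """
    tist = int(time.time()) if tist is None else int(tist)
    current = current_version(record_dir)
    if current is not None:
        with open(current, "rb") as handle:
            if handle.read() == data:
                return None  # identical bytes: keep the version on file
    path = os.path.join(record_dir, f"{tist}{SUFFIX}")
    with open(path, "wb") as handle:
        handle.write(data)
    return path
```

Because the comparison is on raw bytes, a version that differs only by one trailing space is stored as a new file, as the text requires.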
Object renewals only concern the current version of the record. Objects referred to in past versions are ignored.
A branch is an XPath expression from the record to the object that is of interest. All possible branches are given in the installation configuration.
A piece of AMF may contain several records. The record noun is the AMF noun that contains the identified record. An AMF data file in a record directory may contain only one AMF record.
The configuration of interests is the same for all records. It is a collection of XPath expressions. The expressions are stored in a single file, home/etc/branches.xml. The root element is <branches>. It contains child elements <branch>. Each branch has an xpath attribute whose value is a string. Since the preservation only applies to /amf/text nouns, the branch does not contain /amf/text. It is added by the software. A check_branches script checks whether the branches are syntactically correct XPath expressions.
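A minimal sketch of reading the branches file and adding the /amf/text prefix is given below, in Python and not the Perl of the implementation. The two xpath values in the example, the function name load_branches, and the way the prefix is joined with a single slash are assumptions for illustration. The sketch does not validate XPath syntax, which the text leaves to check_branches.

```python
import xml.etree.ElementTree as ET

PREFIX = "/amf/text"  # added by the software, per the text

# Hypothetical file content; the xpath values are illustrative only.
EXAMPLE = """<branches>
  <branch xpath="file/url"/>
  <branch xpath="displaypage"/>
</branches>"""


def load_branches(xml_text):
    """Return the full XPath expression for every <branch> element."""
    root = ET.fromstring(xml_text)
    if root.tag != "branches":
        raise ValueError("root element must be <branches>")
    full = []
    for branch in root.findall("branch"):
        xpath = branch.get("xpath")
        if not xpath:
            raise ValueError("<branch> without an xpath attribute")
        # Assumption: the stored expression is relative to /amf/text.
        full.append(PREFIX + "/" + xpath.lstrip("/"))
    return full
```

With the example content above, load_branches(EXAMPLE) yields one expression per branch, each starting with /amf/text.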
It is recommended that a user stores different versions of home/etc/branches.xml in home/etc/branches_tist.xml, where tist is the tist of the last moment that version of the branches was used.
The software uses a home directory of the user that runs it. This is referred to as home here.
All executable components are written in Perl and live in a directory home/Perl.
For an object, at any renewal time, the protocol is interested in its status, type and body. The status is the three-digit code available from HTTP. The protocol escapes the URL of the object and creates a directory named after it. This is the object directory associated with the object. In that object directory, there are files named tist_status_type. The tist is the time of the request, that is, the time the request started. The status is the HTTP response status at the time of the request. The type is the MIME type of the object, with / replaced by -. The file contains the body if there is one, and is empty otherwise.
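The naming scheme for object directories and object files might look as follows. This is a Python sketch under stated assumptions, not the Perl implementation: the text does not say which escaping is used, so percent-encoding of every reserved character is assumed, parameters such as charset are assumed to be dropped from the type, the object directory is assumed to sit inside the record directory, and all function names are hypothetical.

```python
import os
from urllib.parse import quote


def object_dir_name(url):
    """Escape a URL so that it can serve as a directory name."""
    # Assumption: percent-encode everything, including ':' and '/'.
    return quote(url, safe="")


def object_file_name(tist, status, mime_type):
    """Build the tist_status_type file name described in the text."""
    # Assumption: drop parameters such as '; charset=utf-8'.
    bare_type = mime_type.split(";")[0].strip()
    return f"{int(tist)}_{int(status):03d}_{bare_type.replace('/', '-')}"


def store_object(record_dir, url, tist, status, mime_type, body=b""):
    """Write the body (possibly empty) into the object directory."""
    directory = os.path.join(record_dir, object_dir_name(url))
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, object_file_name(tist, status, mime_type))
    with open(path, "wb") as handle:
        handle.write(body)
    return path
```

For example, a request started at tist 1212624000 that returns status 200 and type application/pdf gives the file name 1212624000_200_application-pdf.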
If an object is redirected, a redirection record is written in the object directory, and a new record directory is created for the URL the system redirects to.
Every time there is an effort to renew an object, the type, code and tist of the object are stored in the tist base. If the body exists, and if it has not changed from the body of a previous version, the object file with the new tist is deleted and a symbolic link is created instead, to conserve disk space.
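The replacement of an unchanged body by a symbolic link could be sketched as below, in Python and not the Perl of the implementation. The function name deduplicate is hypothetical, the link is assumed to point at the earlier file in the same object directory, and bodies are compared byte for byte in memory, which is only reasonable for small files.

```python
import os


def deduplicate(object_dir, new_name):
    """Replace the file new_name by a symlink if an earlier file in the
    same directory has the same body. Returns the link target or None."""
    new_path = os.path.join(object_dir, new_name)
    with open(new_path, "rb") as handle:
        new_body = handle.read()
    if not new_body:
        return None  # no body, nothing to link
    for name in sorted(os.listdir(object_dir)):
        old_path = os.path.join(object_dir, name)
        # Skip the new file itself, existing links, and subdirectories.
        if name == new_name or os.path.islink(old_path):
            continue
        if not os.path.isfile(old_path):
            continue
        with open(old_path, "rb") as handle:
            if handle.read() == new_body:
                os.remove(new_path)
                # Relative link, so the directory can be moved as a whole.
                os.symlink(name, new_path)
                return name
    return None
```

For large objects, comparing a stored digest of each body instead of the bodies themselves would avoid rereading old files; that variant is not part of the protocol text.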
There is a front machine, and two back machines.
time it was successfully renewed last time, time it was renewed and changed, time
The termination strategy is set in a configuration file. The control age of an object is the distance between the current time and the time of the most recent instantiation of the object on file. The full age of an object is the distance between the current time and the time the oldest instantiation of the object appeared in the store. The termination strategy can only depend on a limited number of variables. The termination strategy has to be adjusted by hand; it is not dependent on the available disk space. In principle, if the disk space in the store increases, we would like to be more lenient with the termination strategy. Maybe not.
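The two ages can be computed from the tists of the instantiations of an object on file. The sketch below is illustrative Python, not the Perl implementation; the function name ages and its arguments are hypothetical.

```python
def ages(instance_tists, now):
    """Return (control_age, full_age) in seconds.

    instance_tists: tists of all instantiations of the object on file.
    now: the current time as a tist.
    """
    if not instance_tists:
        raise ValueError("object has no instantiation on file")
    control_age = now - max(instance_tists)  # since the most recent one
    full_age = now - min(instance_tists)     # since the oldest one
    return control_age, full_age
```

For instantiations at tists 100, 250 and 400 and a current time of 1000, the control age is 600 and the full age is 900.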

to do

A redist for RePEc

see morton protocol
