Kiel paper
Thomas Krichel and Andrei Pottossin
Status
This paper proposes an archiving protocol for
documents in an AMF-based digital library.
Thomas Krichel
started work on this paper in Novosibirsk, Russia on
2008‒01‒09.
This is the version of 2008‒06‒05.
Aims and limitations
This proposal is based on the
AMF
metadata framework. It aims to store AMF
data and the documents pointed to by
displaypage
and
file/url attributes of text nouns.
This version of the protocol is only a test to find out what
problems arise in a running implementation. It is
not supposed to be a final implementation of a preserved digital
library.
An item is anything that is stored by a running implementation of
the protocol.
This proposal does not consider whether the collection of items is legal
or desirable.
This specification is a first try. It does not differentiate
between a product as built by an implementation, and the
software that is used to build it. In other words, this
specification depends on software. It is not a specification
that can be implemented with any software.
This proposal does not consider how items will be made available to
users. In particular, it is not concerned with whether it is legal to show
them to anyone. It is assumed that the storage will be for the benefit of
everybody. If stored items are to be made available to the public,
and somebody raises an objection, it will be up to the
publishing service to think about a way to withdraw them. As far as
this protocol is concerned, objects are stored in a dark
archive. The fact that someone may get access to them is not a concern
of the protocol.
The protocol assumes that a lot of
disk space is available. Disk space is assumed to be quite
cheap.
The protocol does not store any metadata. The objective
of this version of the protocol is to find out what metadata
needs to be collected to build a system that is “usable” in
some sense.
Definitions and structure
Items come in three types:
-
“Records” are AMF records.
-
“Objects” are resources linked to from records,
usually to be found on the web.
-
“Notes” are metadata that the system may
create to document what it has done. Notes
are related to records and to objects. In
this version of the Kiel protocol there
are notes.
Apart from items, the system uses
-
“Tools”, files that the system itself uses
for its work.
-
configuration
The protocol does not work with database products for the
storage of records and objects. The reason is that such
products may no longer be available in the very long run. Thus all
items are stored as files in directories.
Records, objects and notes are stored in a directory that
pertains to a record. This is known as the record directory. For
each record, a protocol implementation system will have to come
up with a record directory strategy, in short, a redist. Here,
we propose a redist for RePEc in the appendix.
Time is measured as a Unix time stamp in decimal form, henceforth
referred to as the tist.
The renewal time is the time when a new effort is made to store
a record and the objects pertaining to this record. For each record,
a last renewal time exists. It is stored in the record directory
in the file "00last". The file describes the last
attempt to store the record and the objects of the record. It
contains a single line with a trailing newline
character. The line contains the start tist, followed by the end
tist.
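As an illustration, a minimal Python sketch of writing and reading the "00last" file follows. The text does not say how the two tists are separated on the line, so the single space used here is an assumption, and the function names write_last and read_last are hypothetical.

```python
from pathlib import Path

LAST_FILE = "00last"  # file name given in the text


def write_last(record_dir: Path, start_tist: int, end_tist: int) -> None:
    """Write the start and end tist on one line with a trailing newline."""
    # Assumption: a single space separates the two tists.
    line = f"{start_tist} {end_tist}\n"
    (record_dir / LAST_FILE).write_text(line, encoding="ascii")


def read_last(record_dir: Path) -> tuple[int, int]:
    """Return (start_tist, end_tist) as stored in 00last."""
    start, end = (record_dir / LAST_FILE).read_text(encoding="ascii").split()
    return int(start), int(end)
```

Keeping the file to one short decimal line means it stays readable without any special software, in line with the decision to avoid database products.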
When a new record appears, it is called a virgin
record. Its request tist in the tistbase is set to zero.
Records are stored in the record directory by the tist of
creation or renewal. The name of the file is
tist.amf.xml, where tist is that tist. If the current version of
the record differs from the one that is on file, it is stored
and becomes the current version of the record. By different, we
mean that the bit structure of the file is different. There is no
interpretation of semantic difference. Thus if the new version
just has one additional whitespace character, it is stored as a new file.
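The bitwise comparison can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and it assumes that the version on file with the highest tist is the current one.

```python
from pathlib import Path
from typing import Optional


def current_version(record_dir: Path) -> Optional[Path]:
    """Return the stored record file with the highest tist, if any."""
    files = list(record_dir.glob("*.amf.xml"))
    # Assumption: every matching file is named <tist>.amf.xml.
    return max(files, key=lambda p: int(p.name.split(".")[0]), default=None)


def store_record(record_dir: Path, data: bytes, tist: int) -> bool:
    """Store data as <tist>.amf.xml unless it equals the current version."""
    current = current_version(record_dir)
    if current is not None and current.read_bytes() == data:
        return False  # identical bytes, nothing new to store
    (record_dir / f"{tist}.amf.xml").write_bytes(data)
    return True
```

Because the comparison is on raw bytes, a version that differs by a single whitespace character is stored as a new file, as described above.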
Object renewals only concern the current version of the
record. Objects referred to in past versions are ignored.
A branch is an XPath expression from the record to the object
that is of interest. All possible branches are given in the
installation configuration.
A piece of AMF may contain several records. The record noun is
the AMF noun that contains the identified record. An AMF data
file in a record directory may contain only one AMF record.
The configuration of interests is the same for all
records. It is a collection of XPath expressions.
The expressions are stored in a single file
home/etc/branches.xml. The root element is
<branches>. It contains child elements
<branch>. Each <branch> has an
xpath attribute, a string that holds the XPath expression.
Since the preservation only applies to
/amf/text nouns, the expression
does not contain /amf/text. It is
added by the software. A check_branches script
checks whether the branches are syntactically
correct XPath expressions.
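A minimal Python sketch of reading such a file and adding the /amf/text prefix is given below. The function name load_branches is hypothetical, and joining the prefix and the stored expression with a single "/" is an assumption. The sketch does not check XPath syntax; that is the job of check_branches.

```python
import xml.etree.ElementTree as ET

PREFIX = "/amf/text"  # added by the software, per the text


def load_branches(xml_text: str) -> list[str]:
    """Return the full XPath expression for every <branch> element."""
    root = ET.fromstring(xml_text)
    if root.tag != "branches":
        raise ValueError("root element must be <branches>")
    full = []
    for branch in root.findall("branch"):
        xpath = branch.get("xpath")
        if not xpath:
            raise ValueError("<branch> without an xpath attribute")
        # Assumption: the stored expression is relative and joined with "/".
        full.append(PREFIX + "/" + xpath.lstrip("/"))
    return full
```

For example, a file with the two hypothetical branches displaypage and file/url would yield /amf/text/displaypage and /amf/text/file/url.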
It is recommended that a user store different versions of
home/etc/branches.xml in
home/etc/branches_tist.xml
where tist is the tist of the last
moment that version of the branches was used.
The software uses the home directory of the user
that runs it. This is referred to as home
here.
All executable components are written in Perl and
live in a directory home/Perl.
For an object, at any renewal time, the
protocol is interested in its status,
type and body. The status is a three-digit
code taken from HTTP.
The protocol escapes the URL of the object
and creates a directory with that name. This is the object
directory associated with the object.
In that object directory, there are files named
tist_status_type.
The type is the MIME type of the object,
with / replaced by a -.
The status is the HTTP response status at the
time of the request. The tist is the time
the request started.
The file will contain the body if
there is one, or be empty otherwise.
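The naming scheme can be sketched in Python as follows. The text does not say how the URL is escaped, so percent-encoding of every reserved character is an assumption here, and both function names are hypothetical.

```python
from urllib.parse import quote


def object_dir_name(url: str) -> str:
    """Escape a URL so that it can serve as a single directory name."""
    # Assumption: "escaping" means percent-encoding, including "/" and ":".
    return quote(url, safe="")


def object_file_name(tist: int, status: int, mime_type: str) -> str:
    """Build the tist_status_type file name, with "/" in the type replaced by "-"."""
    return f"{tist}_{status}_{mime_type.replace('/', '-')}"
```

Under these assumptions, a PDF fetched with status 200 at tist 1212624000 would be stored in a file named 1212624000_200_application-pdf.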
If an object is redirected, a redirection record
is written in the object directory, and a new
record directory is created for the URL the system
redirects to.
Every time there is an effort to renew an object, the type, code and
tist of the object are stored in the tistbase. If the body exists,
and if it has not changed from the body of a previous version,
the object file with the new tist is deleted. In its place, a symbolic
link to the previous version is created to conserve disk space.
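A minimal Python sketch of this deduplication step follows. It assumes a POSIX file system, that both files live in the same object directory, and that the caller has already picked the previous version to compare against. The function name dedupe is hypothetical.

```python
import os
from pathlib import Path


def dedupe(previous: Path, new: Path) -> bool:
    """Replace new by a symbolic link to previous if the bodies are identical."""
    if previous.read_bytes() != new.read_bytes():
        return False  # body changed, keep the new file
    new.unlink()
    # Relative target, so the link survives moving the object directory.
    os.symlink(previous.name, new)
    return True
```

The link keeps the new file name, so the tist, status and type of the renewal stay visible in the directory even though the body is stored only once.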
There is a front machine and two back machines.
time it was last successfully renewed
time it was renewed and changed
time
The termination strategy
is set in a configuration file.
The control age of an object is the distance
between the current time and the time of
the most recent instantiation of the object
on file.
The full age of an object is the distance
between the current time and the
time the oldest instantiation of the
object appeared in the store.
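Reading the two definitions above as a control age (measured from the most recent instance on file) and a full age (measured from the oldest instance in the store), both can be computed from the tists of the stored instances. The Python sketch below is an illustration only, and its function names are hypothetical.

```python
def control_age(tists: list[int], now: int) -> int:
    """Seconds between now and the most recent instance on file."""
    return now - max(tists)


def full_age(tists: list[int], now: int) -> int:
    """Seconds between now and the oldest instance in the store."""
    return now - min(tists)
```

Both values are plain differences of tists, so a termination strategy built on them needs no information beyond the file names in the object directory.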
The termination strategy can only depend on
a limited number of variables.
The termination strategy has to be adjusted
by hand. It does not depend on the available
disk space.
In principle, if the disk space in the store
increases, we would like to
be more lenient with the termination strategy.
Maybe not.
to do
- status codes for ftp
- say what happens to objects when the record changes
- think of age and termination strategy
A redist for RePEc
see morton protocol